covsonar 2 runtime (and memory usage?)

matthuska commented 1 year ago

It would be great if covsonar 2 were faster than covsonar 1, but we don't expect that to be the case because covsonar 2 is much more flexible than covsonar 1. Nevertheless, covsonar 2 has to be fast enough to be useful for us.

The following commands have to run in a reasonable amount of time* (and with a reasonable amount of memory?):

[ ] extract all metadata and mutation profiles from a large database (~16M sequences in GISAID global 2023-08-31)
[ ] add a large number of new sequences and metadata to a new database
[ ] add a large number of sequences and metadata to a database, most of which are already present in the database
[ ] extract (or count) sequences that match a given genomic profile with a set of mutations
[ ] extract (or count) sequences that match a given lineage and all sublineages
[ ] delete a small number of sequences from a large database

* where "reasonable" is defined as < 1.5x the runtime of covsonar 1, or a fixed amount of time that is deemed reasonable
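
The 1.5x criterion could be checked with a small helper along these lines. This is only a sketch: the names `timed` and `is_reasonable` are hypothetical and not part of covsonar, and how each command is actually invoked is left to the caller.

```python
import time
from typing import Callable


def timed(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def is_reasonable(new_seconds: float, baseline_seconds: float, factor: float = 1.5) -> bool:
    """True if the new runtime is strictly below factor times the baseline runtime."""
    return new_seconds < factor * baseline_seconds
```

For example, a covsonar 1 run of 10 s would allow a covsonar 2 run of anything under 15 s. A single measurement is noisy, so repeating each run a few times and comparing medians is probably the safer choice.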

matthuska commented 1 year ago

In case it's useful in the future, I profiled the addition of 10 sequences to the current covsonar2 version using pyinstrument. Nothing to do here; I just wanted to keep it somewhere in case we need to optimize this process at some point. It looks like alignment takes ~26 seconds out of 42 seconds total, with the remaining time split roughly equally between parse_cigar and lift_vars:

Program: sonar import --threads 1 --db output/covsonar2.db --fasta seqs-10.fasta --no-progress

41.884 <module>  sonar:2
├─ 40.941 main  covsonar/sonar.py:1100
│  └─ 40.934 execute_commands  covsonar/sonar.py:1058
│     └─ 40.929 handle_import  covsonar/sonar.py:718
│        └─ 40.929 import_data  covsonar/utils.py:549
│           └─ 40.914 _import_fasta  covsonar/utils.py:748
│              └─ 40.693 sonarAligner.process_cached_sample  covsonar/align.py:260
│                 ├─ 25.939 sonarAligner.align  covsonar/align.py:56
│                 │  └─ 25.872 sg_trace_striped_32  parasail/bindings_v2.py:3429
│                 ├─ 7.267 <listcomp>  covsonar/align.py:303
│                 │  └─ 7.265 sonarAligner.lift_vars  covsonar/align.py:403
│                 │     └─ 7.119 sonarAligner.update_nuc_positions  covsonar/align.py:343
│                 │        ├─ 4.205 Series.between  pandas/core/series.py:5411
│                 │        │     [14 frames hidden]  pandas
│                 │        ├─ 2.400 _LocIndexer.__setitem__  pandas/core/indexing.py:831
│                 │        │     [10 frames hidden]  pandas
│                 │        └─ 0.438 DataFrame.__getitem__  pandas/core/frame.py:3713
│                 ├─ 6.027 sonarAligner.parse_cigar  covsonar/align.py:83
│                 │  └─ 6.013 handle_deletion  covsonar/align.py:176
│                 │     └─ 6.013 is_frameshift_del  covsonar/align.py:119
│                 │        └─ 5.835 DataFrame.groupby  pandas/core/frame.py:8130
│                 │              [29 frames hidden]  pandas
│                 └─ 1.442 Result.__del__  parasail/bindings_v2.py:273
└─ 0.911 <module>  covsonar/sonar.py:5
   └─ 0.497 <module>  covsonar/cache.py:5
      └─ 0.441 <module>  pandas/__init__.py:1
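
To make the breakdown easier to compare across runs, the three main steps can be expressed as shares of the total runtime. The timings below are copied from the profile above; the helper name `shares` is just for illustration.

```python
# Timings in seconds, copied from the pyinstrument output above.
TOTAL = 41.884
STEPS = {
    "sonarAligner.align": 25.939,
    "sonarAligner.lift_vars": 7.265,
    "sonarAligner.parse_cigar": 6.027,
}


def shares(steps: dict, total: float) -> dict:
    """Return each step's share of the total runtime, in percent (one decimal)."""
    return {name: round(100 * seconds / total, 1) for name, seconds in steps.items()}


PROFILE_SHARES = shares(STEPS, TOTAL)
for name, pct in PROFILE_SHARES.items():
    print(f"{name}: {pct}%")
```

Alignment comes out at roughly 62% of the runtime, with lift_vars at about 17% and parse_cigar at about 14%. Most of the time in the latter two is spent inside pandas calls (Series.between, _LocIndexer.__setitem__, DataFrame.groupby).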

matthuska commented 1 year ago

Closed because we do not plan to continue covsonar 2 development.

In summary, the performance was much worse than that of covsonar 1. Some work was put into improving the situation (see #110), but it was abandoned in favor of a different solution using PostgreSQL.

rki-mf1 / covsonar

covsonar 2 runtime (and memory usage?) #98