matthuska closed 1 year ago
In case it's useful in the future, I profiled the addition of 10 sequences to the current covsonar2 version using pyinstrument. Nothing to do here; I just wanted to keep it somewhere in case we need to optimize this process at some point. It looks like alignment takes ~26 seconds out of 42 seconds total, with most of the remaining time split roughly equally between lift_vars (~7 seconds) and parse_cigar (~6 seconds):
Program: sonar import --threads 1 --db output/covsonar2.db --fasta seqs-10.fasta --no-progress
41.884 <module>  sonar:2
├─ 40.941 main  covsonar/sonar.py:1100
│  └─ 40.934 execute_commands  covsonar/sonar.py:1058
│     └─ 40.929 handle_import  covsonar/sonar.py:718
│        └─ 40.929 import_data  covsonar/utils.py:549
│           └─ 40.914 _import_fasta  covsonar/utils.py:748
│              └─ 40.693 sonarAligner.process_cached_sample  covsonar/align.py:260
│                 ├─ 25.939 sonarAligner.align  covsonar/align.py:56
│                 │  └─ 25.872 sg_trace_striped_32  parasail/bindings_v2.py:3429
│                 ├─ 7.267 <listcomp>  covsonar/align.py:303
│                 │  └─ 7.265 sonarAligner.lift_vars  covsonar/align.py:403
│                 │     └─ 7.119 sonarAligner.update_nuc_positions  covsonar/align.py:343
│                 │        ├─ 4.205 Series.between  pandas/core/series.py:5411
│                 │        │     [14 frames hidden]  pandas
│                 │        ├─ 2.400 _LocIndexer.__setitem__  pandas/core/indexing.py:831
│                 │        │     [10 frames hidden]  pandas
│                 │        └─ 0.438 DataFrame.__getitem__  pandas/core/frame.py:3713
│                 ├─ 6.027 sonarAligner.parse_cigar  covsonar/align.py:83
│                 │  └─ 6.013 handle_deletion  covsonar/align.py:176
│                 │     └─ 6.013 is_frameshift_del  covsonar/align.py:119
│                 │        └─ 5.835 DataFrame.groupby  pandas/core/frame.py:8130
│                 │              [29 frames hidden]  pandas
│                 └─ 1.442 Result.__del__  parasail/bindings_v2.py:273
└─ 0.911 <module>  covsonar/sonar.py:5
   └─ 0.497 <module>  covsonar/cache.py:5
      └─ 0.441 <module>  pandas/__init__.py:1
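To make the breakdown easier to compare against later runs, here is a small sketch that turns the inclusive times from the trace above into shares of the total. The numbers are copied from the profile; the dictionary labels and the shares helper are just names I made up for illustration. It also sums the frames that pyinstrument attributes to pandas inside lift_vars and parse_cigar, which come to roughly 13 seconds, so that part might be worth a look before touching the parasail alignment (about 62% of the run).

```python
# Inclusive times in seconds, copied from the pyinstrument output above.
total = 41.884
steps = {
    "sonarAligner.align": 25.939,
    "lift_vars (<listcomp>)": 7.267,
    "sonarAligner.parse_cigar": 6.027,
    "Result.__del__": 1.442,
    "module imports": 0.911,
}


def shares(steps, total):
    """Return each step's fraction of the total runtime."""
    return {name: seconds / total for name, seconds in steps.items()}


# Print the steps from most to least expensive.
for name, frac in sorted(shares(steps, total).items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {steps[name]:7.3f} s  {frac:6.1%}")

# Time attributed to pandas frames: Series.between, _LocIndexer.__setitem__,
# DataFrame.__getitem__ (all under lift_vars) and DataFrame.groupby
# (under parse_cigar).
pandas_seconds = 4.205 + 2.400 + 0.438 + 5.835
print(f"pandas frames: {pandas_seconds:.3f} s ({pandas_seconds / total:.1%})")
```

The listed steps do not add up exactly to the total because small frames are left out.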
Closed because we do not plan to continue covsonar 2 development.
In summary, the performance was much worse than that of covsonar 1. Some work was put into improving the situation (see #110), but it was abandoned in favor of a different solution using PostgreSQL.
It would be great if covsonar 2 were faster than covsonar 1, but we don't expect that to be the case because covsonar 2 is much more flexible than covsonar 1. Nevertheless, covsonar 2 has to be fast enough to be useful for us.
The following commands have to run in a reasonable amount of time* (and with a reasonable amount of memory?):
[ ] extract all metadata and mutation profiles from a large database (~16M sequences in GISAID global 2023-08-31)
[ ] add a large number of new sequences and metadata to a new database
[ ] add a large number of sequences and metadata to a database, most of which are already present in the database
[ ] extract (or count) sequences that match a given genomic profile with a set of mutations
[ ] extract (or count) sequences that match a given lineage and all sublineages
[ ] delete a small number of sequences from a large database
* where "reasonable" is defined as less than 1.5x the runtime of covsonar 1, or within a fixed amount of time that is deemed reasonable
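As a rough sketch of how the criterion above could be checked in a benchmark script: the function below passes a command if it runs in less than 1.5x the covsonar 1 runtime, or within a fixed time limit if one has been agreed on. The function name, parameter names and example numbers are hypothetical and not part of covsonar.

```python
from typing import Optional


def is_reasonable(
    runtime_v2: float,
    runtime_v1: Optional[float] = None,
    factor: float = 1.5,
    fixed_limit: Optional[float] = None,
) -> bool:
    """Return True if a covsonar 2 runtime (seconds) meets either criterion.

    runtime_v1  -- covsonar 1 runtime for the same task, if available
    factor      -- allowed slowdown relative to covsonar 1 (1.5x here)
    fixed_limit -- absolute time limit in seconds, if one was agreed on
    """
    if runtime_v1 is None and fixed_limit is None:
        raise ValueError("need a covsonar 1 runtime or a fixed limit")
    # Criterion 1: strictly less than factor times the covsonar 1 runtime.
    within_factor = runtime_v1 is not None and runtime_v2 < factor * runtime_v1
    # Criterion 2: finishes within the agreed fixed limit.
    within_limit = fixed_limit is not None and runtime_v2 <= fixed_limit
    return within_factor or within_limit


# Made-up runtimes in seconds, only to show the call.
print(is_reasonable(14.0, runtime_v1=10.0))
print(is_reasonable(100.0, runtime_v1=10.0, fixed_limit=120.0))
```

Each item in the checklist would get its own measured pair of runtimes (or its own fixed limit), since a single threshold is unlikely to fit both a full export of ~16M sequences and a small delete.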