nebiolabs / domainator

A flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering.

faster sequence file (GenBank) parsing #1

Open seanrjohnson opened 3 months ago

seanrjohnson commented 3 months ago

GenBank file parsing is a major bottleneck for domain_search.py on large databases. The current GenBank parser is a fork of the BioPython GenBank parser, which is pure Python, relies on regexes, and is slow. It would be great to integrate a faster parser, such as the Rust-based gb-io.py: https://github.com/althonos/gb-io.py

A complication is that Domainator internals rely heavily on BioPython SeqRecord objects, which might be hard to interface with or replicate using a faster GenBank parser.
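
One possible way to approach the SeqRecord dependency is a thin adapter layer that turns whatever a faster parser yields into objects exposing the handful of attributes the internals read. The sketch below is only an illustration of that idea, not Domainator code. `LightRecord`, `LightFeature`, `adapt_records`, and `convert_example` are hypothetical names, the raw-record layout (a dict with `name`, `sequence`, and `features` keys) is invented for the example and is not the gb-io.py record layout, and flattening the feature location into `start`/`end`/`strand` is a simplification compared with BioPython's location objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Iterator


@dataclass
class LightFeature:
    """Hypothetical stand-in for the feature fields a caller might read."""
    type: str
    start: int
    end: int
    strand: int
    qualifiers: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class LightRecord:
    """Hypothetical stand-in exposing a few SeqRecord-style attributes."""
    id: str
    seq: str
    features: list[LightFeature] = field(default_factory=list)
    annotations: dict[str, Any] = field(default_factory=dict)

    def __len__(self) -> int:
        # Mirror the common `len(record)` idiom by returning sequence length.
        return len(self.seq)


def adapt_records(
    raw_records: Iterable[Any],
    convert: Callable[[Any], LightRecord],
) -> Iterator[LightRecord]:
    """Lazily convert parser output, one record at a time."""
    for raw in raw_records:
        yield convert(raw)


def convert_example(raw: dict[str, Any]) -> LightRecord:
    """Converter for the invented dict layout used in this sketch.

    A real converter would map the fast parser's own record type instead.
    """
    features = [
        LightFeature(type=kind, start=start, end=end, strand=strand,
                     qualifiers=dict(quals))
        for kind, start, end, strand, quals in raw.get("features", [])
    ]
    return LightRecord(id=raw["name"], seq=raw["sequence"], features=features)


if __name__ == "__main__":
    # Made-up input standing in for the output of a fast parser.
    fake_parser_output = [
        {
            "name": "rec1",
            "sequence": "ATGCATGC",
            "features": [("CDS", 0, 6, 1, {"gene": ["exampleA"]})],
        }
    ]
    for record in adapt_records(fake_parser_output, convert_example):
        print(record.id, len(record), [f.type for f in record.features])
```

Keeping the conversion in a single function would confine parser-specific details to one place, though whether the internals can work with a reduced attribute set like this is an open question.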