nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

`augur merge` is slow to read in metadata #1628

Open tsibley opened 2 months ago

tsibley commented 2 months ago

Based on my comment on the initial `augur merge` PR.

`augur merge` is stupidly slow for tiny datasets, e.g. a couple of seconds. That's due to Augur's own slow startup time, which we have to wait for 2+n times, where n is the number of metadata tables being joined. On large datasets this fixed startup cost shouldn't matter, but on small datasets it feels really dumb.

Cutting out the additional startups by dropping the use of `augur read-file` and `augur write-file` makes it quite quick, as it should be. However, `augur read-file` and `augur write-file` are important for proper and robust handling of newlines and compression formats, and they can't be jettisoned without significant additional work. More to the point, we don't have to do that work (and take on the additional complexity) if we make other improvements.
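To put a rough number on that fixed cost, here is a minimal sketch (not part of Augur). It times how long a command takes to start and exit, then multiplies by the 2+n startups described above. The helper names `time_startup` and `fixed_overhead` are hypothetical, and the baseline command is a bare Python interpreter, so it gives a lower bound and not a measurement of Augur itself.

```python
import subprocess
import sys
import time


def time_startup(command, repeats=3):
    """Return the fastest wall-clock time (seconds) to run `command` to completion."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)


def fixed_overhead(startup_seconds, n_tables):
    """Total startup cost when startup is paid 2 + n times (n = tables joined)."""
    return (2 + n_tables) * startup_seconds


if __name__ == "__main__":
    # Baseline: an interpreter that imports nothing and exits immediately.
    # To measure Augur instead, pass a cheap Augur invocation such as
    # ["augur", "--version"] (assumption: augur is installed and on PATH).
    baseline = time_startup([sys.executable, "-c", "pass"])
    print(f"bare python startup: {baseline:.3f}s")
    for n in (2, 5):
        print(f"n={n}: ~{fixed_overhead(baseline, n):.3f}s of fixed startup overhead")
```

Because the overhead grows linearly with the number of subprocess startups, any reduction in per-process startup time is multiplied by 2+n for every `augur merge` run.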

Improvements we can/should make: