Add label lookup to improve nexusreader speed

SamStudio8 commented 3 years ago

When parsing the TAXLABELS block in NEXUS formatted trees, the parser checks whether each token already exists in the taxon namespace with get_taxon. However, get_taxon requires a linear scan of the taxon namespace, which becomes increasingly expensive as the namespace grows in size. In particular, if reading a tree into an empty namespace, each taxon insertion will incur increasingly poor worst case search performance as the taxon is not in the tree.

This change modifies _parse_taxlabels_statement: before the token parsing loop, in a similar manner to how _lookup_label works, the taxon namespace is enumerated and each label (with case sensitivity considered) is added to a set.

As each token is parsed, it is checked in constant time against the set, rather than requiring a linear scan of the taxon namespace with get_taxon. If the token is not in the set, a new taxon is added as before with new_taxon. If the token is in the set, it can be safely fetched with get_taxon.

This change significantly improves performance on large NEXUS trees. As part of the COG-UK phylogenetics pipeline, this patch reduces the time to load a tree containing 340,000 TAXLABELS from two hours to approximately five minutes.

The change does not degrade the results of the current test suite. Additionally I have provided two new tests and test data to ensure that the new behaviour of _parse_taxlabels_statement is stable when attaching a taxon namespace to a Tree and reading in multiple NEXUS files with case sensitivity set to True and False.

SamStudio8 commented 3 years ago

For additional background, please can I refer you to https://github.com/COG-UK/dipi-group/issues/21#issuecomment-777392171. I hope this is useful!

jeetsukumaran commented 3 years ago

THANK YOU!

Really nice improvement! Also really nice and well-considered extension of the test suite.

All very much appreciated!

SamStudio8 commented 3 years ago

Thanks for taking a look at it so quickly! I don't suppose you have a roadmap for when >=4.5.2 will be pushed to pyPI?

jeetsukumaran commented 3 years ago

I've been meaning to setup a GitHub action to do this automatically on tagged pushes, as per https://packaging.python.org/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/ . But have not got around to it. If there is any urgency, let me know, and I will manually publish the latest stable to pyPI.

jeetsukumaran commented 3 years ago

Well, turns out it was much less work to manually push a release, so there you go!

SamStudio8 commented 3 years ago

Thanks for putting that link on my radar, I should do that on my own repositories too. I'd love for us to be able to use this new code from dendropy directly, so if you're offering to do a manual push to pyPI that would be really appreciated by our pipelines group at COG!

On Fri, Feb 12, 2021 at 10:46 AM Jeet Sukumaran notifications@github.com wrote:

I've been meaning to setup a GitHub action to do this automatically on tagged pushes, as per https://packaging.python.org/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/ . But have not got around to it. If there is any urgency, let me know, and I will manually publish the latest stable to pyPI.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/jeetsukumaran/DendroPy/pull/129#issuecomment-778121117, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIN6ORN2VGG4I5Z3J74E6TS6UBKDANCNFSM4XOV5TBA .

SamStudio8 commented 3 years ago

Well, turns out it was much less work to manually push a release, so there you go!

Thank you so much!

jeetsukumaran / DendroPy

Add label lookup to improve nexusreader speed #129