mhoban / rainbow_bridge

Add taxid of last non-"dropped" taxon to collapsed taxonomy #83

Open mhoban opened 2 months ago

mhoban commented 2 months ago

It would be useful to have the NCBI taxid of the last non-"dropped" taxon added to the collapsed taxonomy table. There was some attempt to do this already, but it turns out to be tricky. We can't just join by name, since names may be repeated across (super)kingdoms and/or for things like subgenera, and we don't necessarily have the higher levels to use as join criteria (unless we want to go row-by-row, which would be wicked slow).

mhoban commented 2 months ago

I suppose we could keep only the levels within the "dumb kids playing catch, etc." hierarchy (domain, kingdom, phylum, class, order, family, genus, species; excluding things like subgenera) and just use domain and/or kingdom as higher-level join criteria in addition to the lowest non-dropped name itself.
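A rough sketch of that join, in Python rather than the pipeline's own code, just to make the idea concrete. Everything here is hypothetical: the record keys (`taxid`, `name`, `rank`, `kingdom`), the `"dropped"` marker value, and the function names are assumptions, and the taxids are made-up placeholders, not real NCBI IDs. It keys the reference table on (kingdom, rank, name) so each row is a single dictionary lookup instead of a row-by-row search.

```python
# Ranks in the standard hierarchy; sub-ranks such as subgenus are left out on purpose.
STANDARD_RANKS = ("domain", "kingdom", "phylum", "class",
                  "order", "family", "genus", "species")

def build_lookup(reference):
    """Index reference taxa as (kingdom, rank, name) -> taxid.

    `reference` is an iterable of dicts with the (assumed) keys
    taxid, name, rank and kingdom. Keys that occur more than once
    are stored as None so an ambiguous name is never silently resolved.
    """
    lookup = {}
    for rec in reference:
        if rec["rank"] not in STANDARD_RANKS:
            continue  # skip subgenus, clade, no rank, ...
        key = (rec["kingdom"], rec["rank"], rec["name"])
        lookup[key] = None if key in lookup else rec["taxid"]
    return lookup

def last_taxid(row, lookup, dropped="dropped"):
    """Return the taxid of the lowest non-dropped level in `row`, or None."""
    for rank in reversed(STANDARD_RANKS):
        name = row.get(rank)
        if name and name != dropped:
            return lookup.get((row.get("kingdom"), rank, name))
    return None

# "Morus" is a genus name used for both mulberries and gannets.
# The taxids below are placeholders for illustration only.
reference = [
    {"taxid": 1001, "name": "Morus", "rank": "genus", "kingdom": "Viridiplantae"},
    {"taxid": 1002, "name": "Morus", "rank": "genus", "kingdom": "Metazoa"},
    {"taxid": 1003, "name": "Morus", "rank": "subgenus", "kingdom": "Metazoa"},
]
lookup = build_lookup(reference)
row = {"kingdom": "Metazoa", "genus": "Morus", "species": "dropped"}
result = last_taxid(row, lookup)
```

The kingdom in the key is what separates the two genera, and the rank filter is what keeps the subgenus record from colliding with either of them.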

mhoban commented 2 months ago

See #80 for the offending code section

mhoban commented 2 months ago

This can be done using the nodes.dmp file from the NCBI taxonomy dump. We'll join in nodes.dmp and filter it down to only the kingdom:species ranks (kingdom through species).
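For illustration, a minimal Python sketch of the nodes.dmp filtering step (not the pipeline's actual implementation). It relies on the documented taxdump layout where fields are separated by `|` with surrounding tabs and the first three columns are tax_id, parent tax_id and rank. The function names, the `KEEP_RANKS` tuple and the sample lines (trimmed to three columns) are assumptions made for the example.

```python
# Ranks to keep, kingdom through species; adjust if domain/superkingdom is wanted too.
KEEP_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

def read_nodes(lines, keep_ranks=KEEP_RANKS):
    """Parse nodes.dmp lines into {taxid: (parent_taxid, rank)}, keeping only `keep_ranks`."""
    keep = set(keep_ranks)
    nodes = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # blank or malformed line
        rank = fields[2]
        if rank in keep:
            nodes[int(fields[0])] = (int(fields[1]), rank)
    return nodes

def resolve(candidates, rank, nodes):
    """Among taxids sharing a name, return the single one at `rank`.

    Returns None when no candidate or more than one candidate matches,
    so ambiguous names still need a higher-level criterion such as kingdom.
    """
    hits = [t for t in candidates if t in nodes and nodes[t][1] == rank]
    return hits[0] if len(hits) == 1 else None

# Sample lines in nodes.dmp style, trimmed to the first three columns.
sample = [
    "9606\t|\t9605\t|\tspecies\t|\n",
    "9605\t|\t207598\t|\tgenus\t|\n",
    "207598\t|\t9604\t|\tsubfamily\t|\n",
]
nodes = read_nodes(sample)
genus_taxid = resolve([9605, 207598], "genus", nodes)
```

With a real dump you would pass an open file handle for nodes.dmp as `lines`. The candidate taxids for a name would come from a names.dmp lookup, which is left out here.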

This requires the get_lineage process to also unzip nodes.dmp, which inspires me to reuse the get_model code for downloading arbitrary stuff from a URL and then add a get_zip process to extract specific files from a zip archive in parallel (or something like this).
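As a rough idea of what the download-then-extract split could look like, here is a standalone Python sketch using only the standard library. It is not the get_model or get_zip code, and the function names `download` and `extract_members` are made up for the example; the real processes would presumably live in the workflow itself.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

def download(url, dest):
    """Fetch an arbitrary URL to `dest` and return the destination path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest

def extract_members(archive, members, outdir):
    """Extract only the named files from a zip archive into `outdir`.

    Raises KeyError if any requested member is absent, so a missing
    nodes.dmp fails loudly rather than producing an empty result.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        available = set(zf.namelist())
        missing = [m for m in members if m not in available]
        if missing:
            raise KeyError(f"not in archive: {missing}")
        return [Path(zf.extract(m, path=outdir)) for m in members]
```

Keeping the two steps separate means the same download step can serve any URL, and each caller only pulls the files it needs out of the archive (e.g. `extract_members(archive, ["nodes.dmp"], "taxdump")`).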