Open mhoban opened 3 months ago
I suppose we could keep only the levels within the "dumb kids playing catch, etc." hierarchy (domain, kingdom, phylum, class, order, family, genus, species; so excluding things like subgenera) and just use domain and/or kingdom as higher-level join criteria in addition to the lowest non-dropped name itself.
See #80 for the offending code section
This can be done using the nodes.dmp file from the NCBI taxonomy dump. We'll join in nodes.dmp and filter it to only the kingdom:species ranks.
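A rough sketch of that join-and-filter step, in Python for illustration only (the pipeline's own scripts may be in another language). It relies on the nodes.dmp layout of `tax_id`, parent `tax_id`, and rank as the first three fields, separated by tab-pipe-tab. The taxids, names, and the helper names `read_nodes` and `ranked_names` are made up for the example.

```python
import io

# Assumption: "kingdom:species" means these seven ranks and nothing in between.
KEEP_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

def read_nodes(handle):
    """Map taxid -> (parent_taxid, rank) from nodes.dmp-formatted lines."""
    nodes = {}
    for line in handle:
        parts = line.split("\t|\t", 3)  # only the first three fields are needed
        if len(parts) < 3:
            continue
        rank = parts[2].rstrip("\t|\n")  # tidy up if the line has only three fields
        nodes[int(parts[0])] = (int(parts[1]), rank)
    return nodes

def ranked_names(names, nodes, keep=KEEP_RANKS):
    """Inner-join (taxid, name) pairs to nodes on taxid, keeping only wanted ranks."""
    return [
        (taxid, nodes[taxid][1], name)
        for taxid, name in names
        if taxid in nodes and nodes[taxid][1] in keep
    ]

# Toy data with invented taxids: a genus and a subgenus that share a name.
SAMPLE_NODES = (
    "1\t|\t1\t|\tno rank\t|\t\t|\n"
    "10\t|\t1\t|\tkingdom\t|\t\t|\n"
    "20\t|\t10\t|\tgenus\t|\t\t|\n"
    "21\t|\t20\t|\tsubgenus\t|\t\t|\n"
    "30\t|\t21\t|\tspecies\t|\t\t|\n"
)
SAMPLE_NAMES = [(10, "Metazoa"), (20, "Morus"), (21, "Morus"), (30, "Morus bassanus")]

nodes = read_nodes(io.StringIO(SAMPLE_NODES))
kept = ranked_names(SAMPLE_NAMES, nodes)
```

After the filter, the subgenus row is gone, so "Morus" maps to a single taxid within this toy kingdom.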
This requires the get_lineage process to also unzip nodes.dmp, which inspires me to reuse the get_model code for downloading arbitrary stuff from a URL, followed by a get_zip process to extract specific files from a zip archive in parallel (or something like this).
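For what it's worth, here is a minimal Python sketch of the download-then-extract idea, not the actual get_model or get_zip code. The function names `download`, `extract_one`, and `extract_members` are hypothetical, and the URL is left as a parameter. Each worker opens its own handle on the archive, since sharing one `ZipFile` object across threads is not safe for reading.

```python
import tempfile
import urllib.request
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def download(url, dest):
    """Fetch an arbitrary URL to a local file and return its path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest

def extract_one(zip_path, member, out_dir):
    """Extract a single named member, using a private ZipFile handle."""
    with zipfile.ZipFile(zip_path) as zf:
        return Path(zf.extract(member, path=out_dir))

def extract_members(zip_path, members, out_dir, workers=4):
    """Extract only the requested members, one thread per member."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda m: extract_one(zip_path, m, out_dir), members))

# Usage with a throwaway local archive, so no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "dump.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for name in ("nodes.dmp", "names.dmp", "readme.txt"):
            zf.writestr(name, "placeholder\n")
    out_dir = Path(tmp) / "out"
    paths = extract_members(archive, ["nodes.dmp", "names.dmp"], out_dir)
    extracted_names = sorted(p.name for p in paths)
    on_disk = sorted(p.name for p in out_dir.iterdir())
```

In the real workflow the parallelism would more likely come from the workflow engine running one task per file, so the thread pool here is only a stand-in.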
It would be useful to have the NCBI taxid of the last non-"dropped" taxon added to the collapsed taxonomy table. There was some attempt to do this already, but it turns out to be tricky. We can't just join by name, since names may be repeated across (super)kingdoms and/or for things like subgenera and we don't necessarily have the higher levels to use as join criteria (unless we want to go row-by-row, which would be wicked slow).
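To make the name-collision problem concrete, here is a small Python sketch with invented taxids. It assumes the collapsed table marks missing levels with the literal string "dropped" and that a kingdom column is always present; both are assumptions about the table layout. Indexing the reference once by (kingdom, name) turns the lookup into a hash join, which avoids rescanning the reference for every row.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Toy reference rows: (taxid, kingdom, name). Taxids are made up.
REFERENCE = [
    (101, "Metazoa", "Morus"),
    (102, "Viridiplantae", "Morus"),
    (103, "Metazoa", "Morus bassanus"),
]

def lowest_non_dropped(row):
    """Return the name at the lowest rank that is present and not 'dropped'."""
    for rank in reversed(RANKS):
        name = row.get(rank, "")
        if name and name.lower() != "dropped":
            return name
    return None

def add_taxids(rows, reference):
    """Attach a taxid to each row by matching on (kingdom, lowest name)."""
    by_key = {(kingdom, name): taxid for taxid, kingdom, name in reference}
    return [
        {**row, "taxid": by_key.get((row.get("kingdom"), lowest_non_dropped(row)))}
        for row in rows
    ]

collapsed = [
    {"kingdom": "Metazoa", "genus": "Morus", "species": "dropped"},
    {"kingdom": "Viridiplantae", "genus": "Morus", "species": "dropped"},
    {"kingdom": "Metazoa", "genus": "Morus", "species": "Morus bassanus"},
]
annotated = add_taxids(collapsed, REFERENCE)

# Joining on the name alone is ambiguous: one name, two candidate taxids.
name_only = {}
for taxid, _kingdom, name in REFERENCE:
    name_only.setdefault(name, []).append(taxid)
```

The composite key only stays unique if same-named levels such as subgenera have already been filtered out of the reference.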