Closed jamesaoverton closed 4 years ago
It looks like sometime between 2018-01-01 and 2019-01-01 the NCBI stopped including misspellings in their `names.dmp` file. The current `ncbitaxon.owl` file includes misspellings, so it must have been using an older version of `taxdmp.zip`.
In order to ensure my new code is right, I'm trying to make it build exactly the same output as the current `ncbitaxon.owl`. The current `ncbitaxon.owl` includes a dozen misspellings of "Homo sapiens" for NCBITaxon:9606. But I cannot find these annotations in the `taxdmp.zip` files going back to 2014-08-01. There are other misspellings in the `taxdmp.zip` files until some point in 2018, when they are all removed.
We don't really need these misspellings in our `ncbitaxon.owl` file, but this is boggling my mind. Where did these "Homo sapiens" misspellings come from? Have we really been using a `taxdmp.zip` file from before 2014 this whole time?
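For anyone who wants to repeat the check described above, here is a minimal sketch of how one might scan a `taxdmp.zip` for names of a given class on one taxon. It assumes the standard `names.dmp` layout (fields separated by `\t|\t`, rows ending in `\t|`, with the name class in the fourth field); the function names `parse_dmp_line` and `names_by_class` are made up for this example and are not part of the repository.

```python
import io
import sys
import zipfile


def parse_dmp_line(line):
    """Split one row of an NCBI .dmp file into its fields."""
    line = line.rstrip("\r\n")
    # Rows end with a trailing "\t|" terminator; drop it before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def names_by_class(taxdmp_path, tax_id, name_class="misspelling"):
    """Return the names of one class recorded for tax_id in names.dmp.

    Assumed field order: tax_id, name_txt, unique name, name class.
    """
    found = []
    with zipfile.ZipFile(taxdmp_path) as archive:
        with archive.open("names.dmp") as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                fields = parse_dmp_line(line)
                if len(fields) < 4:
                    continue  # skip blank or malformed rows
                if fields[0] == tax_id and fields[3] == name_class:
                    found.append(fields[1])
    return found


if __name__ == "__main__":
    # Pass one or more archive paths on the command line.
    for path in sys.argv[1:]:
        print(path, names_by_class(path, "9606"))
```

Running this over several archived dumps would show in which release, if any, the "Homo sapiens" misspellings appear.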
Answering my own question: the current misspellings of "Homo sapiens" are coming from EBI's `taxonomy.dat`. I don't know where they get them from.
With this last commit I have "identical" outputs verified for a few taxa. "Identical" means exactly the same OWL coming out, except for the misspellings (which are not actually in the `taxdmp.zip` and so not in my new output) and without some of the merge assertions (for now).
This version covers the ~40 important OBO taxa from the "revised" sheet of this Google Sheet: https://docs.google.com/spreadsheets/d/16D7l0G-DL1Liv7yYFYVBEgRNpuNepCQCZoQTLYXXorA
This version covers ~700 taxa in the "unique" sheet of that Google Sheet.
I'm planning to do some cleanup tomorrow, but this code has the right output.
@cmungall Please assign somebody to set up a new Jenkins job, build your subsets, etc. based on this code. It would also be great to test it in working systems.
I'm done with my cleanup.
Resolves #30:

- downloads `taxdmp.zip` from NCBI
- `src/ncbitaxon.py` reads `build/taxdmp.zip`
- reads `names.dmp`, `merged.dmp`, and `citations.dmp` into memory
- reads `nodes.dmp` and appends Turtle strings to `src/prologue.ttl`
- converts to `.owl` and `.obo` formats; with 16GB of memory allocated, this takes me about 6 minutes per file
- removes `.travis.yml`, which didn't seem to be doing anything useful

I'm trying to make this exactly the same as the current output using old `taxdmp.zip` files from this archive: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/

The main difference from the current behaviour is that we no longer have the "misspelling" synonyms that were coming from the EBI's `taxonomy.dat`. We are also picking up a few more PubMed IDs, although I'm not sure why. We are no longer asserting merges into deleted nodes; we should probably consider how we handle deleted nodes.

This also replicates problems/limitations of the previous code:

- uses `medline_id` as a PubMed ID (but all `pubmid_id`s are "0" in the source data)

Feedback appreciated.
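To make the approach easier to review, here is a rough sketch of the general shape: load names into memory, then stream `nodes.dmp` and append one Turtle block per node after a prologue. This is not the code in `src/ncbitaxon.py`. The `PROLOGUE` string stands in for `src/prologue.ttl` (its real contents are not shown here), the function names are invented for illustration, and the sketch only emits labels and parents, leaving out synonyms, merges, and citations.

```python
import io
import zipfile

# Stand-in for src/prologue.ttl; the real prologue is assumed to differ.
PROLOGUE = """@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix NCBITaxon: <http://purl.obolibrary.org/obo/NCBITaxon_> .

"""


def read_dmp(archive, name):
    """Yield the fields of each row of a .dmp file inside the zip."""
    with archive.open(name) as raw:
        for line in io.TextIOWrapper(raw, encoding="utf-8"):
            line = line.rstrip("\r\n")
            if line.endswith("\t|"):
                line = line[:-2]  # drop the row terminator
            if line:
                yield line.split("\t|\t")


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def convert(taxdmp_path, out):
    """Write a simplified Turtle rendering of the taxonomy to `out`."""
    with zipfile.ZipFile(taxdmp_path) as archive:
        # Pass 1: keep scientific names in memory, keyed by tax_id.
        labels = {}
        for fields in read_dmp(archive, "names.dmp"):
            if len(fields) >= 4 and fields[3] == "scientific name":
                labels[fields[0]] = fields[1]

        # Pass 2: stream nodes.dmp and append one block per node.
        out.write(PROLOGUE)
        for fields in read_dmp(archive, "nodes.dmp"):
            tax_id, parent_id = fields[0], fields[1]
            out.write(f"NCBITaxon:{tax_id} a owl:Class")
            if tax_id in labels:
                out.write(f' ;\n  rdfs:label "{escape(labels[tax_id])}"')
            if parent_id != tax_id:  # the root node lists itself as parent
                out.write(f" ;\n  rdfs:subClassOf NCBITaxon:{parent_id}")
            out.write(" .\n\n")
```

Holding only the names in memory and streaming the nodes keeps the sketch simple; the actual script also loads `merged.dmp` and `citations.dmp`, as listed above.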