obophenotype / ncbitaxon

Build for NCBITaxon
BSD 3-Clause "New" or "Revised" License

Reimplement in Python #32

Closed. jamesaoverton closed this 4 years ago

jamesaoverton commented 4 years ago

Resolves #30:

I'm trying to make this exactly the same as current output using old taxdmp.zip files from this archive: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/

The main difference from the current behaviour is that we no longer have the "misspelling" synonyms that were coming from the EBI's taxonomy.dat. We are also picking up a few more PubMed IDs, although I'm not sure why. We are no longer asserting merges into deleted nodes -- we should probably consider how we handle deleted nodes.
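To help think about the deleted-node case, here is a minimal sketch (not the code in this PR) that lists merges whose target also appears among the deleted nodes. It assumes the usual taxdump layout, where `merged.dmp` rows are `old_tax_id | new_tax_id` and `delnodes.dmp` rows are a single `tax_id`, with fields separated by tab-pipe-tab. The function names `read_fields` and `merges_into_deleted` are hypothetical.

```python
def read_fields(lines):
    """Yield the fields of each NCBI .dmp row.

    Assumes fields are separated by "\t|\t" and rows end with "\t|".
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]
        if line:
            yield line.split("\t|\t")


def merges_into_deleted(merged_lines, delnodes_lines):
    """Return (old_id, new_id) pairs from merged.dmp whose new_id is in delnodes.dmp."""
    deleted = {fields[0] for fields in read_fields(delnodes_lines)}
    return [
        (fields[0], fields[1])
        for fields in read_fields(merged_lines)
        if len(fields) >= 2 and fields[1] in deleted
    ]
```

Both arguments can be any iterable of text lines, for example open file handles on the extracted `merged.dmp` and `delnodes.dmp`.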

This also replicates problems/limitations of the previous code:

Feedback appreciated.

jamesaoverton commented 4 years ago

It looks like sometime between 2018-01-01 and 2019-01-01 the NCBI stopped including misspellings in their names.dmp file. The current ncbitaxon.owl file includes misspellings, so it must have been using an older version of taxdmp.zip.
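One way to check this across archived dumps is to count the name classes in each `names.dmp`. The following is a rough sketch rather than the code in this PR; it assumes the name class is the fourth column of `names.dmp` and that misspellings use the class `misspelling`. The function names are hypothetical.

```python
import io
import zipfile
from collections import Counter


def parse_dmp_line(line):
    """Split one NCBI .dmp row (fields separated by "\t|\t", row ends with "\t|")."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def count_name_classes(taxdmp_zip):
    """Count names.dmp rows per name class for a taxdmp.zip path or file object."""
    counts = Counter()
    with zipfile.ZipFile(taxdmp_zip) as archive:
        with archive.open("names.dmp") as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                fields = parse_dmp_line(line)
                if len(fields) >= 4:
                    counts[fields[3]] += 1
    return counts
```

Running `count_name_classes(path)["misspelling"]` on a locally downloaded copy of each archived dump should show roughly when the count drops to zero.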

jamesaoverton commented 4 years ago

In order to ensure my new code is right, I'm trying to make it build exactly the same output as the current ncbitaxon.owl. The current ncbitaxon.owl includes a dozen misspellings of "Homo sapiens" for NCBITaxon:9606. But I cannot find these annotations in the taxdmp.zip files going back to 2014-08-01. There are other misspellings in the taxdmp.zip files until some point in 2018, when they are all removed.

We don't really need these misspellings in our ncbitaxon.owl file, but this is boggling my mind. Where did these "Homo sapiens" misspellings come from? Have we really been using a taxdmp.zip file from before 2014 this whole time?

jamesaoverton commented 4 years ago

Answering my own question: The current misspellings of "Homo sapiens" are coming from EBI's taxonomy.dat. I don't know where they get them from.

jamesaoverton commented 4 years ago

With this last commit I have "identical" outputs verified for a few taxa. "Identical" means exactly the same OWL coming out, except for the misspellings (which are not actually in the taxdmp.zip and so not in my new output) and without some of the merge assertions (for now).

jamesaoverton commented 4 years ago

This version covers the ~40 important OBO taxa from the "revised" sheet of this Google Sheet: https://docs.google.com/spreadsheets/d/16D7l0G-DL1Liv7yYFYVBEgRNpuNepCQCZoQTLYXXorA

jamesaoverton commented 4 years ago

This version covers ~700 taxa in the "unique" sheet of that Google Sheet.

jamesaoverton commented 4 years ago

I'm planning to do some cleanup tomorrow, but this code has the right output.

@cmungall Please assign somebody to set up a new Jenkins job, build your subsets, etc. based on this code. It would also be great to test it in working systems.

jamesaoverton commented 4 years ago

I've finished my cleanup.