Reimplement in Python - Githubissues

jamesaoverton commented 4 years ago

Resolves #30:

- fetch taxdmp.zip from NCBI
- run src/ncbitaxon.py
  - reads directly from build/taxdmp.zip
  - reads names.dmp, merged.dmp, and citations.dmp into memory
  - iterates over nodes.dmp and appends Turtle strings to src/prologue.ttl
  - this takes ~30 seconds
- use ROBOT to convert to .owl and .obo formats; with 16GB of memory allocated, this takes me about 6 minutes per file
- remove .travis.yml, which didn't seem to be doing anything useful
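
The read-and-emit step described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/ncbitaxon.py: the function names, the choice of `rdfs:label` and `rdfs:subClassOf`, and the `NCBITaxon:` prefix (assumed to be declared in the Turtle prologue) are all assumptions. It relies only on the NCBI dump layout, where fields are separated by tab-pipe-tab and each row ends with tab-pipe.

```python
import io
import zipfile


def split_row(line):
    # NCBI .dmp rows use "<tab>|<tab>" between fields and end with "<tab>|".
    row = line.rstrip("\n")
    if row.endswith("\t|"):
        row = row[:-2]
    return row.split("\t|\t")


def read_scientific_names(archive):
    # Hypothetical helper: map tax_id -> scientific name, held in memory.
    names = {}
    with archive.open("names.dmp") as raw:
        for line in io.TextIOWrapper(raw, encoding="utf-8"):
            tax_id, name, _unique_name, name_class = split_row(line)
            if name_class == "scientific name":
                names[tax_id] = name
    return names


def nodes_to_turtle(archive, names):
    # Hypothetical helper: stream nodes.dmp and yield one Turtle stanza per node.
    with archive.open("nodes.dmp") as raw:
        for line in io.TextIOWrapper(raw, encoding="utf-8"):
            fields = split_row(line)
            tax_id, parent_id = fields[0], fields[1]
            # Escape backslashes and quotes so the label is a valid Turtle string.
            label = names.get(tax_id, tax_id)
            label = label.replace("\\", "\\\\").replace('"', '\\"')
            parts = [f"NCBITaxon:{tax_id} a owl:Class", f'; rdfs:label "{label}"']
            # The root node lists itself as its parent, so skip that edge.
            if parent_id != tax_id:
                parts.append(f"; rdfs:subClassOf NCBITaxon:{parent_id}")
            yield "\n".join(parts) + " .\n"
```

Reading through `zipfile.ZipFile.open` avoids unpacking the archive to disk, and yielding stanzas one at a time keeps only the name lookup table in memory while nodes.dmp is streamed.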

I'm trying to make this exactly the same as current output using old taxdmp.zip files from this archive: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/

The main difference from the current behaviour is that we no longer have the "misspelling" synonyms that were coming from the EBI's taxonomy.dat. We are also picking up a few more PubMed IDs, although I'm not sure why. We are no longer asserting merges into deleted nodes -- we should probably consider how we handle deleted nodes.

This also replicates problems/limitations of the previous code:

- deleted nodes are dropped, not obsoleted
- a bunch of weird IRIs for ranks and annotation properties
- kinda weird handling of unique names: only use them when there's a conflict
- treating medline_id as a PubMed ID (but all pubmed_id values are "0" in the source data)
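
To make the unique-name behaviour in the list above concrete, here is a minimal sketch. It assumes a "conflict" means that two or more taxa share the same scientific name; the actual rule in the code may differ. The function name and the input shape are hypothetical.

```python
from collections import Counter


def choose_labels(rows):
    """Pick a label per taxon from (tax_id, name, unique_name) rows.

    Assumption: the unique name is used only when the plain name is shared
    by more than one taxon and a unique name is actually present.
    """
    rows = list(rows)
    counts = Counter(name for _tax_id, name, _unique_name in rows)
    labels = {}
    for tax_id, name, unique_name in rows:
        if counts[name] > 1 and unique_name:
            labels[tax_id] = unique_name
        else:
            labels[tax_id] = name
    return labels
```

With this rule, a taxon whose name is unambiguous keeps the plain name even if names.dmp supplies a unique name for it, which is the "only use them when there's a conflict" behaviour.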

Feedback appreciated.

jamesaoverton commented 4 years ago

It looks like sometime between 2018-01-01 and 2019-01-01 the NCBI stopped including misspellings in their names.dmp file. The current ncbitaxon.owl file includes misspellings, so it must have been using an older version of taxdmp.zip.

jamesaoverton commented 4 years ago

In order to ensure my new code is right, I'm trying to make it build exactly the same output as the current ncbitaxon.owl. The current ncbitaxon.owl includes a dozen misspellings of "Homo sapiens" for NCBITaxon:9606. But I cannot find these annotations in the taxdmp.zip files going back to 2014-08-01. There are other misspellings in the taxdmp.zip files until some point in 2018, when they are all removed.

We don't really need these misspellings in our ncbitaxon.owl file, but this is boggling my mind. Where did these "Homo sapiens" misspellings come from? Have we really been using a taxdmp.zip file from before 2014 this whole time?
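
A check like the one described above, looking for "misspelling" entries for a given taxon in a particular taxdmp.zip, could be sketched as follows. The function name is hypothetical, and the sketch assumes the name class appears in the fourth field of names.dmp with the literal value "misspelling".

```python
import io
import zipfile


def names_in_class(zip_file, tax_id, name_class="misspelling"):
    """Return the names.dmp entries for one taxon that have the given name class."""
    found = []
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open("names.dmp") as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                row = line.rstrip("\n")
                # Rows end with "<tab>|"; fields are separated by "<tab>|<tab>".
                if row.endswith("\t|"):
                    row = row[:-2]
                fields = row.split("\t|\t")
                if len(fields) >= 4 and fields[0] == tax_id and fields[3] == name_class:
                    found.append(fields[1])
    return found
```

Calling `names_in_class("build/taxdmp.zip", "9606")` on each archived dump would show which releases, if any, carry misspellings for NCBITaxon:9606.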

jamesaoverton commented 4 years ago

Answering my own question: The current misspellings of "Homo sapiens" are coming from EBI's taxonomy.dat. I don't know where they get them from.

jamesaoverton commented 4 years ago

With this last commit I have "identical" outputs verified for a few taxa. "Identical" means exactly the same OWL coming out, except for the misspellings (which are not actually in the taxdmp.zip and so not in my new output) and without some of the merge assertions (for now).

jamesaoverton commented 4 years ago

This version covers the ~40 important OBO taxa from the "revised" sheet of this Google Sheet: https://docs.google.com/spreadsheets/d/16D7l0G-DL1Liv7yYFYVBEgRNpuNepCQCZoQTLYXXorA

jamesaoverton commented 4 years ago

This version covers ~700 taxa in the "unique" sheet of that Google Sheet.

jamesaoverton commented 4 years ago

I'm planning to do some cleanup tomorrow, but this code has the right output.

@cmungall Please assign somebody to set up a new Jenkins job, build your subsets, etc. based on this code. It would also be great to test it in working systems.

jamesaoverton commented 4 years ago

I've finished my cleanup.

obophenotype / ncbitaxon

Reimplement in Python #32