Update NCBI taxonomy

peterjc commented 5 years ago

Just checked, and the latest NCBI taxonomy dump (September 2019) is essentially unchanged from Jan 2019 for the oomycetes:

$ diff  names-2019-01.dmp names-2019-09.dmp
9,11c9,11
< 2759  |   eukaryotes  |       |   common name |
< 2759  |   eukaryotes  |   eukaryotes <blast2759>  |   blast name  |
< 4762  |   oomycetes   |   oomycetes <blast4762>   |   blast name  |
---
> 2759  |   eukaryotes  |   eukaryotes <blast name> |   blast name  |
> 2759  |   eukaryotes  |   eukaryotes <common name>    |   common name |
> 4762  |   oomycetes   |       |   blast name  |
17c17
< 33634 |   Chromophyta |   Chromophyta <stramenopiles> |   in-part |
---
> 33634 |   Chromophyta |   Chromophyta <eukaryotes>    |   in-part |

Used https://github.com/abaizan/kodoja/blob/master/test/taxonomy/filter_taxonomy.py to generate the filtered names.dmp files.

However, we will want to review this periodically. Is it worth including in the continuous integration tests (with a monthly cron job, say), to flag when there is a relevant change in the taxonomy?

i.e. instead of fetching a fixed version, we could build against the latest taxonomy and check the output from the import commands?
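
A minimal sketch of what such a check might look like, assuming the filtered names.dmp files keep the pipe-separated layout shown in the diff above. The function names `parse_names` and `changed_entries` are hypothetical, not part of thapbi-pict or kodoja, and a real CI job would still need to download and filter the latest dump first.

```python
def parse_names(lines):
    """Parse names.dmp lines into a set of (taxid, name, unique_name, name_class)."""
    entries = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Fields are separated by "|" with surrounding whitespace, and each
        # line ends with a trailing "|", which leaves one empty final field.
        fields = [field.strip() for field in line.split("|")]
        if fields and fields[-1] == "":
            fields = fields[:-1]
        if len(fields) != 4:
            raise ValueError(f"Unexpected names.dmp line: {line!r}")
        taxid, name, unique_name, name_class = fields
        entries.add((int(taxid), name, unique_name, name_class))
    return entries


def changed_entries(old_lines, new_lines):
    """Return (removed, added) entries between two names.dmp versions."""
    old = parse_names(old_lines)
    new = parse_names(new_lines)
    return sorted(old - new), sorted(new - old)
```

A scheduled job could then fail, or just warn, whenever either returned list is non-empty for the taxa of interest.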

peterjc commented 5 years ago

Looks like I was wrong, we do have new entries, including:

2065305 Phytophthora humicola x Phytophthora inundata

And, there has been a change in how hybrids are listed:

$ grep 324745 new_taxdump_2019-01-01/names.dmp 
324745  |   Phytophthora medicaginis x cryptogea    |       |   scientific name |
1324745 |   Vibrio sp. EF1C-CB167   |       |   scientific name |
2324745 |   Ceratopogonidae sp. BBDCN479-10 |       |   scientific name |
$ grep 324745 new_taxdump_2019-09-01/names.dmp 
324745  |   Phytophthora medicaginis x Phytophthora cryptogea   |       |   scientific name |
1324745 |   Vibrio sp. EF1C-CB167   |       |   scientific name |
2324745 |   Ceratopogonidae sp. BBDCN479-10 |       |   scientific name |

The old style worked better with our parser and loading, using genus Phytophthora and species medicaginis x cryptogea. We will probably have to refactor the load-tax code to handle the new style.

Or, stop splitting the text into genus+species (removing the genus from the species field), and leave the genus in the species field (usually redundant)?
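
To make the problem concrete, here is a rough sketch. It is not the actual load-tax code. `split_genus_species` is a hypothetical helper that assumes the split happens on the first space, and `strip_repeated_genus` is just one possible way to map the new hybrid style back onto the old one.

```python
def split_genus_species(name):
    """Naive split: the first word is the genus, the rest is the species."""
    genus, _, species = name.partition(" ")
    return genus, species


def strip_repeated_genus(genus, species):
    """Drop the genus when it is repeated after ' x ' in a hybrid name."""
    return species.replace(f" x {genus} ", " x ")


old_style = "Phytophthora medicaginis x cryptogea"
new_style = "Phytophthora medicaginis x Phytophthora cryptogea"

# Old style splits cleanly into genus and hybrid species.
old_genus, old_species = split_genus_species(old_style)

# New style leaves the repeated genus embedded in the species field.
new_genus, new_species = split_genus_species(new_style)

# Removing the repeated genus recovers the old-style species text.
normalised = strip_repeated_genus(new_genus, new_species)
```

With this approach a hybrid between two different genera would be left untouched, since only an exact repeat of the first genus is removed.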

Also need to think about how this matches the names used in NCBI FASTA files on import.

peterjc commented 5 years ago

We're not currently importing the NCBI format files at species level, but still need to look at handling of hybrids with either naming style...

peterjc commented 5 years ago

Closed via #179, but will want to review this again at some point...

peterjc / thapbi-pict

Update NCBI taxonomy #176