mwang87 / ReDU-MS2-GNPS

User interface to reanalyze and explore all public data in Metabolomics Public Data
https://redu.ucsd.edu/
MIT License

Limit new NCBI Taxon IDs to the authority species. #260

Open oolonek opened 2 years ago

oolonek commented 2 years ago

In the current controlled vocabulary list, multiple species have an NCBI ID that doesn't correspond to the authority species (for NCBI). This leads to unnecessary redundancy.

Example:

Penicillium expansum

The corresponding entry in the ReDU sheet is 1314791|Penicillium expansum.

However, if we check the NCBI taxonomy list, we have:

tax_id   name_txt                         unique_name  name_class
27334    Penicillium expansum Link, 1809  NaN          authority
27334    Penicillium expansum             NaN          scientific name
1208580  Penicillium expansum NRRL 62431  NaN          scientific name
1314791  Penicillium expansum ATCC 24692  NaN          scientific name
1407458  Penicillium expansum T01         NaN          scientific name

In fact, it is Penicillium expansum ATCC 24692 that is listed, not the species-level entry (27334).

Until a better solution is found (see #241 for possible directions to explore), it might be good to at least limit entries to the NCBI ID corresponding to the authority entry for a species. I understand from #241 that the idea is not to use a list of all possible species. However, if it can help in the process, here is a list of NCBI taxa restricted to "authority" entries (both the authority and the simpler scientific name are kept). It totals 629822 entries (instead of 3650056): ncbi_id_authority_list.tsv.gz

Gist of the processing script: https://gist.github.com/oolonek/c89fa6f078a771b5259d45d890bcd724
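
For illustration, the restriction described above could look roughly like the sketch below. This is not the script from the gist; it is a minimal standalone example that runs on the five Penicillium expansum rows shown earlier. The function names (restrict_to_authority, vocabulary_entries) are hypothetical, and it assumes that a tax_id is worth keeping only if it has at least one row with name_class "authority".

```python
# Rows copied from the example table in this issue: (tax_id, name_txt, name_class).
ROWS = [
    (27334, "Penicillium expansum Link, 1809", "authority"),
    (27334, "Penicillium expansum", "scientific name"),
    (1208580, "Penicillium expansum NRRL 62431", "scientific name"),
    (1314791, "Penicillium expansum ATCC 24692", "scientific name"),
    (1407458, "Penicillium expansum T01", "scientific name"),
]

# Name classes kept in the restricted list (authority plus the simpler scientific name).
KEPT_CLASSES = {"authority", "scientific name"}


def restrict_to_authority(rows):
    """Keep rows whose tax_id has an 'authority' entry (hypothetical helper)."""
    ids_with_authority = {
        tax_id for tax_id, _, name_class in rows if name_class == "authority"
    }
    return [
        row
        for row in rows
        if row[0] in ids_with_authority and row[2] in KEPT_CLASSES
    ]


def vocabulary_entries(rows):
    """Format the kept scientific names as 'tax_id|name', as in the ReDU sheet."""
    return [
        f"{tax_id}|{name}"
        for tax_id, name, name_class in restrict_to_authority(rows)
        if name_class == "scientific name"
    ]


filtered = restrict_to_authority(ROWS)
entries = vocabulary_entries(ROWS)
print(entries)  # ['27334|Penicillium expansum']
```

With this filter, the strain-level IDs (1208580, 1314791, 1407458) drop out and only 27334 remains for Penicillium expansum. Whether every species of interest has an authority row in the NCBI names table is not checked here.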