qunfengdong / BLCA


Candidate phylum NC10 missing from taxonomy #9

Closed by koopkaup 5 years ago

koopkaup commented 6 years ago

The NCBI 16S database includes candidate phylum NC10, but after compiling the database for BLCA it is missing from 16SMicrobial.ACC.taxonomy. What could be the reason?

yingeddi2008 commented 6 years ago

Hi Koopkaup,

I've done some digging but couldn't locate the problem you raised. If I understand correctly, you think some sequences went missing from the database after the taxonomy file was formatted. I need your help to pinpoint the issue. As you mentioned, "NC10" does not appear in the 16SMicrobial.ACC.taxonomy file, but I couldn't find anything that should be there and isn't. Could you give me some clues, such as a sequence ID that you saw in the database but that is absent from the processed taxonomy file? I need a concrete example.

The latest NCBI 16S database contains 19,763 sequences, and 20,040 taxonomy entries are matched to those 19,763 sequences. The discrepancy arises because several entries in the 16S database share the same sequence but belong to different species or strains. For example, NR_074334.1, NR_118873.1, and NR_119237.1 have identical sequences but come from different strains, so their taxonomy information is the same:

NR_119237.1 species:Archaeoglobus fulgidus;genus:Archaeoglobus;family:Archaeoglobaceae;order:Archaeoglobales;class:Archaeoglobi;phylum:Euryarchaeota;superkingdom:Archaea;

NR_118873.1 species:Archaeoglobus fulgidus;genus:Archaeoglobus;family:Archaeoglobaceae;order:Archaeoglobales;class:Archaeoglobi;phylum:Euryarchaeota;superkingdom:Archaea;

This is the only discrepancy I can find between the NCBI 16S database and the taxonomy file, and no taxonomy information was omitted.
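For anyone who wants to repeat this check, here is a minimal Python sketch that searches taxonomy lines for a term and groups accessions that share a lineage. It assumes each line has the layout shown above (accession, whitespace, then `rank:name;` pairs); the function names are hypothetical and are not part of BLCA.

```python
from collections import defaultdict

# Two lines copied from the example above, used as sample input.
SAMPLE = [
    "NR_119237.1 species:Archaeoglobus fulgidus;genus:Archaeoglobus;"
    "family:Archaeoglobaceae;order:Archaeoglobales;class:Archaeoglobi;"
    "phylum:Euryarchaeota;superkingdom:Archaea;",
    "NR_118873.1 species:Archaeoglobus fulgidus;genus:Archaeoglobus;"
    "family:Archaeoglobaceae;order:Archaeoglobales;class:Archaeoglobi;"
    "phylum:Euryarchaeota;superkingdom:Archaea;",
]


def parse_taxonomy_line(line):
    """Return (accession, {rank: name}) for one taxonomy line."""
    # Assumption: the accession ends at the first run of whitespace.
    accession, lineage = line.strip().split(None, 1)
    ranks = {}
    for field in lineage.split(";"):
        if field:  # the trailing ';' produces an empty last field
            rank, _, name = field.partition(":")
            ranks[rank] = name
    return accession, ranks


def accessions_mentioning(lines, term):
    """List accessions whose lineage contains term (case-insensitive)."""
    term = term.lower()
    hits = []
    for line in lines:
        if not line.strip():
            continue
        accession, ranks = parse_taxonomy_line(line)
        if any(term in name.lower() for name in ranks.values()):
            hits.append(accession)
    return hits


def group_by_lineage(lines):
    """Map each distinct lineage to the accessions that carry it."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            accession, ranks = parse_taxonomy_line(line)
            groups[tuple(sorted(ranks.items()))].append(accession)
    return dict(groups)
```

To run it on the real file, pass an open file handle instead of `SAMPLE`, for example `accessions_mentioning(open("16SMicrobial.ACC.taxonomy"), "NC10")`.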

Also, when formatting the taxonomy file, I noticed that no taxonomy information could be found for three taxIDs: 415850, 1141877, and 1. I looked them up on the NCBI Taxonomy website. The first two (415850 and 1141877) can't be found because they were merged into another taxon. TaxID 1 is the root, so it is expected that no taxonomy information is found for it. In conclusion, other than these three taxIDs, no taxonomy information was omitted.
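Merged taxIDs can also be resolved offline. The NCBI taxonomy dump ships a `merged.dmp` file whose rows pair an old taxID with the taxID it was merged into, with fields separated by `|`. The sketch below is only an illustration: the helper names are hypothetical, and the sample IDs are made-up placeholders, not the real targets of 415850 or 1141877.

```python
# Made-up rows in merged.dmp layout: old_tax_id <tab>|<tab> new_tax_id <tab>|
SAMPLE_MERGED = [
    "111\t|\t222\t|\n",
    "222\t|\t333\t|\n",
]


def load_merged(lines):
    """Build {old_taxid: new_taxid} from merged.dmp-style lines."""
    merged = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            merged[fields[0]] = fields[1]
    return merged


def resolve(taxid, merged):
    """Follow merge records until reaching a taxID that is not merged."""
    seen = set()
    while taxid in merged and taxid not in seen:
        seen.add(taxid)  # guard against an accidental cycle
        taxid = merged[taxid]
    return taxid
```

With the real file, `load_merged(open("merged.dmp"))` followed by `resolve("415850", merged)` would return the current taxID, which can then be looked up as usual.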

After these two rounds of checking, I couldn't find the phylum NC10 you mentioned. It seems that it was not included in the NCBI 16S database in the first place.

Please let me know if I missed or misunderstood anything. Thanks,

Eddi

koopkaup commented 6 years ago

We are interested in a methanotroph, Candidatus Methylomirabilis oxyfera. In the NCBI 16S database taxonomy names file (names.dmp) there are 17 entries mentioning Methylomirabilis; however, the 16SMicrobial.ACC.taxonomy file contains none.

yingeddi2008 commented 6 years ago

Hi Koopkaup,

The names file (names.dmp) is not specific to the NCBI 16S database; it covers all entries in the NCBI database. The 16S database is only a small subset of the entire NCBI database, so it is normal for some bacteria to appear in names.dmp but not in the NCBI 16S database.
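To see which names.dmp rows mention a given name, a small Python sketch like the following may help. It relies on the standard names.dmp layout (tax_id, name, unique name, name class, separated by `|`); the function name is hypothetical and the taxID `0` in the sample rows is a placeholder, not a real identifier.

```python
# Sample rows in names.dmp layout; the taxID "0" is a placeholder.
SAMPLE_NAMES = [
    "0\t|\tCandidatus Methylomirabilis oxyfera\t|\t\t|\tscientific name\t|\n",
    "0\t|\tArchaeoglobus fulgidus\t|\t\t|\tscientific name\t|\n",
]


def names_matching(lines, term):
    """Return (tax_id, name, name_class) for rows whose name contains term."""
    term = term.lower()
    hits = []
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 4 and term in fields[1].lower():
            hits.append((fields[0], fields[1], fields[3]))
    return hits
```

A hit here only shows that the name exists somewhere in NCBI taxonomy. Whether a 16S sequence for that taxon is in the 16S database has to be checked against the database itself.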

So if you want those 17 Methylomirabilis entries, you will have to compile your own database, as explained in the 'training your own database' section.

Please let me know if you have other questions,

Eddi