qunfengdong / BLCA


Dealing with subspecies in taxonomy mapping #16

Closed dswan closed 5 years ago

dswan commented 5 years ago

Hi, in your docs it says that for a custom database you need:

A taxonomy file with two columns, sequence ID in fasta file, and its taxonomy from superkingdom to species in the following format (The deliminator between the sequence ID and taxonomy information should be a tab [\t]):

I assumed this meant you'd require a 7-level taxonomy. However, if you actually use BLCA to parse the NCBI 16S database, you get entries like:

NR_157602.1     subspecies:Leuconostoc mesenteroides subsp. jonggajibkimchii;species:Leuconostoc mesenteroides;genus:Leuconostoc;family:Leuconostocaceae;order:Lactobacillales;class:Bacilli;phylum:Firmicutes;superkingdom:Bacteria;
NR_159094.1     subspecies:Macrococcus caseolyticus subsp. hominis;species:Macrococcus caseolyticus;genus:Macrococcus;family:Staphylococcaceae;order:Bacillales;class:Bacilli;phylum:Firmicutes;superkingdom:Bacteria;
NR_115038.2     subspecies:Clavibacter michiganensis subsp. sepedonicus;species:Clavibacter michiganensis;genus:Clavibacter;family:Microbacteriaceae;order:Micrococcales;class:Actinobacteria;phylum:Actinobacteria;superkingdom:Bacteria;

Is the inclusion of subspecies here an issue? Or the desired behaviour? These have 8 levels of taxonomic information.
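For anyone wanting to check their own taxonomy file for this, here is a minimal Python sketch that splits one line into its ID and rank:name pairs and counts the levels. It assumes the separator between the ID and the taxonomy is a single tab (as the docs describe) and that every field has the form `rank:name;`. The function name `parse_taxonomy_line` is made up for illustration and is not part of BLCA.

```python
def parse_taxonomy_line(line):
    """Split 'seq_id<TAB>rank:name;rank:name;...' into (seq_id, [(rank, name), ...])."""
    seq_id, taxonomy = line.rstrip("\n").split("\t", 1)
    pairs = []
    for field in taxonomy.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece left by the trailing ';'
        rank, name = field.split(":", 1)
        pairs.append((rank, name))
    return seq_id, pairs


# One of the entries quoted above, with the ID/taxonomy separator written as a tab.
line = (
    "NR_157602.1\t"
    "subspecies:Leuconostoc mesenteroides subsp. jonggajibkimchii;"
    "species:Leuconostoc mesenteroides;genus:Leuconostoc;"
    "family:Leuconostocaceae;order:Lactobacillales;class:Bacilli;"
    "phylum:Firmicutes;superkingdom:Bacteria;"
)
seq_id, pairs = parse_taxonomy_line(line)
ranks = [rank for rank, _ in pairs]
print(seq_id, len(ranks), "subspecies" in ranks)  # NR_157602.1 8 True
```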

dswan commented 5 years ago

OK, well obviously I can test this, and I can see that for instance, if you submit the sequence for:

>NR_074997.2 Leuconostoc gelidum subsp. gasicomitatum strain TB 1-10 16S ribosomal RNA, complete sequence

to the default NCBI database, then it will return:

superkingdom:Bacteria;100.0;phylum:Firmicutes;100.0;class:Bacilli;100.0;order:Lactobacillales;100.0;family:Leuconostocaceae;100.0;genus:Leuconostoc;100.0;species:Leuconostoc gelidum;100.0;

based on the parsed taxonomy of:

subspecies:Leuconostoc gelidum subsp. gasicomitatum;species:Leuconostoc gelidum;genus:Leuconostoc;family:Leuconostocaceae;order:Lactobacillales;class:Bacilli;phylum:Firmicutes;superkingdom:Bacteria;

This, however, is one of a couple of similar entries in the 16S database. If I 'poison' the database with a single non-16S sequence that has a valid taxonomy, and then query with the poison sequence, I can see that subspecies is never reported in the results.
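The check on the results can be sketched in Python roughly as follows. This assumes the taxonomy part of a result line alternates `rank:name;` and a confidence score, as in the output quoted above, and that any leading read ID has already been removed. `parse_blca_result` is a hypothetical helper, not a BLCA function.

```python
def parse_blca_result(result):
    """Turn 'rank:name;score;rank:name;score;...' into [(rank, name, score), ...]."""
    fields = [f for f in result.strip().split(";") if f]
    triples = []
    # Fields alternate between a 'rank:name' label and its confidence score.
    for label, score in zip(fields[0::2], fields[1::2]):
        rank, name = label.split(":", 1)
        triples.append((rank, name, float(score)))
    return triples


result = (
    "superkingdom:Bacteria;100.0;phylum:Firmicutes;100.0;class:Bacilli;100.0;"
    "order:Lactobacillales;100.0;family:Leuconostocaceae;100.0;"
    "genus:Leuconostoc;100.0;species:Leuconostoc gelidum;100.0;"
)
triples = parse_blca_result(result)
reported_ranks = [rank for rank, _, _ in triples]
print(len(triples), "subspecies" in reported_ranks)  # 7 False
```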

Is there any issue with including the subspecies as part of the species name and dropping the subspecies level from the taxonomy in a custom database? Or will BLCA use the subspecies information if appropriate? Having re-read the paper in a bit more detail, it does suggest that it can do this.
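To make the first option concrete, this is roughly the rewrite I have in mind for each taxonomy string, as a Python sketch. `fold_subspecies` is a made-up name, and whether BLCA handles the resulting species names as intended is exactly the open question, so treat it as an illustration only.

```python
def fold_subspecies(taxonomy):
    """Drop the 'subspecies' field and carry its name in the 'species' field instead."""
    fields = [f for f in taxonomy.strip().split(";") if f]
    pairs = [tuple(f.split(":", 1)) for f in fields]
    subspecies = dict(pairs).get("subspecies")
    if subspecies is None:
        return taxonomy  # no subspecies level, nothing to fold
    kept = []
    for rank, name in pairs:
        if rank == "subspecies":
            continue  # remove the eighth level
        if rank == "species":
            name = subspecies  # species now carries the full subspecies name
        kept.append(f"{rank}:{name}")
    return ";".join(kept) + ";"


taxonomy = (
    "subspecies:Leuconostoc gelidum subsp. gasicomitatum;"
    "species:Leuconostoc gelidum;genus:Leuconostoc;family:Leuconostocaceae;"
    "order:Lactobacillales;class:Bacilli;phylum:Firmicutes;superkingdom:Bacteria;"
)
folded = fold_subspecies(taxonomy)
print(folded)
```

Entries without a subspecies level are returned unchanged, so the function can be applied to every line of the taxonomy file.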

qunfengdong commented 5 years ago

With our current version, subspecies is ignored. This is because the vast majority of taxonomic entries at NCBI do not have a subspecies (only a few 16S entries do). If you truly want a version that can also include subspecies in the output, please let us know and we can modify the script.

On Mon, May 27, 2019 at 8:06 AM Dr. Daniel Swan wrote:

So I have a supplementary question - is subspecies ignored? Or is it just because there are >1 entries for this particular subspecies sequence and the parent node is what is being reported?


dswan commented 5 years ago

It would be nice for some genera but I suspect they'd be hard to tease apart anyway even with full-length 16S. I've incorporated the information into my custom DB but it doesn't matter that it's not going to be used, just wanted to check my understanding!

qunfengdong commented 5 years ago

In the next few weeks, we will update BLCA and add a flag for users who want subspecies information in the output. Our apologies for taking some time to get to it; we are working on a few other projects right now. Please stay tuned.
