Taxonomy classification missing strain info

kassammo commented 5 years ago

Hello,

I have tried the co-assembly method. While looking at the classification details in files 10. and 06., I noticed that I get the species but not the strain information in my classification. Is this normal? And how can I modify it to get more information about the strain?

For instance: I am expecting to get Rothia mucilaginosa DY-18 in my samples. I am getting only the information superkingdom:Bacteria;no rank:Terrabacteria group;phylum:Actinobacteria;class:Actinobacteria;order:Micrococcales;family:Micrococcaceae;genus:Rothia;species:Rothia mucilaginosa

I have checked the alltaxlist.txt file from the tool and the strain is there: 680646 Rothia mucilaginosa DY-18 no rank

Thanks,

Mohamed

fpusan commented 5 years ago

Dear Mohamed,

Currently we only report taxonomic ranks that are included as such in the NCBI taxonomy. Those would be superkingdom, phylum, class, order, family, genus and species. The main reason is that there can be an arbitrary number of "no_rank" entries between two ranked entries (e.g. corresponding to subclasses or superfamilies). These secondary ranks are very hard to work with when designing a Last Common Ancestor algorithm, because they may or may not be present in a given taxon.
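To illustrate why the ranked entries are easier to handle, here is a minimal sketch of a last-common-ancestor computation over lineage strings in the "rank:name;rank:name" form shown in the question. It is not SqueezeMeta's actual implementation; the function names and the RANKS list are assumptions made for this example.

```python
# Illustrative sketch only, NOT SqueezeMeta's real LCA code.
# RANKS is an assumed list of the ranks seen in the example lineage above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(lineage):
    """Turn 'rank:name;rank:name' into {rank: name}, dropping unranked entries."""
    ranked = {}
    for entry in lineage.split(";"):
        rank, _, name = entry.partition(":")
        if rank in RANKS:  # 'no rank' entries are skipped here
            ranked[rank] = name
    return ranked


def last_common_ancestor(lineages):
    """Return the deepest ranked prefix shared by every lineage."""
    parsed = [parse_lineage(lineage) for lineage in lineages]
    consensus = []
    for rank in RANKS:
        names = {p.get(rank) for p in parsed}
        # Stop at the first rank that is missing somewhere or disagrees.
        if len(names) != 1 or None in names:
            break
        consensus.append(f"{rank}:{names.pop()}")
    return ";".join(consensus)
```

Because every lineage is reduced to the same fixed set of ranks, two hits can be compared rank by rank even when only one of them carries a "no rank" entry such as "Terrabacteria group".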

To get the ORFs with hits to Rothia mucilaginosa DY-18 you can do the following with a bit of script fu:

Get the list of Rothia mucilaginosa DY-18 proteins from https://www.ncbi.nlm.nih.gov/genome/proteins/1812?genome_assembly_id=171700. Download it as a text file for convenience. You want the contents of the "Protein product" column.
Now go to the *.nr.diamond result from SqueezeMeta, which contains the hits of each ORF to the nr database. ORFs coming from your desired strain should have a fairly good hit to one of the proteins listed in the table you just downloaded.
Note that we can't really recommend using reference-based methods to resolve strains, unless you really focus on the accessory part of the genome. But if all or most of the Rothia mucilaginosa DY-18 proteins (core + accessory) were actually found in your assembly, that's a good indicator that the strain (or something really close to it) is really present in the metagenome.
Once you get the list of putative DY-18 contigs, you can also look in the results/DAS/*DASTool_bins folder to check whether they were all assigned to the same bin according to their abundances.
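The steps above could be scripted roughly as follows. This is a sketch under several assumptions that are not confirmed in this thread: that the *.nr.diamond file is a 12-column tab-separated DIAMOND/BLAST table (query, subject, percent identity, ..., bitscore), that the subject column holds the same accessions as the "Protein product" column, and that ORF names have the form contig_start-end. The function names and the 95% identity cutoff are hypothetical choices for illustration.

```python
def load_accessions(lines):
    """Collect protein accessions, one per line, skipping blanks and '#' lines."""
    return {ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")}


def orfs_hitting_strain(diamond_lines, accessions, min_identity=95.0):
    """Map each ORF to its best-scoring hit among the strain's proteins.

    ASSUMPTION: 12-column tabular format with the ORF in column 1, the
    subject accession in column 2, percent identity in column 3 and the
    bitscore in column 12. min_identity is an arbitrary example cutoff.
    """
    best = {}
    for line in diamond_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or header lines
        orf, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if subject in accessions and pident >= min_identity:
            if orf not in best or bitscore > best[orf][1]:
                best[orf] = (subject, bitscore)
    return {orf: hit[0] for orf, hit in best.items()}


def contigs_of(orfs):
    """ASSUMPTION: ORF ids look like '<contig>_<start>-<end>'."""
    return sorted({orf.rsplit("_", 1)[0] for orf in orfs})


def fraction_recovered(hits, accessions):
    """Share of the strain's proteins matched by at least one ORF."""
    return len(set(hits.values())) / len(accessions) if accessions else 0.0
```

In practice you would pass open file handles for the downloaded protein list and the *.nr.diamond file. A high recovered fraction, together with the contigs landing in a single bin, would support the presence of the strain or a close relative.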

kassammo commented 5 years ago

Hello,

Thanks for the answer. I am not looking specifically at Rothia mucilaginosa DY-18 alone. I wanted to compare these results with the ones I got from kraken to see how similar they are.

So this means that there is no possibility to go beyond species.

Second question: how difficult would it be to use a personal database for the taxonomy?

Thanks

fpusan commented 5 years ago

Our recent preprint actually compares the results of several taxonomic and functional annotation pipelines, including kraken and the one we use for SqueezeMeta (https://www.biorxiv.org/content/biorxiv/early/2019/01/16/522292.full.pdf), although not down to the strain level.

Regarding the personal taxonomy database, it's not impossible, but it's quite hard. In particular, the taxonomic ranks used for classification are currently hardcoded (so we can use the per-rank identity thresholds discussed in https://academic.oup.com/nar/article/42/8/e73/1076763). So it would take a fairly large hack to get our taxonomy to the strain level.

Once again, I wouldn't personally recommend doing unsupervised homology-based strain-level taxonomy in metagenomes, unless you get really good results on complex mock communities first. We're currently testing DESMAN (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5607848/), which resolves strains by looking at variants in core genes. So far we're getting good results, and we might consider including it in the SqueezeMeta pipeline in the future.

kassammo commented 5 years ago

Hello, thank you for all the information. I would be really happy to test a beta version of SqueezeMeta if one becomes available. I really like SqueezeMeta, especially for functional analysis and binning. One thing that is still difficult is the taxonomy assignment, because this is the first thing we look at after the first assembly. Thanks

kassammo commented 5 years ago

A quick question: if I want to try DESMAN, which file should I give for the configs?

Thanks

Mohamed

fpusan commented 5 years ago

We're not testing DESMAN with snakemake, so we can't provide you with a config.json file at the moment. We are currently writing and testing some scripts to generate DESMAN input files from the SqueezeMeta data. We roughly follow the tutorial at https://github.com/chrisquince/DESMAN/tree/master/complete_example, starting from the ExtractCountFreqGenes.py script. All the input data can be obtained by parsing SqueezeMeta results. SAM files can be found in the <project_name>/data directory, and you can convert them into sorted BAM files with samtools.
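For the SAM to sorted BAM conversion, a small wrapper around samtools could look like the sketch below. It assumes samtools (version 1.3 or later, so that `samtools sort` reads SAM directly) is on the PATH. The helper names and the example path are hypothetical and not part of SqueezeMeta.

```python
import subprocess
from pathlib import Path


def sam_to_sorted_bam_commands(sam_path):
    """Build the samtools commands that sort and index one SAM file.

    The output name '<stem>.sorted.bam' is an arbitrary choice for this sketch.
    """
    sam = Path(sam_path)
    bam = sam.with_name(sam.stem + ".sorted.bam")
    return [
        ["samtools", "sort", "-o", str(bam), str(sam)],  # coordinate-sorted BAM
        ["samtools", "index", str(bam)],                 # writes <bam>.bai
    ]


def convert(sam_path):
    """Run the commands in order, raising if samtools exits with an error."""
    for cmd in sam_to_sorted_bam_commands(sam_path):
        subprocess.run(cmd, check=True)
```

Calling `convert()` on each SAM file in the data directory would give one sorted, indexed BAM file per sample.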

kassammo commented 5 years ago

Ok I see.

I will try to install the tool. I will let you know if I need some input. Do you think that I can use the coassembly results for the SNP analysis?

Thanks

fpusan commented 5 years ago

It's worth trying, although we are still familiarizing ourselves with DESMAN. I'd say that if you have good mapping percentages in your *mappingstat file, there is a good chance it will work.

jtamames / SqueezeMeta

Taxonomy classification missing strain info #11