jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

Is low taxonomic resolution to be expected? #838

Closed chrismitbiz closed 1 month ago

chrismitbiz commented 1 month ago

Hi all,

Thanks for the awesome tool. I am new to metagenomics and unsure whether this is a SqueezeMeta-specific issue or a common one (sorry if this is a repeat question).

Few of the bins in my data are classified to species level, and I wonder whether that is to be expected.

If I understand my results correctly, only 9 out of 275 bins were assigned to species level and 46 to genus level. Of the 45 high-quality bins (> 90% complete and < 10% contamination), only 1 was assigned a taxon at species level and 7 at genus level. Is that expected? My sense is that it is too low.
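For scale, the counts quoted above can be turned into percentages. This is only a minimal sketch of the arithmetic: the counts come from the post, while the `percent` helper and the `summary` labels are names introduced here for illustration.

```python
# Counts reported in the post above.
total_bins = 275
high_quality_bins = 45


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    "all bins, species": percent(9, total_bins),
    "all bins, genus": percent(46, total_bins),
    "high-quality bins, species": percent(1, high_quality_bins),
    "high-quality bins, genus": percent(7, high_quality_bins),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

So roughly 3% of all bins and 2% of the high-quality bins reach species level, which is the figure the question is about.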

I ran six samples (anaerobic sludge) in co-assembly mode with megahit as follows:

SqueezeMeta.pl -m coassembly -p ADSLUDGE -s trimmed-ADsludge.samples -f fastp_trimmed -a megahit -c 200 -map bowtie --nopfam -t 32

CONTIGS:
Number of contigs: 1,889,160
Total length: 1,524,912,006 bases
Longest contig: 472,058 bases
Shortest contig: 200 bases
N50: 966 bases
N90: 359 bases

------------------- Statistics on bins
                                              DAS
Number of bins                                275
Complete >= 50%                               168
Complete >= 75%                               117
Complete >= 90%                               59
Contamination < 10%                           210
Contamination >= 50%                          6
Congruent bins                                58
Disparity >0                                  217
Disparity >= 0.25                             62
Hi-qual bins (>90% complete,<10% contam)      45
Good-qual bins (>75% complete,<10% contam)    88

-- Created by 10.mapsamples.pl, Thu May 2 05:26:14 2024
Sample     Total reads   Mapped reads   Mapping perc   Total bases
Chris_11   33312012      29029376       87.14          4866132042
Chris_12   30583940      26555482       86.83          4474878378
Chris_13   35009536      25089008       71.66          5093179376
Chris_14   34570590      30073255       86.99          5069742537
Chris_15   32645912      27136439       83.12          4704294304
Chris_16   35602780      27084697       76.07          5171590316

I trimmed adapters and poly-Gs with fastp, using the full adapter sequences (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA and AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT). Quality in FastQC looked great.

Thanks for any advice. Cheers, Chris

jtamames commented 1 month ago

Hello,

Hmm, it is rather difficult to say what would be "expected". We have been deliberately conservative in everything related to taxonomic annotation. For instance, we set identity thresholds for each taxonomic rank, so that we do not assign a gene to species rank unless it is at least 90% identical to the orthologous gene from that species. The rationale is that proteins belonging to the same species are statistically observed to have at least that identity. SqueezeMeta diverges from other tools in details like this one: we favor quality over quantity. The details can be found in the manual.

That said, there is a file called parameters.pl in the project directory that controls several behaviors of the pipeline. For instance, by altering the values of $minconsperc_asig16 and $minconsperc_total16 you can adjust the confidence thresholds for the taxonomic assignment of the bins. Altering $minconsperc_asig9 and $minconsperc_total9 affects the assignment of contigs (and hence that of bins). Save your previous results, change these parameters, and run the taxonomic assignments again (restarting at step 16, or at step 9 if you change the contig annotations, etc.).

There are other ways of tinkering with the results, but for now I think you can try these.

Best, J
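To illustrate how a pair of consensus cutoffs of this kind can change an assignment, here is a minimal sketch. It is not SqueezeMeta's code. It assumes that the "asig" value is the minimum fraction of taxonomically assigned genes that must agree on a taxon, and that the "total" value is the minimum fraction of all genes that must agree; check the manual for the exact definitions. The function name, the example taxa and the cutoff values are hypothetical and are not the pipeline defaults.

```python
from collections import Counter


def consensus_taxon(gene_taxa, min_asig, min_total):
    """Return the most common taxon if it passes both cutoffs, else None.

    gene_taxa: one entry per gene; None means no assignment at this rank.
    min_asig:  minimum fraction of the assigned genes that must agree (assumed meaning).
    min_total: minimum fraction of all genes that must agree (assumed meaning).
    """
    assigned = [taxon for taxon in gene_taxa if taxon is not None]
    if not assigned:
        return None
    taxon, votes = Counter(assigned).most_common(1)[0]
    frac_assigned = votes / len(assigned)
    frac_total = votes / len(gene_taxa)
    if frac_assigned >= min_asig and frac_total >= min_total:
        return taxon
    return None


# Hypothetical contig: 10 genes, 4 agree on one genus, 1 disagrees, 5 are unassigned.
genes = ["Methanothrix"] * 4 + ["Syntrophus"] + [None] * 5

# Illustrative cutoffs only: 4/5 of assigned genes agree, but only 4/10 of all genes.
strict = consensus_taxon(genes, min_asig=0.6, min_total=0.5)
relaxed = consensus_taxon(genes, min_asig=0.6, min_total=0.3)

print("strict:", strict)
print("relaxed:", relaxed)
```

In this toy case the stricter "total" cutoff leaves the contig unassigned at genus rank, while the relaxed one assigns it. That is the trade-off between resolution and confidence described above.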

fpusan commented 1 month ago

To add to this, our very conservative approach is needed because the main source of taxonomy in SqueezeMeta is individual contigs (binning is optional), and individual contigs are on average hard to annotate taxonomically at high resolution. Reasonably complete bins can be easier, however, since they contain many high-resolution marker genes. So if you are interested in your bins, you can also annotate them with GTDB-Tk, which will produce reliable high-resolution taxonomies.

chrismitbiz commented 1 month ago

OK, thank you both for your advice.