jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

SQMtools taxonomy output #776

Closed: verubel closed this issue 5 months ago

verubel commented 5 months ago

I have a question about the output tables created by SQMtools. I am interested in the taxonomy output tables, e.g. "genus.nofilter.abund". Is it possible to extract the individual nucleotide sequence that was used for each assignment, or to extract the NCBI accession numbers for each row?

fpusan commented 5 months ago

Hi! Yes, this info can be found in /path/to/project/intermediate/04.project.nr.diamond. For each ORF in your dataset, this file contains all the NCBI accession numbers to which it aligned. Some of those NCBI sequences may not end up being used for the assignment (e.g. if their identity is too low or their e-value is too high). See the manual for more details on how taxonomic assignment is done.
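A minimal sketch of how those accession numbers could be collected per ORF. It assumes the file is in the 12-column BLAST-style tabular format (ORF ID in column 1, subject accession in column 2, percent identity in column 3, e-value in column 11). That layout is an assumption, so check a few lines of your own file first. The function name `hits_per_orf` and the optional thresholds are hypothetical and are not SqueezeMeta's actual cutoffs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def hits_per_orf(
    lines: Iterable[str],
    min_identity: float = 0.0,
    max_evalue: float = float("inf"),
) -> Dict[str, List[str]]:
    """Group subject accessions by ORF, in file order.

    Assumed columns (0-based): 0 = ORF ID, 1 = subject accession,
    2 = percent identity, 10 = e-value. The thresholds are illustrative only.
    """
    hits: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines, if any
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a full tabular record
        orf, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if identity >= min_identity and evalue <= max_evalue:
            hits[orf].append(subject)
    return dict(hits)


# Made-up records, only to show the shape of the result.
demo = [
    "orf_1\tACC_A\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "orf_1\tACC_B\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-3\t50",
    "orf_2\tACC_C\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t120",
]
print(hits_per_orf(demo))
# {'orf_1': ['ACC_A', 'ACC_B'], 'orf_2': ['ACC_C']}
```

To run it on a real project, pass an open file handle for /path/to/project/intermediate/04.project.nr.diamond instead of `demo`.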

verubel commented 5 months ago

Hi, thanks for the response, but I am still wondering how I can get the NCBI accession numbers (or sequences) for the individual OTUs that SQMtools produces in the taxonomic assignment tables at the species/genus/family level, etc. Even with the manual, I am still not sure how to extract the information I need. In the SQMtools taxonomic assignment output, only a fraction of the OTUs is available, and I cannot match them to the ORFs in /path/to/project/intermediate/04.project.nr.diamond

fpusan commented 5 months ago

What do you mean exactly by OTUs? We do not produce those...

jtamames commented 5 months ago

Hello

I assume you are referring to the SqueezeMeta taxonomic annotation (SQMtools does not produce annotations; it just uses the ones provided by the SqueezeMeta analysis). The short answer is that there is no way to retrieve a single reference (GenBank sequence) used for the annotation, because the annotation is based not on a single sequence but on a range of them. Please check the manual for details on how the taxonomic annotation is done. The closest thing you can get, as Fernando says, is the best hit for each ORF, which can be found in the /path/to/project/intermediate/04.project.nr.diamond file. Just take the first hit for each ORF.

Hope it helps.

Best, J
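Taking the first hit for each ORF, as suggested above, could look like the sketch below. It assumes a tab-separated file with the ORF ID in the first column and the subject accession in the second, with the hits for each ORF listed best first. `first_hit_per_orf` is an illustrative name, not part of SqueezeMeta or SQMtools.

```python
from typing import Dict, Iterable


def first_hit_per_orf(lines: Iterable[str]) -> Dict[str, str]:
    """Map each ORF to the subject accession of its first listed hit.

    Assumed columns (0-based): 0 = ORF ID, 1 = subject accession.
    """
    first: Dict[str, str] = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines, if any
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        orf, subject = fields[0], fields[1]
        # setdefault keeps the first accession seen and ignores later hits.
        first.setdefault(orf, subject)
    return first


# Made-up records, only to show the shape of the result.
sample = [
    "orf_1\tACC_A\t95.0",
    "orf_1\tACC_B\t40.0",
    "orf_2\tACC_C\t80.0",
]
print(first_hit_per_orf(sample))
# {'orf_1': 'ACC_A', 'orf_2': 'ACC_C'}
```

Note that this returns one accession per ORF, not per row of a taxonomy table such as "genus.nofilter.abund", since each taxon there aggregates many ORFs.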