jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

large percentage of unclassified taxon from family level. #604

Closed by jahid10 1 year ago

jahid10 commented 1 year ago

Hi, I am using SQMtools in R to analyze my SqueezeMeta run. However, I'm seeing a large number of unclassified contigs from the family level downwards. I understand this means the contigs didn't have significant matches at that taxonomic level, so they could only be classified at a higher level. However, unclassified taxa can be ignored in the plotTaxonomy function. I want to know the ramifications of ignoring unclassified taxa. For example, if I ignore the unclassified portion, will that affect the credibility of the data if I want to publish it? What is the significance of ignoring unclassified taxa? I have seen many papers that only show classified taxa. Can you please give me some suggestions?

fpusan commented 1 year ago

Hi! What version of SqueezeMeta is this? Are some of these samples metatranscriptomes?

jahid10 commented 1 year ago

Hi, this is SqueezeMeta 1.6. No, these samples are metagenomes, and they are from an aquatic fermentative environment.

fpusan commented 1 year ago

How come there are no Unmapped reads? Or is this a sqm_reads.pl or sqm_longreads.pl project?

jahid10 commented 1 year ago

I used the ignore_unmapped = T option when calling the plotTaxonomy function.

fpusan commented 1 year ago

Oh, ok. Then the answer to your question is: it depends on what you are doing with the data.

If you are discussing patterns within samples, then it is ok to remove the Unmapped and Unclassified groups. E.g. the sentence "Sample SRW has more Chromobacteriaceae than Moraxellaceae" will be true regardless of whether the Unmapped and Unclassified groups are included.

However, if you are discussing patterns between samples, it gets trickier. E.g. the sentence "There is the same proportion of Methanosarcinae in sample SRW_d05 as in SRW_d16" is true for the figure as it is now, but it would no longer be true if you removed the Unclassified reads and rescaled the data.

In any case you still should report the percentage of unmapped and unclassified reads somewhere in the manuscript, even if you end up not including them in the figure.
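The within-sample versus between-sample point can be checked with a little arithmetic. The sketch below uses made-up percentages and hypothetical names (`sample_A`, `Family_X`, `drop_and_rescale`); it is not derived from the figure in this thread and does not call SQMtools. It only illustrates what dropping a group and rescaling to 100% does.

```python
# Hypothetical relative abundances (percent of reads); not the data from this issue.
samples = {
    "sample_A": {"Family_X": 20.0, "Family_Y": 10.0, "Unclassified": 70.0},
    "sample_B": {"Family_X": 20.0, "Family_Y": 40.0, "Unclassified": 40.0},
}


def drop_and_rescale(abundances, ignored=("Unclassified", "Unmapped")):
    """Remove the ignored groups and rescale the remaining taxa to sum to 100."""
    kept = {taxon: value for taxon, value in abundances.items() if taxon not in ignored}
    total = sum(kept.values())
    return {taxon: 100.0 * value / total for taxon, value in kept.items()}


rescaled = {name: drop_and_rescale(abundances) for name, abundances in samples.items()}

for name in samples:
    before = samples[name]
    after = rescaled[name]
    # Within a sample, the ratio between two taxa is unchanged by rescaling.
    print(name, "X/Y before:", before["Family_X"] / before["Family_Y"],
          "X/Y after:", round(after["Family_X"] / after["Family_Y"], 3))

# Between samples, Family_X is equal before rescaling but differs afterwards.
print("Family_X before:", [samples[n]["Family_X"] for n in samples])
print("Family_X after: ", [round(rescaled[n]["Family_X"], 1) for n in samples])
```

In this toy case Family_X makes up 20% of both samples before rescaling, but the rescaled values differ because the two samples lose different amounts of Unclassified reads.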

jahid10 commented 1 year ago

That answers my question. Thank you. However, I'm worried about the percentage of unmapped and unclassified taxa I found in my samples. Can you suggest anything that could improve this? Should I run a sqm_reads.pl project, or should I lower the cutoff values in the LCA algorithm?

jtamames commented 1 year ago

Hello,

Regarding unmapped reads, you can use the --singletons option to include the unmapped reads as new contigs. Be aware that this will substantially increase the computing time. Analyzing reads with sqm_reads or sqm_longreads is another option, yes.

Regarding unclassified, the first thing you can try is removing the identity filters we use in the annotation (see the manual or the wiki for a detailed explanation). In the results directory you will see two 06*wranks files, one of which is labelled "noidfilter". This contains the annotations without the identity filters. Just rename the "noidfilter" file with the name of the other wranks file, and rerun steps 09, 11, 13, 19 and 21. I would keep copies of all the original files to preserve the results. You would probably see more annotations, reaching deeper in the taxonomy.

You can also set the parameters of the LCA algorithm in the file parameters.txt in the project directory. You can try changing the ones affecting step 6, and then rerun step 6.

But in my opinion, quality is always preferable to quantity. Having more results that are less trustworthy is not a good deal to me. It's up to you which you prefer.

Best, Javier
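The file swap described above could be scripted along these lines. This is only a sketch under assumptions: the function name `swap_in_noidfilter`, the backup folder `wranks_backup`, and the demo file names are hypothetical, and the script copies the "noidfilter" file over the other one rather than renaming it, so the original "noidfilter" file also stays in place. It does not rerun any pipeline steps.

```python
import glob
import os
import shutil
import tempfile


def swap_in_noidfilter(results_dir, backup_subdir="wranks_backup"):
    """Back up both 06*wranks files, then overwrite the filtered one
    with the contents of the "noidfilter" one.

    Returns the path of the file that was overwritten.
    """
    candidates = sorted(glob.glob(os.path.join(results_dir, "06*wranks")))
    noid = [p for p in candidates if "noidfilter" in os.path.basename(p)]
    filtered = [p for p in candidates if "noidfilter" not in os.path.basename(p)]
    if len(noid) != 1 or len(filtered) != 1:
        raise RuntimeError(f"Expected exactly two 06*wranks files, found: {candidates}")

    # os.makedirs fails if the backup folder already exists, so a second run
    # cannot silently overwrite the backups of the original files.
    backup_dir = os.path.join(results_dir, backup_subdir)
    os.makedirs(backup_dir)
    for path in candidates:
        shutil.copy2(path, backup_dir)

    shutil.copy2(noid[0], filtered[0])
    return filtered[0]


# Demo on a throwaway directory with hypothetical file names and contents.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "06.demo.wranks"), "w") as fh:
        fh.write("filtered\n")
    with open(os.path.join(demo_dir, "06.demo.noidfilter.wranks"), "w") as fh:
        fh.write("no identity filter\n")

    replaced = swap_in_noidfilter(demo_dir)
    replaced_name = os.path.basename(replaced)
    with open(replaced) as fh:
        active_content = fh.read()
    with open(os.path.join(demo_dir, "wranks_backup", "06.demo.wranks")) as fh:
        backup_content = fh.read()

print(replaced_name, "now holds the annotations without identity filters")
```

After a swap like this on a real project, the steps listed above still need to be rerun for the change to reach the final tables.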

jahid10 commented 1 year ago

Thank you. I have performed read-based taxonomic classification using Centrifuge v1.04, and I got fewer unclassified taxa at the genus level. My concern here is that when I subset the data by a specific function, as shown in the wiki, I also get a number of unclassified taxa for that function.

jtamames commented 1 year ago

That can happen. A gene can have a functional classification but not a taxonomic one. That's why you are getting unclassified when subsetting functions. Regarding Centrifuge, as I said above, quantity is not the same as quality. Some time ago we wrote a paper about that, should you want to check it: https://pubmed.ncbi.nlm.nih.gov/31823721/

Best, J
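A toy illustration of why a function subset can still contain unclassified taxa. The gene table, the function labels (`FUNC_A`, `FUNC_B`) and the helper `taxa_for_function` are all hypothetical and do not reflect the SQMtools data structures. The point is only that function and taxonomy are assigned to each gene independently.

```python
from collections import Counter

# Hypothetical annotations: function and family are assigned independently,
# so either one can be missing for a given gene.
genes = [
    {"gene": "gene_1", "function": "FUNC_A", "family": "Family_X"},
    {"gene": "gene_2", "function": "FUNC_A", "family": "Unclassified"},
    {"gene": "gene_3", "function": "FUNC_B", "family": "Family_Y"},
    {"gene": "gene_4", "function": None, "family": "Family_X"},
]


def taxa_for_function(gene_table, function_id):
    """Count family-level assignments among genes annotated with one function."""
    return Counter(g["family"] for g in gene_table if g["function"] == function_id)


subset = taxa_for_function(genes, "FUNC_A")
print(dict(subset))
```

Here `gene_2` has a function but no family-level taxonomy, so it shows up as Unclassified in the FUNC_A subset.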