Closed seppedm closed 3 years ago
Hello RNAs are also annotated in the final results (table number 13). The masked sequences are just used for not predicting a protein where we already know there is a RNA. RNA predictions are done first (barrnap/aragorn in step 2), and then CDS predictions follow (Prodigal in step 3). But RNAs detected in step 2 are included in the gff file that is used to quantify reads (in step 10).
@fpusan will surely tell you more, but SQMtools is using all annotations. Therefore, all counts you see include also the RNAs contribution.
Best, Javier
Actually, SQMtools loads all the ORFs, but RNAs are not used in plotTaxonomy
(as we only use protein coding sequences for taxonomic annotation).
You can easily see what proportion of reads are assigned to RNAs with SQMtools.
### Assuming your project was loaded under the SQM variable
# How many ORFs (not reads!) are RNAs?
table(SQM$orfs$table$Molecule) # CDS are protein-coding, the rest are different types of RNAs
# Which ORFs are RNAs? (or rather, not CDS)
RNAorfs = rownames(SQM$orfs$table)[SQM$orfs$table$Molecule != 'CDS']
# What percent of the total ORF abundance comes from RNA-encoding ORFs?
percent = 100 * colSums(SQM$orfs$abund[RNAorfs ,]) / colSums(SQM$orfs$abund)
Thank you both for your superquick responses!! If I get it correctly, those reads assigned to rRNA's and tRNA's then appear in the plotTaxonomy plots as "unclassified"? This would explain (quite a big) part of the problem with low taxonomic assignment as put forward in previous topics on metatranscriptomics, eg. the 27feb comment on this thread. Does that also happen in sqm_reads.pl?
If this is the case, a suggestion for the next SQMtools version, would be to indicate that part of the "unclassified" reads are assigned to rRNA :).
Uh... yes, I think they would show as "Unclassified" and that this would explain the results if rRNAs had not been depleted before by some kit. This makes so much sense. I can code a function to remove RNA orfs so that plotTaxonomy can be done over the subset of "classifyable" reads. Does this make sense to you too @jtamames. I went quickly over a test dataset and the RNA orfs indeed had no taxonomic classification.
rRNA depletion can only remove part of the rRNA present, so substantial amounts will still remain (our samples were also ribodepleted). For the graph provided in the mentioned thread, the percentages that I got from the code you provided are the following:
T3 T6 T38 T41 T111 T114 T146 T149 T219 T222 T254 T257
40.53900 23.20716 51.70015 51.82594 34.43555 45.62397 56.87413 43.06679 51.44557 43.00561 43.83011 53.74397
And abundances, for comparison with the graph:
T3 T6 T38 T41 T111 T114 T146 T149 T219 T222 T254 T257
22516487 10262516 28312372 27885943 15393361 23559872 31496414 24920986 27632513 22910132 24116293 37406395
As I'm redrawing the graphs manually in ggplot, I can perfectly write the code for including these myself, but of course feel free to share it for other interested people!
An additional reason why the problem wasn't noticed before was that we actually use contig taxonomy, rather than ORF taxonomy, for making the taxonomy plots. In a metagenome, rRNAs and tRNAs will often be in the same contig as other protein coding genes, so the contig will get classified and happiness will ensue. However this will not happen in a metatranscriptome (save for some polycystronic transcripts, but I don't think that applies here) and also rRNA genes are much more represented there...
Regarding the filtering, it is quite straightforward. Something along these lines should work.
CDSorfs = rownames(SQM$orfs$table)[SQM$orfs$table$Molecule == 'CDS']
newSQM = subsetORFs(SQM, CDSorfs)
plotTaxonomy(newSQM, rescale=T)
Thank you very much for all these insights!!! It really solves some big questions I have had for a long while. The subsetting is loading now, but for 12 million ORFs that takes a looong while.
Sidequestion I've been wondering about for a while: Is there a way to save subsetted SQM objects, since it would take very long to reload this data every time I want to continue working on my data, as I have no way to keep it loaded
You can save them just as any other R object (https://stat.ethz.ch/R-manual/R-devel/library/base/html/save.html).
In my experience R is not super fast while saving/loading session data so maybe not such a big improvement.
Just to double check, you are using engine='data.table
to load your data, right?
I am indeed using engine='data.table
. Saving the object works (just save(object, file = "path"
and object <- load("path")
, but loading doesn't give me an object. In this case I expect it to be quite a bit quicker than reloading and subsetting every time.
It has loaded and a substantially bigger percentage of the reads is counted as assigned now, as shown in the plots: (yay!) Original percentage plots Percentage plots after removing rRNAs from SQM object
However, it seems that part of the reads that were 'taxonomically assigned' have disappeared. For example, the unexpectedly large amount of acidobacteria in T257 have disappeared, but reads have disappeared over the whole range of samples. How would you explain this?
Original abundance plots Abundance plots after removing rRNAs from SQM object
What are exactly the four plots you sent?
Nevermind, just try this for the subsetting
newSQM = subsetORFs(SQM, CDSorfs, tax_source='contigs')
An idea I just had: you say taxonomic plots are made on contig taxonomy, so counting the number of reads that map on the contig? If there thus are some polycistronic contigs with both a protein coding gene and an rRNA transcript on it, the reads mapping to that rRNA part, will also be counted in the taxonomic plots, right? This also potentially explains the differences portrayed in the plots.
First two plots are the taxplots before and after filtering the coding ORFs in percentages. The third and fourth plot are the same, but in raw reads and without the unmapped reads, to see the absolute numbers.
Did you try the adding the tax_source='contigs'
part? It should not get rid of extra reads.
I'm sorry, I responded too quickly, I didn't yet, tomorrow I will let you know the result, because the subsetting takes more than two hours it seems
Two hours with engine=data.table
, wow. Let me know if it works. I'm working on a patch for next version so that there's no need to do the subsetting. Those contigs will be classified as "No CDS" instead, so it will be very easy to distinguish non-protein-coding reads from truly Unclassified reads.
Hi Fernando,
okay, today with a less overloaded server it took only 1.5h. Because it's transcriptomics, there are lots of small contigs, which makes for large objects I guess.
I'm happy to tell you your tax_source='contigs'
fix works!! The graphs now show the exact same percentages as before.
However, my question from four posts earlier still stands: I would like to ignore the reads that are mapped to rRNA parts of the contigs, because these would skew the taxonomic classification towards the polycistronic contigs which partly consist of rRNA (either rightfully, or because they may be chimaeras). Maybe just newSQM = subsetORFs(SQM, CDSorfs)
does give a more 'correct' representation of the ratios of protein coding sequences per taxon?
The newSQM
object resulting from newSQM = subsetORFs(SQM, CDSorfs, tax_source='contigs')
will remove all contigs that do not contain at least one protein-coding gene. So if I get your previous post correctly, it should be exactly what you want, right?
After playing around with it for a bit, I do get that it gives me what I need.
Thank you so much for your help, this has really improved my understanding of the data a lot!!
Hi all,
An additional question here: if I use sqm_reads.pl, are rRNA reads then included in taxonomic annotation, or is annotation limited to CDS sequences?
Thanks! Seppe
Hello No, sqm_reads runs a blastx search, therefore it will identify just protein matches. Best, Javier
Thank you!!
Dear developers,
For my analysis, I'm trying to disentangle what happens to the rRNAs (and tRNAs) that have been identified in the first steps. I see that in the intermediate folder, there is a maskedrna.fasta file, but is this file also the one used for the further steps?
Is there a way to see what proportion of the reads are assigned to t/rRNA?
If I load a project into SQMtools (without binning), are these reads and assignments still included in the analysis? In other words: if I look at my plotTaxonomy plots, are these assigned reads only the non-rRNA reads, or all of them? And if so, is there a way to only look at the non-rRNA reads?
Thanks!