jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
363 stars 78 forks source link

rRNA in pipeline #279

Closed seppedm closed 3 years ago

seppedm commented 3 years ago

Dear developers,

For my analysis, I'm trying to disentangle what happens to the rRNAs (and tRNAs) that have been identified in the first steps. I see that in the intermediate folder, there is a maskedrna.fasta file, but is this file also the one used for the further steps?

Is there a way to see what proportion of the reads are assigned to t/rRNA?

If I load a project into SQMtools (without binning), are these reads and assignments still included in the analysis? In other words: if I look at my plotTaxonomy plots, are these assigned reads only the non-rRNA reads, or all of them? And if so, is there a way to only look at the non-rRNA reads?

Thanks!

jtamames commented 3 years ago

Hello RNAs are also annotated in the final results (table number 13). The masked sequences are just used for not predicting a protein where we already know there is a RNA. RNA predictions are done first (barrnap/aragorn in step 2), and then CDS predictions follow (Prodigal in step 3). But RNAs detected in step 2 are included in the gff file that is used to quantify reads (in step 10).

@fpusan will surely tell you more, but SQMtools is using all annotations. Therefore, all counts you see include also the RNAs contribution.

Best, Javier

fpusan commented 3 years ago

Actually, SQMtools loads all the ORFs, but RNAs are not used in plotTaxonomy (as we only use protein coding sequences for taxonomic annotation). You can easily see what proportion of reads are assigned to RNAs with SQMtools.

### Assuming your project was loaded under the SQM variable
# How many ORFs (not reads!) are RNAs?
table(SQM$orfs$table$Molecule) # CDS are protein-coding, the rest are different types of RNAs
# Which ORFs are RNAs? (or rather, not CDS)
RNAorfs = rownames(SQM$orfs$table)[SQM$orfs$table$Molecule != 'CDS']
# What percent of the total ORF abundance comes from RNA-encoding ORFs?
percent = 100 * colSums(SQM$orfs$abund[RNAorfs ,]) / colSums(SQM$orfs$abund)
seppedm commented 3 years ago

Thank you both for your superquick responses!! If I get it correctly, those reads assigned to rRNA's and tRNA's then appear in the plotTaxonomy plots as "unclassified"? This would explain (quite a big) part of the problem with low taxonomic assignment as put forward in previous topics on metatranscriptomics, eg. the 27feb comment on this thread. Does that also happen in sqm_reads.pl?

If this is the case, a suggestion for the next SQMtools version, would be to indicate that part of the "unclassified" reads are assigned to rRNA :).

fpusan commented 3 years ago

Uh... yes, I think they would show as "Unclassified" and that this would explain the results if rRNAs had not been depleted before by some kit. This makes so much sense. I can code a function to remove RNA orfs so that plotTaxonomy can be done over the subset of "classifyable" reads. Does this make sense to you too @jtamames. I went quickly over a test dataset and the RNA orfs indeed had no taxonomic classification.

seppedm commented 3 years ago

rRNA depletion can only remove part of the rRNA present, so substantial amounts will still remain (our samples were also ribodepleted). For the graph provided in the mentioned thread, the percentages that I got from the code you provided are the following:

      T3       T6      T38      T41     T111     T114     T146     T149     T219     T222     T254     T257 
40.53900 23.20716 51.70015 51.82594 34.43555 45.62397 56.87413 43.06679 51.44557 43.00561 43.83011 53.74397 

And abundances, for comparison with the graph:
      T3       T6      T38      T41     T111     T114     T146     T149     T219     T222     T254     T257 
22516487 10262516 28312372 27885943 15393361 23559872 31496414 24920986 27632513 22910132 24116293 37406395

As I'm redrawing the graphs manually in ggplot, I can perfectly write the code for including these myself, but of course feel free to share it for other interested people!

fpusan commented 3 years ago

An additional reason why the problem wasn't noticed before was that we actually use contig taxonomy, rather than ORF taxonomy, for making the taxonomy plots. In a metagenome, rRNAs and tRNAs will often be in the same contig as other protein coding genes, so the contig will get classified and happiness will ensue. However this will not happen in a metatranscriptome (save for some polycystronic transcripts, but I don't think that applies here) and also rRNA genes are much more represented there...

Regarding the filtering, it is quite straightforward. Something along these lines should work.

CDSorfs = rownames(SQM$orfs$table)[SQM$orfs$table$Molecule == 'CDS']
newSQM = subsetORFs(SQM, CDSorfs)
plotTaxonomy(newSQM, rescale=T)
seppedm commented 3 years ago

Thank you very much for all these insights!!! It really solves some big questions I have had for a long while. The subsetting is loading now, but for 12 million ORFs that takes a looong while.

Sidequestion I've been wondering about for a while: Is there a way to save subsetted SQM objects, since it would take very long to reload this data every time I want to continue working on my data, as I have no way to keep it loaded

fpusan commented 3 years ago

You can save them just as any other R object (https://stat.ethz.ch/R-manual/R-devel/library/base/html/save.html). In my experience R is not super fast while saving/loading session data so maybe not such a big improvement. Just to double check, you are using engine='data.table to load your data, right?

seppedm commented 3 years ago

I am indeed using engine='data.table. Saving the object works (just save(object, file = "path" and object <- load("path"), but loading doesn't give me an object. In this case I expect it to be quite a bit quicker than reloading and subsetting every time.

It has loaded and a substantially bigger percentage of the reads is counted as assigned now, as shown in the plots: (yay!) Original percentage plots afbeelding Percentage plots after removing rRNAs from SQM object afbeelding

However, it seems that part of the reads that were 'taxonomically assigned' have disappeared. For example, the unexpectedly large amount of acidobacteria in T257 have disappeared, but reads have disappeared over the whole range of samples. How would you explain this?

Original abundance plots afbeelding Abundance plots after removing rRNAs from SQM object afbeelding

fpusan commented 3 years ago

What are exactly the four plots you sent?

fpusan commented 3 years ago

Nevermind, just try this for the subsetting newSQM = subsetORFs(SQM, CDSorfs, tax_source='contigs')

seppedm commented 3 years ago

An idea I just had: you say taxonomic plots are made on contig taxonomy, so counting the number of reads that map on the contig? If there thus are some polycistronic contigs with both a protein coding gene and an rRNA transcript on it, the reads mapping to that rRNA part, will also be counted in the taxonomic plots, right? This also potentially explains the differences portrayed in the plots.

First two plots are the taxplots before and after filtering the coding ORFs in percentages. The third and fourth plot are the same, but in raw reads and without the unmapped reads, to see the absolute numbers.

fpusan commented 3 years ago

Did you try the adding the tax_source='contigs' part? It should not get rid of extra reads.

seppedm commented 3 years ago

I'm sorry, I responded too quickly, I didn't yet, tomorrow I will let you know the result, because the subsetting takes more than two hours it seems

fpusan commented 3 years ago

Two hours with engine=data.table, wow. Let me know if it works. I'm working on a patch for next version so that there's no need to do the subsetting. Those contigs will be classified as "No CDS" instead, so it will be very easy to distinguish non-protein-coding reads from truly Unclassified reads.

seppedm commented 3 years ago

Hi Fernando, okay, today with a less overloaded server it took only 1.5h. Because it's transcriptomics, there are lots of small contigs, which makes for large objects I guess. I'm happy to tell you your tax_source='contigs' fix works!! The graphs now show the exact same percentages as before.

However, my question from four posts earlier still stands: I would like to ignore the reads that are mapped to rRNA parts of the contigs, because these would skew the taxonomic classification towards the polycistronic contigs which partly consist of rRNA (either rightfully, or because they may be chimaeras). Maybe just newSQM = subsetORFs(SQM, CDSorfs) does give a more 'correct' representation of the ratios of protein coding sequences per taxon?

fpusan commented 3 years ago

The newSQM object resulting from newSQM = subsetORFs(SQM, CDSorfs, tax_source='contigs') will remove all contigs that do not contain at least one protein-coding gene. So if I get your previous post correctly, it should be exactly what you want, right?

seppedm commented 3 years ago

After playing around with it for a bit, I do get that it gives me what I need.

Thank you so much for your help, this has really improved my understanding of the data a lot!!

seppedm commented 3 years ago

Hi all,

An additional question here: if I use sqm_reads.pl, are rRNA reads then included in taxonomic annotation, or is annotation limited to CDS sequences?

Thanks! Seppe

jtamames commented 3 years ago

Hello No, sqm_reads runs a blastx search, therefore it will identify just protein matches. Best, Javier

seppedm commented 3 years ago

Thank you!!