nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
824 stars 681 forks source link

Alternative trimming step for polyA/T removal #663

Closed Reda94 closed 1 year ago

Reda94 commented 3 years ago

Hello, it looks like TrimGalore does not automatically perform polyA/T removal (see post-trimming fastqc screenshot below) ). They have an experimental option to do so (https://github.com/FelixKrueger/TrimGalore/blob/e9b8fd847f4da01fa3b886d134bc2ecd447a8068/trim_galore#L3230-L3257) but it would require running the nf-core/rnaseq pipeline twice (https://github.com/FelixKrueger/TrimGalore/blob/e9b8fd847f4da01fa3b886d134bc2ecd447a8068/trim_galore#L3248-L3254) (@drpatelh).

Would be great to have e.g. bbduk (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/) as it allows simultaneous base quality, adapter and polyA/T trimming.

Thanks!

Capture d’écran 2021-06-25 à 11 17 43

FelixKrueger commented 3 years ago

To just quickly chip in here. It is correct that the option --polyA in Trim Galore is designed to work with an RNA kit which specifically produces reads from polyA labelled transcripts (Thermo Fisher, Collibri) https://www.thermofisher.com/uk/en/home/life-science/sequencing/next-generation-sequencing/ngs-library-preparation-illumina-systems/collibri-stranded-rna-library-prep-kits.html?gclid=EAIaIQobChMIv7mB-tXn5QIVibHtCh3dTg7gEAAYASAAEgI7KPD_BwE&ef_id=EAIaIQobChMIv7mB-tXn5QIVibHtCh3dTg7gEAAYASAAEgI7KPD_BwE:G:s&s_kwcid=AL!3652!3!303512728519!p!!g!!collibri.

Standard removal of PolyA sequences could be accomplished using e.g. -a {A}10.

Trim Galore is does indeed not trim several things at the same time, and intentionally so. One could probably argue that you for the sequences you showed above, PolyA/PolyT sequence would remove all full homo-polymers entirely, and would truncate the PolyT sequences to a few nucleotides (certainly <15), which are either removed by a size-selection step, or will not map (uniquely). So in other words: these sequences are useless from a library point of view.

In your case, mapping will have the exact same effect though: PolyX sequences will probably either not map properly in the first place, or be soft-clipped (and then not map properly). The result should be pretty much the same: These sequences will not result in useful RNA-seq counts.

By all means, add another trimming option for the pipeline, but I don't think that it'll have a noteworthy advantage over ignoring PolyX sequences (and let the aligner take care of it), or even running an adapter trimming process first, and then using that output as the input for PolyA/T trimming (which should be just a matter of copy paste with DSL2).

It is still good to know if you have these contaminants in your library, as they may dramatically impact the mapping efficiency. You could add these sequences to a custom contaminants.txt file, and then run:

fastqc -a contaminants.txt *fastq.gz

to see the extent of PolyX sequences in your library.

Happy to discuss :) Felix

Illumina Universal Adapter          AGATCGGAAGAG
Illumina Small RNA Adapter          ATGGAATTCTCG
Nextera Transposase Sequence                CTGTCTCTTAT
#SOLiD Small RNA Adapter            CGCCTTGGCCGTA
Illumina Paired End PCR Primer 2        CTACACGACGC
PolyA                       AAAAAAAAAAAA
PolyT                       TTTTTTTTTTTT
PolyG                       GGGGGGGGGGGG
PolyC                       CCCCCCCCCCCC
drpatelh commented 3 years ago

Thank you for the detailed info @FelixKrueger ! I agree that most of these sequences won't align but I would personally prefer for these to removed at the "adapter" trimming step because it gives you an obvious assessment as to the fact that you may have such artifacts in the library. Trying to figure out why things haven't mapped can be quite painful.

I didn't know you could use -a {A}10 like that. Neat! In which case, you may be able to just tweak the pipeline options and run it once. Can you create a file called custom.config with the contents below @Reda94 and add -c custom.config to the command you are running?

params {
  modules {
    'trimgalore' {
      args = '--fastqc -a {A}10'
    }
  }
}

So would you need an additional -a {T}10 for polyT's @FelixKrueger or will the reverse complement be auto-trimmed?

FelixKrueger commented 3 years ago

Hi Harshil,

yes, args = '--fastqc -a {A}10' should work straight out of the box, but again, this will not be on top the adapter (auto-detection) and trimming, but this will be used as the adapter sequence itself. And if you look for -a AAAAAAAAAA it will trim - drumroll - PolyA! We are deliberately trying to keep things simple, and not do any obscure and undocumented reverse complement trimming as well...

drpatelh commented 3 years ago

Ah, I see! So it won't trim the conventional adapters if you use args = '--fastqc -a {A}10' in which case you would have to perform the adapter trimming twice anyway. Ok, maybe not the appropriate solution then unless you run the pipeline twice or if you are prepared to let the mapping take care of this.

FelixKrueger commented 3 years ago

It wouldn't necessarily be the entire pipeline twice, but just add an extra trimming step. So something like the nf-core equivalent of:


TRIM_GALORE                     (file_ch, params.outdir, params.trim_galore_args, params.verbose)
if (PolyX_trimming_will_make_my_day){
     params.trim_galore += " -a {10} "
     TRIM_GALORE                     (TRIM_GALORE.out.reads, params.outdir, params.trim_galore_args, params.verbose)
}

I would also split out FastQC as it can then be run in parallel rather then as part of Trim Galore:

FASTQC2                         (TRIM_GALORE.out.reads, params.outdir, params.fastqc_args, params.verbose)
FelixKrueger commented 3 years ago

but yea, in the end the results will be pretty much the same however you implement it :P

drpatelh commented 3 years ago

Sorry, I meant running only the trimming (and other mandatory) steps first time around and then running the pipeline full blown with either --polyA or -a {10} the second time around. Running the trimming twice isn't ideal 😅 Just out of interest why do you need to run the trimming twice for the polyA stuff?

FelixKrueger commented 3 years ago

For --polyA with the Collibri kit, it was actually a little complicated. But the bottom line was that you need to remove poor qualities and sequencing adapters first, and then as a second step find and remove poly-T sequences (I believe it was different for R1 and R2 as well, and write the found PolyT bases into the read IDs of both reads, and then use only one read for mapping (and use the readID for downstream processing). This protocol was specifically designed to identify alternative transcription end and PolyA initiation sites, and yea sequences with tons of AAAAA or TTTT are not exactly great for anything really...

Regarding your earlier quote:

Trying to figure out why things haven't mapped can be quite painful.

This is certainly true, but if you are running pipelines that will magically do lots of things at the same time you might just look at the mapping efficiency, see that it's great and move on - and potentially miss that you had 50% of sequences lost because of a technical issue that results in homo-polymers (which you should probably try and address on the wet lab side...)

drpatelh commented 3 years ago

you might just look at the mapping efficiency, see that it's great and move on

Good point! We also do have the cutadapt logs in the MultiQC report where this sort of info is easily accessible. But I guess it will come down to what is actually looked at further e.g. users may not even look at the mapping rates 🤷🏽 Another advantage of trimming beforehand is that there will most likely be a diversity of sequences with a variable count of these polyA bases - assuming that the aligner will handle these appropriately could be a little dangerous and even if it may be minimal this could actually impact the quantification.

tomsing1 commented 1 year ago

Just because this came up in a thread on slack: STAR can clip 3' polyA sequences, too, e.g. by adding the following STAR argument: --clip3pAdapterSeq AAAAAAAA

drpatelh commented 1 year ago

Added fastp support in https://github.com/nf-core/rnaseq/pull/970 which will allow you to use --trimmer fastp --extra_fastp_args '--trim_poly_x' to trim the polyA tails by default along with any adapter sequences.