nf-core / smrnaseq

A small-RNA sequencing analysis pipeline
https://nf-co.re/smrnaseq
MIT License
72 stars 122 forks source link

Problem with 3 and 5' clip using nextflex workflow #362

Closed matbou85 closed 1 month ago

matbou85 commented 1 month ago

Description of the bug

I am running smRNAseqon an HPC. My samples were prepared using the Nextflex library. The run was successful, please look at the code below, but the 4 bases from the 3 and 5' seem not to have been clipped.

Command used and terminal output

nextflow run:

nextflow run /path/to/pipelines/smrnaseq-2.3.1/main.nf \
-profile docker \
--input /path/to/samplesheet.csv \
--outdir /path/to/output \
--protocol 'nextflex' \
--fasta /path/to/genomes/homo_sapiens/Gencode/41/GRCh38.primary_assembly.genome.fa \
--mirna_gtf /path/to/genomes/homo_sapiens/miRBase/hsa.gff3 \
--mature /path/to/genomes/homo_sapiens/miRBase/mature.fa \
--hairpin /path/to/genomes/homo_sapiens/miRBase/hairpin.fa \
--mirtrace_species hsa

Please see the first few lines of my original fastq file:

zcat ###_L2_S1_L001_R1_001.fastq.gz | head -100
@A00556:142:HG57WDRXY:1:2101:6126:1016 1:N:0:AAGACATA
CNACGACCTGTACTGACGCTCATGCGTGCTGGAATTCTCGGGTGCCAAGGA
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:11153:1016 1:N:0:AAGACATA
ANTATTCTTGACGACCATAGAGCATTGGAACCAGCTCTGGAATTCTCGGGT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:14660:1016 1:N:0:AAGACATA
GNAAAGCAGGCACCGTGATTTATCTCACTGGAATTCTCGGGTGCCAAGGAA
+
F#F,:FFFFFF,,FFFFFFFF::FFFFF:FF,FFFFF:FFFFF:FF,FFFF
@A00556:142:HG57WDRXY:1:2101:14769:1016 1:N:0:AAGACATA
ANTGCGCGGGGTGGAGCAGCCTGTGCTGGAATTCTCGGGTGCCAAGGAACT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF
@A00556:142:HG57WDRXY:1:2101:16052:1016 1:N:0:AAGACATA
CNCTTTCAAGTAATCCAGGATAGGCTCACTTGGAATTCTCGGGTGCCAAGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

I have compared the trim results from the file found in miRDeep2 folder :

head -20 ###_L2_S1_L001.fastp.fastq
@A00556:142:HG57WDRXY:1:2101:6126:1016 1:N:0:AAGACATA
CNACGACCTGTACTGACGCTCATGCGTGC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:11153:1016 1:N:0:AAGACATA
ANTATTCTTGACGACCATAGAGCATTGGAACCAGCTC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:14660:1016 1:N:0:AAGACATA
GNAAAGCAGGCACCGTGATTTATCTCAC
+
F#F,:FFFFFF,,FFFFFFFF::FFFFF
@A00556:142:HG57WDRXY:1:2101:14769:1016 1:N:0:AAGACATA
ANTGCGCGGGGTGGAGCAGCCTGTGC
+
F#FFFFFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:16052:1016 1:N:0:AAGACATA
CNCTTTCAAGTAATCCAGGATAGGCTCACT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF

And the one found in output/mirna_quant/seqcluster/final:

zcat ###_L2_S1_L001.fastp_trimmed.fastq.gz | head -40
@seq_2_x1
CNACGACCTGTACTGACGCTCATGCGTGC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFF
@seq_3_x1
ANTATTCTTGACGACCATAGAGCATTGGAACCAGCTC
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@seq_4_x1
GNAAAGCAGGCACCGTGATTTATCTCAC
+
F#F,:FFFFFF,,FFFFFFFF::FFFFF
@seq_5_x1
ANTGCGCGGGGTGGAGCAGCCTGTGC
+
F#FFFFFFFFFFFFFFFFFFFFFFFF
@seq_6_x1
CNCTTTCAAGTAATCCAGGATAGGCTCACT
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF

Here are the results if I cut adapters and the 4 nucleotides in 3' and 5' using :

trim_galore --three_prime_clip_R1 4 --clip_R1 4 ###_L2_S1_L001_R1_001.fastq.gz
Cutadapt version: 4.9
trim_galore version: 0.6.7

zcat ###_L2_S1_L001_R1_001_trimmed.fq.gz | head -20
@A00556:142:HG57WDRXY:1:2101:6126:1016 1:N:0:AAGACATA
GACCTGTACTGACGCTCATGC
+
FFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:11153:1016 1:N:0:AAGACATA
TTCTTGACGACCATAGAGCATTGGAACCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:14660:1016 1:N:0:AAGACATA
AGCAGGCACCGTGATTTATC
+
:FFFFFF,,FFFFFFFF::F
@A00556:142:HG57WDRXY:1:2101:14769:1016 1:N:0:AAGACATA
CGCGGGGTGGAGCAGCCT
+
FFFFFFFFFFFFFFFFFF
@A00556:142:HG57WDRXY:1:2101:16052:1016 1:N:0:AAGACATA
TTCAAGTAATCCAGGATAGGCT
+
FFFFFFFFFFFFFFFFFFFFFF

The log file from fastp seems to run this command :

fastp --in1 ###-E12.fastq.gz --out1 ###-E12.fastp.fastq.gz --thread 6 --json ###-E12.fastp.json --html ###-E12.fastp.html --adapter_fasta known_adapters.fa -l 17 --max_len1 40

I mainly wonder if four nucleotides are cut from the 3' and 5' end of the RNA as these should be removed according to the Nextflex pipeline. I have tested with an old version 1.1.0 and the 4 nucleotides from 3' and 5' were cut using the parameter --protocol 'nextflex'.



### Relevant files

_No response_

### System information

Nextflow version: 24.04.3.5916
Hardware: Desktop from Docker
OS: Linux Ubuntu
lpantano commented 1 month ago

Follow up on this. Protocol doesn't work, you need to set up each parameter specifically.

But still, the clipping at 3p happens before adapter removal, so not useful. You could trim before using the pipeline and then skip trimming if you want to use correctly.

apeltzer commented 1 month ago

Should be fine now and addressed in #364 - please give dev a try 👍🏻