nasa / GeneLab_Data_Processing

BulkRNASeq workflow should determine adaptor type automatically #20

Open J-81 opened 1 year ago

J-81 commented 1 year ago

Currently, the workflow user is expected to replace this value manually in the workflow module file. Instead, the adapter type should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.

DPPD Reference

https://github.com/nasa/GeneLab_Data_Processing/blob/0fe1dfd46ee662a333ac49e6013dbd82f86cb987/RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md?plain=1#L207

Workflow Reference

https://github.com/nasa/GeneLab_Data_Processing/blob/0fe1dfd46ee662a333ac49e6013dbd82f86cb987/RNAseq/Workflow_Documentation/NF_RCP-F/workflow_code/modules/quality.nf#L73-L76

J-81 commented 1 year ago

Potential route: use Trim Galore's built-in adapter auto-detection: https://github.com/FelixKrueger/TrimGalore/blob/0.6.7/Docs/Trim_Galore_User_Guide.md#adapter-auto-detection
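
For orientation, the linked user guide describes auto-detection as scanning the start of the first input file for the leading bases of the standard Illumina, small RNA, and Nextera adapters and choosing the most frequent one. The sketch below mimics that counting idea in Python. It is not Trim Galore's actual implementation; the function name `guess_adapter` and the toy reads are hypothetical and exist only for illustration.

```python
from collections import Counter

# Adapter prefixes as listed in the Trim Galore user guide.
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "smallRNA": "TGGAATTCTCGG",
    "Nextera": "CTGTCTCTTATA",
}

def guess_adapter(reads):
    """Count reads containing each adapter prefix and return the top hit.

    Hypothetical helper: a simplified stand-in for auto-detection.
    """
    counts = Counter({name: 0 for name in ADAPTERS})
    for seq in reads:
        for name, prefix in ADAPTERS.items():
            if prefix in seq:
                counts[name] += 1
    best_name, _ = counts.most_common(1)[0]
    return best_name, dict(counts)

# Toy reads, made up for illustration (not taken from GLDS-426).
reads = [
    "ACGTACGTCTGTCTCTTATACACATCT",  # contains the Nextera prefix
    "TTGACCTGTCTCTTATAGG",          # contains the Nextera prefix
    "GGCATGGAATTCTCGGAAA",          # contains the small RNA prefix
    "ACGTACGTACGTACGT",             # no adapter
]
best, counts = guess_adapter(reads)
print(best, counts)
```

A real run would stream reads from the FASTQ file rather than hold them in a list, and ties between adapters would need an explicit rule.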

J-81 commented 1 year ago

I'll try auto-detection by omitting the adapter flag, and will of course validate that the auto-detected adapter is consistent with the one a user would supply directly.
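
One way such a consistency check could be scripted is to read the `Adapter sequence:` line that Trim Galore writes to each trimming report and compare the sequences between a run with an explicit flag and a run without one. This is a minimal sketch, assuming the report line format shown in the logs in this thread; the helper name `adapter_from_report` is hypothetical.

```python
import re

# Matches lines such as:
# Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
ADAPTER_LINE = re.compile(
    r"^Adapter sequence: '(?P<seq>[ACGTN]+)' \((?P<label>[^;]+); (?P<origin>[^)]+)\)"
)

def adapter_from_report(report_text):
    """Return (sequence, label, origin) from a trimming report, or None.

    Hypothetical helper; assumes the line format seen in Trim Galore 0.6.7 logs.
    """
    for line in report_text.splitlines():
        match = ADAPTER_LINE.match(line.strip())
        if match:
            return match.group("seq"), match.group("label"), match.group("origin")
    return None

user_report = "Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)"
auto_report = "Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)"

user_seq, _, user_origin = adapter_from_report(user_report)
auto_seq, _, auto_origin = adapter_from_report(auto_report)

# The check passes when both runs trimmed with the same adapter sequence.
consistent = user_seq == auto_seq
print(consistent, user_origin, auto_origin)
```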

J-81 commented 1 year ago

Test results using GLDS-426_Truncated (known to have Nextera adapters):

CURRENT (With --illumina)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (301 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       118 (39.3%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         42,715 bp (94.9%)

With --nextera instead of --illumina

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (297 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)

With neither --nextera nor --illumina (i.e. autodetect mode)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 113). Second best hit was smallRNA (count: 16)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (310 µs/read; 0.19 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)
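
The three summaries above can also be compared programmatically. The sketch below parses the `Total reads processed` and `Reads with adapters` lines from each cutadapt summary and recomputes the adapter fraction. It assumes the summary layout shown in these logs, and the helper name `summary_counts` is hypothetical.

```python
import re

def summary_counts(report_text):
    """Return (total_reads, reads_with_adapters) from a cutadapt summary block."""
    total = re.search(r"Total reads processed:\s+([\d,]+)", report_text)
    hits = re.search(r"Reads with adapters:\s+([\d,]+)", report_text)
    if not (total and hits):
        raise ValueError("summary lines not found")
    # Counts may contain thousands separators, e.g. 45,000.
    return tuple(int(m.group(1).replace(",", "")) for m in (total, hits))

# Summary excerpts copied from the three runs above.
runs = {
    "--illumina": "Total reads processed: 300\nReads with adapters: 118 (39.3%)",
    "--nextera": "Total reads processed: 300\nReads with adapters: 233 (77.7%)",
    "autodetect": "Total reads processed: 300\nReads with adapters: 233 (77.7%)",
}

parsed = {name: summary_counts(text) for name, text in runs.items()}
for name, (total, hits) in parsed.items():
    print(f"{name}: {hits}/{total} reads with adapters ({hits / total:.1%})")

# Auto-detection reproduces the explicit --nextera run on this dataset.
autodetect_matches_nextera = parsed["autodetect"] == parsed["--nextera"]
```

On this truncated dataset the auto-detect run and the `--nextera` run report identical counts, while `--illumina` finds adapters in roughly half as many reads.
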
J-81 commented 1 year ago