h3abionet / HPCBio-Refgraph_pipeline

0 stars 6 forks source link

Trimming optimization #12

Closed yuantianhpc closed 3 years ago

yuantianhpc commented 3 years ago

For trimming with fastp, the following changes improve the results: "--detect_adapter_for_pe" allows autodetection for the trimming of pe reads. "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" removes these adapters (Illumina) which were found in the HGDP datasets. The code after these changes as below:

fastp -i PEr1.fastq -o PEr1_trim.fastq -I PEr2.fastq -O PEr2_trim.fastq --unpaired1 upr1_trim.fastq --unpaired2 upr2_trim.fastq -l 20 -q 20 --detect_adapter_for_pe --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

fastp -i SEr1.fastq -o SEr1_trim.fastq -l 20 -q 20 --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

fastp -i SEr2.fastq -o SEr2_trim.fastq -l 20 -q 20 --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

cjfields commented 3 years ago

@yuantianhpc Ah you beat me to it! Agreed, this needs to be added, we're seeing these in 1000g samples as well.

cjfields commented 3 years ago

Should also add a sliding window trim from the 3' end. This run got rid of tons of adapter from the read preps extracted from the CRAM files:

fastp --in1 $R1 --in2 $R2 \
    --out1 "test.PE.R1.trimmed.fq"  \
    --out2 "test.PE.R2.trimmed.fq" \
    --unpaired1 "test.unpR1.trimmed.fq" \
    --unpaired2 "test.unpR2.trimmed.fq" \
    --detect_adapter_for_pe  \
    -l 20 --cut_right --cut_right_window_size 3 --cut_right_mean_quality 20 \
    --thread $SLURM_NPROCS -w $SLURM_NPROCS
    --html "window_PE_fastp.html" --json "window_PE_fastp.json"

It's possible we want to parameterize this to make it a little more flexible.

EDIT: @grendon has some possible code for this from another project that would be useful, we can add this in.

grendon commented 3 years ago

what about using both in the code like this:

if you know the adaptor(s) then use these options for PE reads: --adapter_sequence= --adapter_sequence_r2=

if you do not know the adaptor(s) then use this option for PE reads: --detect_adapter_for_pe

cjfields commented 3 years ago

@grendon yes that works. We may want to add some additional screening on the backend (UniVec possibly) but we can plan that for someone at the hackathon to take on.

grendon commented 3 years ago

it makes sense

grendon commented 3 years ago

TThe trimming step has been updated. It works for producing the data for the hackathon. However, it has room for further improvement.

cjfields commented 3 years ago

I slightly tweaked the quality of the sliding window as it removed a ton of R2 reads. This may be something we need to revisit, but for now it works w/ 1000g example data