Closed yuantianhpc closed 3 years ago
@yuantianhpc Ah you beat me to it! Agreed, this needs to be added, we're seeing these in 1000g samples as well.
Should also add a sliding window trim from the 3' end. This run got rid of tons of adapter from the read preps extracted from the CRAM files:
fastp --in1 $R1 --in2 $R2 \
--out1 "test.PE.R1.trimmed.fq" \
--out2 "test.PE.R2.trimmed.fq" \
--unpaired1 "test.unpR1.trimmed.fq" \
--unpaired2 "test.unpR2.trimmed.fq" \
--detect_adapter_for_pe \
-l 20 --cut_right --cut_right_window_size 3 --cut_right_mean_quality 20 \
--thread $SLURM_NPROCS -w $SLURM_NPROCS
--html "window_PE_fastp.html" --json "window_PE_fastp.json"
It's possible we want to parameterize this to make it a little more flexible.
EDIT: @grendon has some possible code for this from another project that would be useful, we can add this in.
what about using both in the code like this:
if you know the adaptor(s) then use these options for PE reads: --adapter_sequence=
if you do not know the adaptor(s) then use this option for PE reads: --detect_adapter_for_pe
@grendon yes that works. We may want to add some additional screening on the backend (UniVec possibly) but we can plan that for someone at the hackathon to take on.
it makes sense
TThe trimming step has been updated. It works for producing the data for the hackathon. However, it has room for further improvement.
I slightly tweaked the quality of the sliding window as it removed a ton of R2 reads. This may be something we need to revisit, but for now it works w/ 1000g example data
For trimming with fastp, the following changes improve the results: "--detect_adapter_for_pe" allows autodetection for the trimming of pe reads. "--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" removes these adapters (Illumina) which were found in the HGDP datasets. The code after these changes as below:
fastp -i PEr1.fastq -o PEr1_trim.fastq -I PEr2.fastq -O PEr2_trim.fastq --unpaired1 upr1_trim.fastq --unpaired2 upr2_trim.fastq -l 20 -q 20 --detect_adapter_for_pe --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
fastp -i SEr1.fastq -o SEr1_trim.fastq -l 20 -q 20 --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
fastp -i SEr2.fastq -o SEr2_trim.fastq -l 20 -q 20 --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT