Closed by joanam 3 years ago
The problem is probably caused by read renaming or subsampling.
1. If your sequencing coverage is higher than 50X, please change the default option -X 50.
2. wtdbg2 may have a problem with the option -L 5000: wtdbg2 renames reads, so it cannot find the read names from the alignments among its sequences.
Many thanks for the suggestions. I have now filtered out all reads shorter than 5 kb and rerun kbm2 and wtdbg2 with -X 0 and -L 0. It works well now with the following code:
# Split the reads into multiple subfiles
split --additional-suffix=".fastq" -l 4000000 -a 1 \
<(zcat ${prefix}-subreads.fastq.gz) $prefix.
gzip *fastq
# Run kbm2 with all subfile combinations separately (e.g. in an array script)
kbm2 -t 24 -d ${prefix}.$i.fastq.gz -i ${prefix}.$j.fastq.gz \
-o ${prefix}.alignments/${prefix}.kbmap.$i.$j -c
# Concatenate all alignments except self-alignments
cat ${prefix}.alignments/${prefix}.kbmap* | awk '$1 != $6' | \
gzip > ${prefix}.kbmap.alignments.gz
# Run wtdbg2 with kbm2 alignments
wtdbg2 -x sq -g 400m -t 32 -S 5 -L 0 \
-i ${prefix}.fastq.gz -e 5 --aln-noskip -X 0 \
--load-alignments ${prefix}.kbmap.alignments.gz \
-o ${prefix}.multi --no-read-length-sort
I hope this is useful for others who cannot provide enough memory for standard wtdbg2 runs, i.e. who struggle with out-of-memory errors.
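The 5 kb length filter mentioned above is not shown in the posted commands. A minimal sketch of one way to do it with awk follows; the function name `filter_fastq_by_length` is hypothetical, and it assumes plain 4-line FASTQ records (no line-wrapped sequences). It is not necessarily how the reads were filtered in this thread.

```shell
# Keep only FASTQ records whose sequence has at least min_len bases.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
filter_fastq_by_length() {
    local min_len=$1
    awk -v min="$min_len" '
        NR % 4 == 1 { header = $0 }   # @read_name line
        NR % 4 == 2 { bases = $0 }    # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line completes the record
            if (length(bases) >= min) {
                print header
                print bases
                print plus
                print $0
            }
        }
    '
}

# Toy input: a 4-base read and an 8-base read, filtered with a 5-base cutoff.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' | filter_fastq_by_length 5
```

On real data this could be used as `zcat reads.fastq.gz | filter_fastq_by_length 5000 | gzip > reads.min5kb.fastq.gz`, where both file names are placeholders.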
Thanks
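The "all subfile combinations" step in the commands above can be sketched as a dry run that only prints one kbm2 command per ordered pair of subfiles, self-pairs included (six subfiles give 36 commands). The function name `list_kbm2_jobs` is hypothetical; the kbm2 options are copied from the command posted above, and the suffixes a, b, c, ... are what `split -a 1` produces.

```shell
# Print one kbm2 command per ordered pair (i, j) of subfile suffixes.
# Nothing is executed; the output is meant to be reviewed or fed to a scheduler.
list_kbm2_jobs() {
    local prefix=$1
    shift
    local i j
    for i in "$@"; do
        for j in "$@"; do
            printf 'kbm2 -t 24 -d %s.%s.fastq.gz -i %s.%s.fastq.gz -o %s.alignments/%s.kbmap.%s.%s -c\n' \
                "$prefix" "$i" "$prefix" "$j" "$prefix" "$prefix" "$i" "$j"
        done
    done
}

# Example: two subfiles give four commands (a/a, a/b, b/a, b/b).
list_kbm2_jobs sample a b
```

In an array script, each task could pick its own line from this list by line number, for example with `sed -n "${TASK_ID}p"`, where `TASK_ID` stands in for whatever array index variable the scheduler provides.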
Hi,
As I cannot provide enough memory on a single HPC node, I am following your recommendation to split the PacBio reads into subfiles and align them against each other with kbm2. This works well, but when I feed the kbm2 alignments into wtdbg2, it complains that it cannot find a specific read (see error message below) and outputs 0 nodes. This specific read is found in both the alignments and the fastq reads file, so I am not sure why wtdbg2 cannot find it.
The read "m64094_201016_183743/25/0_6993" is found in the alignments:
grep m64094_201016_183743/25/0_6993 $prefix.kbmap.multi | head -1
m64094_201016_183743/25/0_6993 + 6912 512 6912 m64094_201016_183743/9307191/16896_23094 + 6144 0 6144 1081 6144 113 5 5MI2MD5MIMd2m2MIdmMI2M
and also in the fastq file:
zcat $prefix.fastq.gz | grep m64094_201016_183743/25/0_6993 -A 1 | cut -c 1-100
@m64094_201016_183743/9307191/16896_23094
ACACATCTCGTGAGAGTAATATCTGAATCCCAGTTATATTGGCTGATCAGTATGAAAACAACACAAAACATAATGTTTAGTAAATAATATAAATTATAAT
Here is the code I used:
Split the fastq file into six subfiles of 1 million reads each
Run kbm2 in an array script with all 36 file combinations
Concatenate all alignments removing self-alignments
Run wtdbg2 using these alignments
Any idea what might be going wrong here? Any help would be highly appreciated.
Best wishes, Joana
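The single-read grep check in the question can be extended to every read at once. The sketch below lists read names that appear in the alignments but not in the FASTQ headers; the function name `missing_read_names` is hypothetical. It assumes the read names sit in columns 1 and 6 of the kbm2 output (as in the alignment line and the `awk '$1 != $6'` filter shown in this thread), 4-line FASTQ records, and that the read name is the first whitespace-delimited token of each header.

```shell
# Print read names found in the alignment file but absent from the FASTQ file.
# Assumptions: names are in alignment columns 1 and 6; FASTQ records are 4 lines.
missing_read_names() {
    local alignments=$1 fastq=$2
    comm -23 \
        <(awk '{ print $1; print $6 }' "$alignments" | sort -u) \
        <(awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$fastq" | sort -u)
}
```

For gzipped inputs, process substitution can supply the decompressed streams, e.g. `missing_read_names <(zcat alignments.gz) <(zcat reads.fastq.gz)` with placeholder file names. An empty result suggests the names themselves match, which would point toward reads being dropped or renamed while wtdbg2 loads them, as suggested in the reply at the top of the thread.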