ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0
513 stars 94 forks source link

kbm2 alignments and wtdbg2 #234

Closed joanam closed 3 years ago

joanam commented 3 years ago

Hi,

As I cannot provide enough memory on a single HPC node, I am following your recommendation to split up the PacBio reads into subfiles and align them against each other with kbm. This works well but if I try to input the kbm2 alignments into wtdbg2, it complains that it cannot find a specific read (see error message below) and outputs 0 nodes. This specific read is found both in the alignments and in the fastq reads file. I am thus not sure why it cannot find this read.

-- total memory      394465808.0 kB
-- available         377611720.0 kB
-- 32 cores
-- Starting program: wtdbg2 -x sq --tidy-reads 5000 -g 400m -t 32 -L 5000 -S 5 -i Mech.mess.mm15-0374.mother.6mio-subreads.fastq.gz -
e 5 --aln-noskip --load-alignments Mech.mess.mm15-0374.mother.6mio.kbmap.multi -o Mech.mess.mm15-0374.mother.6mio.multi --no-read-le
ngth-sort
-- pid                     48297
-- date         Tue May 11 16:56:16 2021
--
[Tue May 11 16:56:16 2021] loading reads
[Tue May 11 17:00:16 2021] Done, 935391 reads (>=5000 bp), 18408359804 bp, 71441322 bins
** PROC_STAT(0) **: real 240.229 sec, user 410.770 sec, sys 27.090 sec, maxrss 5689952.0 kB, maxvsize 6613744.0 kB
KEY PARAMETERS: -k 15 -p 0 -K 1000.049988 -A -S 5.000000 -s 0.050000 -g 400000000 -X 50.000000 -e 5 -L 5000
[Tue May 11 17:00:16 2021] loading alignments from "Mech.mess.mm15-0374.mother.6mio.kbmapC.multi"
 -- Cannot find read "m64094_201016_183743/25/0_6993" in LINE:1 in load_alignments_core -- wtdbg.h:1263 --
 -- Cannot find read "m64094_201016_183743/25/0_6993" in LINE:2 in load_alignments_core -- wtdbg.h:1263 --
 -- Cannot find read "m64094_201016_183743/25/0_6993" in LINE:3 in load_alignments_core -- wtdbg.h:1263 --
 -- Cannot find read "m64094_201016_183743/25/0_6993" in LINE:4 in load_alignments_core -- wtdbg.h:1263 --
(...)
[Tue May 11 17:12:46 2021] sorting rdhits ... Done
[Tue May 11 17:12:46 2021] clipping ... 100.00% bases
[Tue May 11 17:12:47 2021] generating regs ... 0
[Tue May 11 17:12:47 2021] sorting regs ...  Done
[Tue May 11 17:12:47 2021] generating intervals ...  0 intervals
[Tue May 11 17:12:47 2021] selecting important intervals from 0 intervals
[Tue May 11 17:12:47 2021] Intervals: kept 0, discarded 0
** PROC_STAT(0) **: real 991.187 sec, user 1091.780 sec, sys 191.790 sec, maxrss 5749756.0 kB, maxvsize 8770916.0 kB
[Tue May 11 17:12:47 2021] Done, 0 nodes

The read "m64094_201016_183743/25/0_6993" is found in the alignments: grep m64094_201016_183743/25/0_6993 $prefix.kbmap.multi | head -1 m64094_201016_183743/25/0_6993 + 6912 512 6912 m64094_201016_183743/9307191/16896_23094 + 6144 0 6144 1081 6144 113 5 5MI2MD5MIMd2m2MIdmMI2M

and also in the fastq file: zcat $prefix.fastq.gz | grep m64094_201016_183743/25/0_6993 -A 1 | cut -c 1-100 @m64094_201016_183743/9307191/16896_23094 ACACATCTCGTGAGAGTAATATCTGAATCCCAGTTATATTGGCTGATCAGTATGAAAACAACACAAAACATAATGTTTAGTAAATAATATAAATTATAAT

Here the code I used:

Split the fastq file into six subfiles of 1 million reads each

split --additional-suffix=".fastq" -l 4000000 -a 1 \
 <(zcat ${prefix}-subreads.fastq.gz) $prefix.
gzip *fastq

Run kbm2 in an array script with all 36 file combinations

kbm2 -t 24 -d ${prefix}.$i.fastq.gz -i ${prefix}.$j.fastq.gz \
  -o ${prefix}.alignments/${prefix}.kbmap.$i.$j -c

Concatenate all alignments removing self-alignments

cat ${prefix}.alignments/${prefix}.kbmap* | awk '$1 != $6' > ${prefix}.kbmap.multi

Run wtdbg2 using these alignments

wtdbg2 -x sq --tidy-reads 5000 -g 400m -t 32 -L 5000 -S 5 \
 -i ${prefix}-subreads.fastq.gz -e 5 --aln-noskip \
 --load-alignments ${prefix}.kbmap.multi \
 -o ${prefix}.multi --no-read-length-sort

Any idea what might be going wrong here? Any help would be highly appreciated.

Best wishes, Joana

ruanjue commented 3 years ago

The problem should be caused by rename or subsampling. 1, if your sequencing coverge is higher than 50X, please modify the default option -X 50. 2, Maybe wtdbg2 have problen with the option -L 5000, wtdbg2 rename reads, so that it cannot find read name of alignment in its sequences.

joanam commented 3 years ago

Many thanks for the suggestions. I have now filtered out all reads shorter than 5 kb and rerun kbm2 and wtdbg2 with -X 0 and -L 0. It works well now with the following code:

# Split the reads into multiple subfiles
split --additional-suffix=".fastq" -l 4000000 -a 1 \
 <(zcat ${prefix}-subreads.fastq.gz) $prefix.
gzip *fastq

# Run kbm2 with all subfile combinations separately (e.g. in an array script)
kbm2 -t 24 -d ${prefix}.$i.fastq.gz -i ${prefix}.$j.fastq.gz \
  -o ${prefix}.alignments/${prefix}.kbmap.$i.$j -c

# Concatenate all alignments except self-alignments
cat ${prefix}.alignments/${prefix}.kbmap* | awk '$1 != $6' | \
 gzip > ${prefix}.kbmap.alignments.gz

# Run wtdbg2 with kbm2 alignments
wtdbg2 -x sq -g 400m -t 32 -S 5 -L 0 \
 -i ${prefix}.fastq.gz -e 5 --aln-noskip -X 0 \
 --load-alignments ${prefix}.kbmap.alignments.gz \
 -o ${prefix}.multi --no-read-length-sort

I hope this is useful for others who cannot provide enough memory for standard wtdbg2 runs, i.e. who struggle with out of memory errors.

ruanjue commented 3 years ago

Thanks