ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

wtdbg2 selects too few reads based on coverage #171

Closed · conchoecia closed this issue 4 years ago

conchoecia commented 4 years ago

Hi Ruanjue,

I've been playing around with parameters for a Sequel II dataset and noticed that wtdbg2 doesn't always choose the expected number of reads given the -g and -X parameters I pass. For example, I have a dataset with the following read-length distribution:

 Minimum     Number     Total
  Read        of       Read
 Length      Reads     Length
--------    -------  -----------
    All   7,038,370  52,025,772,883
   1 kb   6,171,079  51,676,882,569
 2.5 kb   5,499,174  50,504,370,565
   5 kb   3,765,362  43,933,658,258
  10 kb   1,581,721  28,413,461,975
  25 kb     230,852   8,386,651,406
  50 kb      22,233   1,717,409,757
 100 kb       3,517     507,051,817
 250 kb         154      43,078,836
 500 kb           1         548,263
   1 mb           0               0

I pass the parameters -g 250m and -X 200, which should select 50 Gb of data, but I also pass the option -L 5000. So my guess is that the program should select the 3,765,362 reads that are >=5000 bp, totaling 43,933,658,258 bp?
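(That is, -g 250m times -X 200 means 250,000,000 bp × 200 = 50,000,000,000 bp = 50 Gb.)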

Instead, wtdbg2 selects only 29 Gb in 2,133,212 reads. Maybe I'm missing something, but it seems like wtdbg2 is dropping a lot of reads. What do you think? I'm using wtdbg2 2.4, but I am not sure of the specific commit.

[Tue Jan 14 09:35:00 2020] loading reads
2133212 reads
[Tue Jan 14 09:39:07 2020] Done, 2133212 reads (>=5000 bp), 29384925345 bp, 113719440 bins
** PROC_STAT(0) **: real 246.908 sec, user 466.420 sec, sys 66.120 sec, maxrss 8941772.0 kB, maxvsize 9500712.0 kB
KEY PARAMETERS: -k 17 -p 0 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 250000000 -X 200.000000 -e 3 -L 5000
ruanjue commented 4 years ago

wtdbg2 -L 5000 not only filters out shorter reads, it also tries to select the longest subread for each PacBio polymerase read. Because your input data ends up as only 2133212 reads, less than 50X, wtdbg2 selected all of them.
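Roughly, the selection works like this bioawk sketch (an illustration only, not wtdbg2's actual C code; it assumes PacBio-style movie/zmw/start_end names): group reads by the movie/zmw prefix of the name and keep the longest subread per group.

# illustration only: report the longest subread per polymerase read
bioawk -cfastx '{
    split($name, a, "/"); zmw = a[1] "/" a[2];
    if (length($seq) > best[zmw]) { best[zmw] = length($seq); keep[zmw] = $name }
} END { for (z in keep) print keep[z], best[z] }' my_reads.fasta.gz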

conchoecia commented 4 years ago

Thanks for your response - that is the problem, though: I am actually giving the program more than 2133212 reads. There are 3765362 reads that are >=5 kb, and those only amount to 43 Gb of data (I requested 50 Gb via -g 250m and -X 200). So shouldn't the program use all 3765362 reads?

bioawk -cfastx '{if (length($seq) > 4999){counter +=1}} END{print(counter)}' my_reads.fasta.gz
3765362
ruanjue commented 4 years ago

From smartdenovo, we filter shorter subreads to avoid chimeras. Please use scripts/rename_fa.pl to rename the reads:

wtdbg2 -x sq .... -i <(scripts/rename_fa.pl your_reads.fa)
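For context: once the headers are rewritten to neutral IDs, the read names no longer match the PacBio movie/zmw/start_end pattern, so presumably no subread selection is triggered and every read is kept. A rough awk stand-in for the renaming (not the actual Perl script; it assumes plain, uncompressed FASTA):

# hypothetical stand-in: replace each header with a sequential neutral ID
awk '/^>/{print ">S" ++n; next}{print}' your_reads.fa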
conchoecia commented 4 years ago

That did the trick - wtdbg2 now read in all 3765362 reads, as I wanted it to. Thank you!

Just curious, what do you mean by shorter subreads, and how do you filter them? I know that the last field of a PacBio header gives the subread's coordinates within the polymerase read:

fasta header                            read length
m64069_200103_200726/1/0_6457             6457
m64069_200103_200726/3/6807_16581         9774
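So, if I understand correctly, reads sharing the middle field (the ZMW hole number) come from the same polymerase read, and the subread length is just end minus start (e.g. 16581 - 6807 = 9774)?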
ruanjue commented 4 years ago

Yes, by parsing the read name, I select the longest subread from each polymerase read.