wtdbg2 -L 5000
not only filters out shorter reads, but also tries to select the longest subread from each PacBio polymerase read. Because your input data is only 2,133,212 reads, less than 50X, wtdbg2 selected all of them.
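For anyone curious, here is a minimal sketch of that selection, written with bioawk as used elsewhere in this thread. This is only an illustration of the idea, not wtdbg2's actual code, and it assumes PacBio-style headers of the form movie/zmw/start_end:

bioawk -c fastx '{
    split($name, a, "/")            # e.g. m64069_200103_200726/1/0_6457
    zmw = a[1] "/" a[2]             # movie + ZMW identifies the polymerase read
    if (length($seq) > best[zmw]) { # keep only the longest subread per ZMW
        best[zmw] = length($seq)
        keep[zmw] = $name
    }
} END {
    for (z in keep) print keep[z]   # names of the selected subreads
}' my_reads.fasta.gz > longest_subreads.txt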
Thanks for your response. That is the problem, though: I am actually giving the program more than 2,133,212 reads. There are 3,765,362 reads that are >=5 kb, and that constitutes only 43 Gb of data (I requested 50 Gb by using -g 250m and -X 200). So shouldn't the program use all 3,765,362 reads?
bioawk -c fastx '{if (length($seq) >= 5000) counter += 1} END {print counter}' my_reads.fasta.gz
3765362
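The same kind of bioawk pass can also report the total bases in those reads, which is where the ~43 Gb figure above comes from. This is just an extension of the command above, nothing wtdbg2-specific:

bioawk -c fastx 'length($seq) >= 5000 {n += 1; bp += length($seq)} END {print n, bp}' my_reads.fasta.gz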
This filtering comes from smartdenovo: we filter shorter subreads to avoid chimeras. Please use scripts/rename_fa.pl to rename the reads:
wtdbg2 -x sq .... -i <(scripts/rename_fa.pl your_reads.fa)
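If you just want to see what the renaming accomplishes, here is a rough one-line stand-in. I am assuming rename_fa.pl simply replaces the PacBio-style names with plain sequential ones, so that wtdbg2 no longer recognizes subreads of the same polymerase read; check the script itself for its exact output format:

bioawk -c fastx '{i += 1; print ">read" i; print $seq}' my_reads.fasta.gz > renamed.fa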
That did the trick: wtdbg2 now reads in all 3,765,362 reads as I wanted it to. Thank you!
Just curious: what do you mean by shorter subreads, and how do you filter them? I know that the last field of a PacBio header gives the subread's coordinates within the polymerase read, so the subread length is end minus start (e.g., 16581 - 6807 = 9774):
fasta header                          read length
m64069_200103_200726/1/0_6457         6457
m64069_200103_200726/3/6807_16581     9774
Yes. By parsing the read name, I select the longest subread from each polymerase read.
Hi Ruanjue,
I've been playing around with parameters for a Sequel II dataset and noticed that wtdbg2 doesn't always choose the expected number of reads based on the -g and -X parameters I pass. For example, I have a dataset with the following read-length distribution properties:
I pass the parameters -g 250m and -X 200. This should select 50 Gb of data (250 Mb x 200), but I also pass the option -L 5000. So my guess is that the program should select the 3,765,362 reads that are >=5000 bp, totaling 43,933,658,258 bp. Instead, wtdbg2 selects 29 Gb in 2,133,212 reads. Maybe I'm missing something, but it seems like wtdbg2 is missing a lot of reads. What do you think? I'm using wtdbg2 2.4, but I am not sure of the specific commit.