zyndagj closed this issue 8 years ago
Since you use `-t16`, all k-mers occurring more than 16 times in a block will be filtered out and not used as seeds for overlaps. When the block sizes are different, there will be some small discrepancies, since the k-mer count is defined within each block.
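As a toy illustration of that per-block counting (not daligner's actual data structures; the cutoff of 3 and the tiny reads are made up for the example), the same reads can yield different surviving seed sets depending on how they are blocked:

```python
from collections import Counter

def kmers(seq, k=4):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kept_seeds(reads, k=4, t=3):
    # Count each k-mer within the block; k-mers reaching the
    # cutoff t are dropped and never used as alignment seeds.
    counts = Counter(km for r in reads for km in kmers(r, k))
    return {km for km, n in counts.items() if n < t}

reads = ["ACGTACGTACGT", "TTTTACGTAAAA"]

one_block  = kept_seeds(reads)                              # one big block
two_blocks = kept_seeds(reads[:1]) | kept_seeds(reads[1:])  # two small blocks

# A k-mer can stay under the cutoff in each small block yet
# exceed it when the blocks are merged, so seed sets differ.
print(sorted(two_blocks - one_block))  # ['ACGT', 'CGTA', 'TACG']
```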
Ah, that finally makes sense. For some reason I was interpreting `-t` on a per-read basis. The fact that it is a static count rather than a proportion makes it harder to automatically tune parameters for deeper sequencing or genomes of different sizes.
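If one wanted a proportional threshold instead, a purely hypothetical heuristic would be to scale `-t` with the block size; `scaled_t` below is invented for illustration and is not a daligner or FALCON option:

```python
def scaled_t(base_t, base_block, block):
    # Hypothetical heuristic: scale the count threshold with the
    # block size so the implied frequency cutoff stays constant.
    return max(4, round(base_t * block / base_block))

# e.g. carrying a -t16 chosen at -s200 over to the other splits:
print([scaled_t(16, 200, s) for s in (200, 250, 300)])  # [16, 20, 24]
```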
Thanks, Greg
Greg,
Jason's answer is correct: in a bigger block, `-t` will likely put more k-mers into the "cutoff" category. But even without `-t`, running with a fixed `-M` can have the same effect, since with `-M` a cutoff is implicitly computed as a function of the available memory: with bigger blocks, more k-mers need to be cut off in order to stay within a fixed memory requirement.
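That implicit cutoff can be sketched with a toy model (an assumed cost of one memory unit per k-mer instance, and made-up counts; this is not daligner's actual accounting):

```python
from collections import Counter

def implicit_cutoff(counts, budget):
    # Toy model of the -M behaviour described above: lower the
    # cutoff until the surviving k-mer instances fit in memory.
    t = max(counts.values()) + 1
    while t > 1 and sum(n for n in counts.values() if n < t) > budget:
        t -= 1
    return t

def filtered_fraction(counts, budget):
    t = implicit_cutoff(counts, budget)
    return sum(1 for n in counts.values() if n >= t) / len(counts)

# Same four k-mers; the "big" block sees each one twice as often.
small_block = Counter({"ACGT": 5, "CGTA": 3, "TTTT": 2, "GGGG": 1})
big_block   = Counter({"ACGT": 10, "CGTA": 6, "TTTT": 4, "GGGG": 2})

# With a fixed memory budget, the bigger block must discard a
# larger share of its distinct k-mers to stay within memory.
print(filtered_fraction(small_block, budget=10))  # 0.25 (1 of 4 k-mers cut)
print(filtered_fraction(big_block, budget=10))    # 0.5  (2 of 4 k-mers cut)
```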
-- Gene
On 2/19/16, 8:53 PM, Greg Zynda wrote:
Hello,
I have been running FALCON and noticed that the quality of my assembly varied with my chosen block size (`-s`). Using block sizes of 200, 250, and 300, I started my investigation with the databases after partitioning and trimming with DBsplit.
Input data:
- https://www.dropbox.com/s/tb78i5i3nrvm6rg/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads.fasta
- https://www.dropbox.com/s/v6wwpn40gedj470/m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads.fasta
s200
```
$ DBsplit -x500 -s200 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         5
size =  200000000
cutoff =       500
all = 0
         0         0
     21809     18759
     43137     37294
     65145     56188
     87360     75255
    105451     90747

$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more
      18,759 reads out of 21,809 ( 86.0%)

$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more
      18,535 reads out of 21,328 ( 86.9%)

$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more
      18,894 reads out of 22,008 ( 85.9%)

$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more
      19,067 reads out of 22,215 ( 85.8%)

$ DBstats raw_reads.5
Statistics for all reads of length 500 bases or more
      15,492 reads out of 18,091 ( 85.6%)
```
s250
```
$ DBsplit -x500 -s250 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size =  250000000
cutoff =       500
all = 0
         0         0
     27121     23399
     54026     46673
     81766     70463
    105451     90747

$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more
      23,399 reads out of 27,121 ( 86.3%)

$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more
      23,274 reads out of 26,905 ( 86.5%)

$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more
      23,790 reads out of 27,740 ( 85.8%)

$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more
      20,284 reads out of 23,685 ( 85.6%)
```
s300
```
$ DBsplit -x500 -s300 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size =  300000000
cutoff =       500
all = 0
         0         0
     32534     28089
     65144     56187
     98318     84642
    105451     90747

$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more
      28,089 reads out of 32,534 ( 86.3%)

$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more
      28,098 reads out of 32,610 ( 86.2%)

$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more
      28,455 reads out of 33,174 ( 85.8%)

$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more
      6,105 reads out of 7,133 ( 85.6%)
```
DBsplit Summary
| -s  | # Blocks | Total Reads |
|-----|----------|-------------|
| 200 | 5        | 90,747      |
| 250 | 4        | 90,747      |
| 300 | 4        | 90,747      |
Since the total number of reads was consistent between the split databases, I moved on to look at the alignments produced by HPCdaligner.
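That consistency check can be reproduced directly from the per-block DBstats counts quoted above:

```python
# Trimmed read counts per block, taken from the DBstats output above.
blocks = {
    200: [18759, 18535, 18894, 19067, 15492],
    250: [23399, 23274, 23790, 20284],
    300: [28089, 28098, 28455, 6105],
}

# Each split partitions the same trimmed read set, so the sums agree.
totals = {s: sum(counts) for s, counts in blocks.items()}
print(totals)  # {200: 90747, 250: 90747, 300: 90747}
```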
```
HPCdaligner -M20 -dal128 -t16 -e0.7 -l1000 -s500 -h35 -H12000 raw_reads
```
s200

```
$ ls
raw_reads.1.las raw_reads.2.las raw_reads.3.las raw_reads.4.las raw_reads.5.las
$ LAcat raw_reads > s200.las
$ LAshow ../raw_reads s200 | head -n 2 | tail -n 1
s200: 24,540,600 records
```

s250

```
$ ls
raw_reads.1.las raw_reads.2.las raw_reads.3.las raw_reads.4.las
$ LAcat raw_reads > s250.las
$ LAshow ../raw_reads s250.las | head -n 2 | tail -n 1
s250: 23,945,795 records
```

s300

```
$ ls
raw_reads.1.las raw_reads.2.las raw_reads.3.las raw_reads.4.las
$ LAcat raw_reads > s300.las
$ LAshow ../raw_reads s300.las | head -n 2 | tail -n 1
s300: 22,533,518 records
```

Do you know why changing the block size would lead to such differences in the total number of yielded alignments?
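For scale, the differences in the LAshow record counts above work out to a few percent of the s200 total:

```python
# Record counts reported by LAshow for each split, from the output above.
records = {200: 24_540_600, 250: 23_945_795, 300: 22_533_518}

base = records[200]
for s, n in sorted(records.items()):
    # s300 ends up roughly 8% smaller than s200.
    print(f"s{s}: {(base - n) / base:.1%} fewer records than s200")
```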
Thanks, Greg Zynda