thegenemyers / DALIGNER

Find all significant local alignments between reads

Alignments vary with block size #36

Closed: zyndagj closed this issue 8 years ago

zyndagj commented 8 years ago

Hello,

I have been running FALCON and noticed that the quality of my assembly varied with my chosen block size (-s). Using block sizes of 200, 250, and 300, I started my investigation with the databases after partitioning and trimming with DBsplit.

Input data:

$ DBsplit -x500 -s200 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         5
size = 200000000 cutoff =       500 all = 0
         0         0
     21809     18759
     43137     37294
     65145     56188
     87360     75255
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         18,759 reads        out of          21,809  ( 86.0%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         18,535 reads        out of          21,328  ( 86.9%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         18,894 reads        out of          22,008  ( 85.9%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

         19,067 reads        out of          22,215  ( 85.8%)
$ DBstats raw_reads.5
Statistics for all reads of length 500 bases or more

         15,492 reads        out of          18,091  ( 85.6%)

s250

$ DBsplit -x500 -s250 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size = 250000000 cutoff =       500 all = 0
         0         0
     27121     23399
     54026     46673
     81766     70463
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         23,399 reads        out of          27,121  ( 86.3%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         23,274 reads        out of          26,905  ( 86.5%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         23,790 reads        out of          27,740  ( 85.8%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

         20,284 reads        out of          23,685  ( 85.6%)

s300

$ DBsplit -x500 -s300 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size = 300000000 cutoff =       500 all = 0
         0         0
     32534     28089
     65144     56187
     98318     84642
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         28,089 reads        out of          32,534  ( 86.3%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         28,098 reads        out of          32,610  ( 86.2%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         28,455 reads        out of          33,174  ( 85.8%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

          6,105 reads        out of           7,133  ( 85.6%)

DBsplit Summary

-s     # Blocks    Total Reads
200    5           90,747
250    4           90,747
300    4           90,747
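The block boundaries printed by `cat raw_reads.db` are cumulative pairs of (all reads, trimmed reads), so the per-block counts reported by DBstats can be recovered by differencing consecutive rows. A small sketch using the -s200 table copied from the output above:

```python
# Cumulative (all_reads, trimmed_reads) rows from `cat raw_reads.db` (-s200).
s200 = [(0, 0), (21809, 18759), (43137, 37294),
        (65145, 56188), (87360, 75255), (105451, 90747)]

def per_block(cumulative):
    """Difference consecutive cumulative rows to get per-block counts."""
    return [(a2 - a1, t2 - t1)
            for (a1, t1), (a2, t2) in zip(cumulative, cumulative[1:])]

blocks = per_block(s200)
print(blocks[1])                   # (21328, 18535), matching DBstats raw_reads.2
print(sum(t for _, t in blocks))   # 90747 trimmed reads in total
```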

Since the total number of reads was consistent across the split databases, I moved on to look at the alignments produced by HPCdaligner.

HPCdaligner -M20 -dal128 -t16 -e0.7 -l1000 -s500 -h35 -H12000 raw_reads

s200

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las  raw_reads.5.las
$ LAcat raw_reads > s200.las
$ LAshow ../raw_reads s200 | head -n 2 | tail -n 1
s200: 24,540,600 records

s250

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las
$ LAcat raw_reads > s250.las
$ LAshow ../raw_reads s250.las | head -n 2 | tail -n 1
s250: 23,945,795 records

s300

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las
$ LAcat raw_reads > s300.las
$ LAshow ../raw_reads s300.las | head -n 2 | tail -n 1
s300: 22,533,518 records
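For scale, the relative drop in alignment count can be computed directly from the record totals above:

```python
# Record counts reported by LAshow for each block size.
records = {"s200": 24_540_600, "s250": 23_945_795, "s300": 22_533_518}
base = records["s200"]
for name, count in records.items():
    # Percent change relative to the -s200 run.
    print(name, f"{(count - base) / base:+.1%}")
```

So -s300 yields roughly 8% fewer alignments than -s200.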

Do you know why changing the block size would lead to such differences in the total number of yielded alignments?

Thanks, Greg Zynda

pb-jchin commented 8 years ago

Since you use "-t16", all k-mers that occur more than 16 times in a block will be filtered out as overlap seeds. When the block sizes differ, there will be some small discrepancies, because the k-mer counts are defined within each block.
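A toy illustration of this effect (not daligner's actual implementation): with a fixed per-block count threshold, the same repetitive sequence crosses the threshold once the block holds enough copies of it.

```python
import random
from collections import Counter

def surviving_seed_fraction(reads, k=14, t=16):
    """Count k-mers across a block; occurrences of any k-mer seen more
    than t times in the block are discarded (toy model of the -t filter)."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    total = sum(counts.values())
    kept = sum(c for c in counts.values() if c <= t)
    return kept / total

# A repetitive "genome": the same random motif sampled many times.
random.seed(0)
motif = ''.join(random.choice('ACGT') for _ in range(200))
small_block = [motif] * 10   # each k-mer occurs ~10 times: under -t, kept
big_block   = [motif] * 30   # each k-mer occurs >= 30 times: over -t, filtered
print(surviving_seed_fraction(small_block))  # close to 1.0
print(surviving_seed_fraction(big_block))    # 0.0
```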

zyndagj commented 8 years ago

Ah, that finally makes sense. For some reason I was registering -t on a per-read basis. The fact that it is a fixed count rather than a proportion makes it harder to automatically tune parameters for higher sequencing depths or genomes of different sizes.
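The tuning difficulty can be shown with back-of-the-envelope arithmetic (all numbers hypothetical): a repeat family that contributes k-mer counts proportional to block size flips from "kept" to "filtered" under a fixed -t as the block grows.

```python
def repeat_kmer_count(block_mb, copies_per_mb=0.06):
    """Hypothetical repeat: its k-mer count in a block scales with block size."""
    return block_mb * copies_per_mb

t = 16  # fixed -t threshold, independent of block size
for block_mb in (200, 250, 300):
    count = repeat_kmer_count(block_mb)
    print(block_mb, count, "filtered" if count > t else "kept")  # flips at 300
```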

Thanks, Greg

thegenemyers commented 8 years ago

Greg,

Jason's answer is correct: in a bigger block, -t will likely put more k-mers into the "cutoff" category. But even without -t, if you run with a fixed -M you can get the same effect, since with -M a cutoff is implicitly computed as a function of the available memory: with bigger blocks, more k-mers need to be cut off in order to stay within a fixed memory requirement.
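As a toy model of that memory-implied cutoff (not daligner's actual computation, and the budget figure is hypothetical): with a fixed budget of M index entries and a block containing B k-mer occurrences, at least a fraction 1 - M/B of occurrences must be dropped once B exceeds M, so bigger blocks drop more.

```python
M = 250_000_000  # hypothetical number of index entries implied by -M
for B in (200_000_000, 250_000_000, 300_000_000):
    # Minimum fraction of k-mer occurrences that cannot fit in memory.
    dropped = max(0.0, 1 - M / B)
    print(f"block of {B:>11,} k-mer occurrences: drop at least {dropped:.1%}")
```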

-- Gene
