thegenemyers / DALIGNER

Find all significant local alignments between reads

Alignments vary with block size #36

Closed: zyndagj closed this issue 8 years ago

zyndagj commented 8 years ago

Hello,

I have been running FALCON and noticed that the quality of my assembly varied with my chosen block size (-s). Using block sizes of 200, 250, and 300, I started my investigation with the databases after partitioning and trimming with DBsplit.

Input data:

$ DBsplit -x500 -s200 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         5
size = 200000000 cutoff =       500 all = 0
         0         0
     21809     18759
     43137     37294
     65145     56188
     87360     75255
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         18,759 reads        out of          21,809  ( 86.0%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         18,535 reads        out of          21,328  ( 86.9%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         18,894 reads        out of          22,008  ( 85.9%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

         19,067 reads        out of          22,215  ( 85.8%)
$ DBstats raw_reads.5
Statistics for all reads of length 500 bases or more

         15,492 reads        out of          18,091  ( 85.6%)

s250

$ DBsplit -x500 -s250 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size = 250000000 cutoff =       500 all = 0
         0         0
     27121     23399
     54026     46673
     81766     70463
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         23,399 reads        out of          27,121  ( 86.3%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         23,274 reads        out of          26,905  ( 86.5%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         23,790 reads        out of          27,740  ( 85.8%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

         20,284 reads        out of          23,685  ( 85.6%)

s300

$ DBsplit -x500 -s300 raw_reads
$ cat raw_reads.db
files =         3
      37121 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.3.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
      72158 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.1.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
     105451 m140913_050931_42139_c100713652400000001823152404301535_s1_p0.2.subreads m140913_050931_42139_c100713652400000001823152404301535_s1_p0
blocks =         4
size = 300000000 cutoff =       500 all = 0
         0         0
     32534     28089
     65144     56187
     98318     84642
    105451     90747
$ DBstats raw_reads.1
Statistics for all reads of length 500 bases or more

         28,089 reads        out of          32,534  ( 86.3%)
$ DBstats raw_reads.2
Statistics for all reads of length 500 bases or more

         28,098 reads        out of          32,610  ( 86.2%)
$ DBstats raw_reads.3
Statistics for all reads of length 500 bases or more

         28,455 reads        out of          33,174  ( 85.8%)
$ DBstats raw_reads.4
Statistics for all reads of length 500 bases or more

          6,105 reads        out of           7,133  ( 85.6%)

DBsplit Summary

-s     # Blocks    Total Reads
200    5           90,747
250    4           90,747
300    4           90,747
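The block boundaries printed by `cat raw_reads.db` are cumulative pairs of (all reads, trimmed reads), so the per-block counts reported by DBstats can be recovered by differencing consecutive rows. A small sketch using the -s200 table copied from the output above:

```python
# Cumulative (all_reads, trimmed_reads) rows from `cat raw_reads.db` (-s200).
s200 = [(0, 0), (21809, 18759), (43137, 37294),
        (65145, 56188), (87360, 75255), (105451, 90747)]

def per_block(cumulative):
    """Difference consecutive cumulative rows to get per-block counts."""
    return [(a2 - a1, t2 - t1)
            for (a1, t1), (a2, t2) in zip(cumulative, cumulative[1:])]

blocks = per_block(s200)
print(blocks[1])                   # (21328, 18535), matching DBstats raw_reads.2
print(sum(t for _, t in blocks))   # 90747 trimmed reads in total
```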

Since the total number of reads was consistent across the split databases, I moved on to look at the alignments produced by HPCdaligner.

HPCdaligner -M20 -dal128 -t16 -e0.7 -l1000 -s500 -h35 -H12000 raw_reads

s200

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las  raw_reads.5.las
$ LAcat raw_reads > s200.las
$ LAshow ../raw_reads s200 | head -n 2 | tail -n 1
s200: 24,540,600 records

s250

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las
$ LAcat raw_reads > s250.las
$ LAshow ../raw_reads s250.las | head -n 2 | tail -n 1
s250: 23,945,795 records

s300

$ ls
raw_reads.1.las  raw_reads.2.las  raw_reads.3.las  raw_reads.4.las
$ LAcat raw_reads > s300.las
$ LAshow ../raw_reads s300.las | head -n 2 | tail -n 1
s300: 22,533,518 records
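For scale, the relative drop in alignment count can be computed directly from the record totals above:

```python
# Record counts reported by LAshow for each block size.
records = {"s200": 24_540_600, "s250": 23_945_795, "s300": 22_533_518}
base = records["s200"]
for name, count in records.items():
    # Percent change relative to the -s200 run.
    print(name, f"{(count - base) / base:+.1%}")
```

So -s300 yields roughly 8% fewer alignments than -s200.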

Do you know why changing the block size would lead to such differences in the total number of yielded alignments?

Thanks, Greg Zynda

pb-jchin commented 8 years ago

Since you use "-t16", all k-mers that occur more than 16 times in a block will be filtered out as overlap seeds. When the block sizes differ, there will be some small discrepancies, because the k-mer counts are defined within each block.
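A toy illustration of this effect (not daligner's actual implementation): with a fixed per-block count threshold, the same repetitive sequence crosses the threshold once the block holds enough copies of it.

```python
import random
from collections import Counter

def surviving_seed_fraction(reads, k=14, t=16):
    """Count k-mers across a block; occurrences of any k-mer seen more
    than t times in the block are discarded (toy model of the -t filter)."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    total = sum(counts.values())
    kept = sum(c for c in counts.values() if c <= t)
    return kept / total

# A repetitive "genome": the same random motif sampled many times.
random.seed(0)
motif = ''.join(random.choice('ACGT') for _ in range(200))
small_block = [motif] * 10   # each k-mer occurs ~10 times: under -t, kept
big_block   = [motif] * 30   # each k-mer occurs >= 30 times: over -t, filtered
print(surviving_seed_fraction(small_block))  # close to 1.0
print(surviving_seed_fraction(big_block))    # 0.0
```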

zyndagj commented 8 years ago

Ah, that finally makes sense. For some reason I was registering -t on a per-read basis. The fact that it is a fixed count rather than a proportion makes it harder to automatically tune parameters for higher sequencing depths or genomes of different sizes.
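The tuning difficulty can be shown with back-of-the-envelope arithmetic (all numbers hypothetical): a repeat family that contributes k-mer counts proportional to block size flips from "kept" to "filtered" under a fixed -t as the block grows.

```python
def repeat_kmer_count(block_mb, copies_per_mb=0.06):
    """Hypothetical repeat: its k-mer count in a block scales with block size."""
    return block_mb * copies_per_mb

t = 16  # fixed -t threshold, independent of block size
for block_mb in (200, 250, 300):
    count = repeat_kmer_count(block_mb)
    print(block_mb, count, "filtered" if count > t else "kept")  # flips at 300
```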

Thanks, Greg

thegenemyers commented 8 years ago

Greg,

Jason's answer is correct: in a bigger block, -t will likely put more k-mers into the "cutoff" category. But even without -t, if you run with a fixed -M you can get the same effect, since with -M a cutoff is implicitly computed as a function of the available memory: with bigger blocks, more k-mers need to be cut off in order to stay within a fixed memory requirement.
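As a toy model of that memory-implied cutoff (not daligner's actual computation, and the budget figure is hypothetical): with a fixed budget of M index entries and a block containing B k-mer occurrences, at least a fraction 1 - M/B of occurrences must be dropped once B exceeds M, so bigger blocks drop more.

```python
M = 250_000_000  # hypothetical number of index entries implied by -M
for B in (200_000_000, 250_000_000, 300_000_000):
    # Minimum fraction of k-mer occurrences that cannot fit in memory.
    dropped = max(0.0, 1 - M / B)
    print(f"block of {B:>11,} k-mer occurrences: drop at least {dropped:.1%}")
```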

-- Gene
