pb-cdunn / FALCON

FALCON: experimental PacBio diploid assembler

Investigate dazcon with perfect reads #2

Open pb-cdunn opened 8 years ago

pb-cdunn commented 8 years ago

Maybe non-perfect reads too.

I sampled perfect, forward-only reads from an E. coli reference. With Jason's latest settings, they assembled into a single contig. (More on that in https://bugzilla.nanofluidics.com/show_bug.cgi?id=32077 )

With dazcon on the perfect reads, I needed more tweaking for a single contig:

dazcon = 1
pa_dazcon_option = -j 4 -t 0 -c 2 -l 1 -xo

Maybe not all of those are needed, but this did not work:

pa_dazcon_option = -j 4 -t 0 -c 2 -l 500 -x

I do not know why that failed. Perfect reads should need no correction at all, yet raising the correction thresholds (-c and -l) only makes things worse.
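To see exactly which settings separate the run that gave a single contig from the one that did not, the two option strings above can be compared mechanically. This is a minimal sketch: `parse_options` and `diff_options` are hypothetical helpers, not part of FALCON or dazcon, and the parser treats every token as opaque. In particular, it does not assume that `-xo` means `-x` plus `-o`; how dazcon itself interprets that token is not stated in this thread.

```python
import shlex


def parse_options(option_string):
    """Split an option string into {flag: value}; bare flags map to None."""
    tokens = shlex.split(option_string)
    options = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        # A token is taken as this flag's value only if it does not look like a flag.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        options[flag] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return options


def diff_options(a, b):
    """Return {flag: (value_in_a, value_in_b)} for every flag that differs."""
    pa, pb = parse_options(a), parse_options(b)
    missing = "<absent>"
    return {
        flag: (pa.get(flag, missing), pb.get(flag, missing))
        for flag in sorted(set(pa) | set(pb))
        if pa.get(flag, missing) != pb.get(flag, missing)
    }


# Option strings copied from this comment.
working = "-j 4 -t 0 -c 2 -l 1 -xo"
failing = "-j 4 -t 0 -c 2 -l 500 -x"

for flag, (in_working, in_failing) in diff_options(working, failing).items():
    print(f"{flag}: working={in_working!r} failing={in_failing!r}")
```

Only `-l` and the `-x`/`-xo` token differ between the two runs, so those are the candidates to vary one at a time.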

dazcon might still work well for typical data, and it is much faster than fc_consensus. So I'd like to study the differences in their output a bit further.

Here are the runtimes for comparison:

dazcon:

real    2m30.713s
user    15m7.765s
sys     2m39.504s

fc_consensus:

real    4m16.112s
user    40m17.149s
sys     4m5.837s
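For a quick sense of scale, the `time` figures above can be turned into ratios. This is a small sketch: `to_seconds` and `ratios` are hypothetical helper names, and the numbers are copied verbatim from this comment. It says nothing about hardware or input size, which the thread does not give.

```python
import re


def to_seconds(stamp):
    """Convert a shell `time` stamp such as '2m30.713s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def ratios(slow, fast):
    """Return slow/fast for each timing category."""
    return {key: to_seconds(slow[key]) / to_seconds(fast[key]) for key in fast}


# Timings as reported above.
dazcon = {"real": "2m30.713s", "user": "15m7.765s", "sys": "2m39.504s"}
fc_consensus = {"real": "4m16.112s", "user": "40m17.149s", "sys": "4m5.837s"}

for key, value in ratios(fc_consensus, dazcon).items():
    print(f"{key}: fc_consensus / dazcon = {value:.2f}")
```

On these figures fc_consensus takes roughly 1.7 times the wall-clock time and about 2.7 times the user CPU time of dazcon.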

I'll experiment a bit more with real data, where dazcon might be easier to tune.

pb-cdunn commented 8 years ago

jdrake wrote:

Thanks for the update. This is something I’m keenly interested in learning more about. I’m guessing there’s something pathological buried in the graph traversal (perhaps the scoring) that causes it to pick the wrong paths when using perfect data.

pb-cdunn commented 8 years ago

Well, dazcon seems to work well on regular data. This config gave essentially the same single-contig result with and without dazcon:

[General]
dazcon = 1
pa_dazcon_option = -j 6 -t 5 -c 4 -l 500 -x
#job_type = local
input_fofn = ecoli.fofn
input_type = raw
genome_size = 4600000
seed_coverage = 135
length_cutoff = -1
length_cutoff_pr = 12000

sge_option_da = -pe smp 4 -q production
sge_option_la = -pe smp 2 -q production
sge_option_cns = -pe smp 6 -q production
sge_option_pda = -pe smp 4 -q production
sge_option_pla = -pe smp 2 -q production
sge_option_fc = -pe smp 8 -q production
pa_concurrent_jobs = 32
cns_concurrent_jobs = 32
ovlp_concurrent_jobs = 32

pa_HPCdaligner_option =   -v -B4 -k14 -t16 -h35 -e.70 -s1000 -l1000
ovlp_HPCdaligner_option = -v -B4 -k14 -t32 -h60 -e.96 -s1000 -l500

pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 6

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10 --n_core 24
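For repeating the with/without comparison, the two config variants can be generated from one source rather than edited by hand. The sketch below uses only Python's standard `configparser` on a shortened copy of the config above. `load` and `with_dazcon` are hypothetical helpers, and writing `dazcon = 0` to disable dazcon is an assumption: the thread does not say how the non-dazcon run was configured.

```python
import configparser
import io

# Shortened copy of the config above; the remaining keys are omitted for brevity.
CFG_TEXT = """\
[General]
dazcon = 1
pa_dazcon_option = -j 6 -t 5 -c 4 -l 500 -x
#job_type = local
input_fofn = ecoli.fofn
genome_size = 4600000
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 6
"""


def load(text):
    """Parse config text, keeping key case and disabling % interpolation."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve case in keys such as pa_HPCdaligner_option
    parser.read_string(text)
    return parser


def with_dazcon(text, enabled):
    """Return config text with the dazcon switch set (assumes 0 means off)."""
    parser = load(text)
    parser["General"]["dazcon"] = "1" if enabled else "0"
    out = io.StringIO()
    parser.write(out)  # note: commented-out lines are not written back
    return out.getvalue()


variant_off = with_dazcon(CFG_TEXT, enabled=False)
print(variant_off)
```

Everything except the `dazcon` value is carried over unchanged, so any difference between the two assemblies can be attributed to that one switch.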