thegenemyers / DALIGNER

Find all significant local alignments between reads

Still getting Duplicate overlap #55

Closed. hodgett closed this issue 6 years ago.

hodgett commented 7 years ago

[ from https://github.com/PacificBiosciences/FALCON/issues/515 ] We are using FALCON-integrate 1.8.6, and for our large sequences we are still getting the occasional Duplicate overlap error that kills the whole process.

  raw_reads.55.raw_reads.213.N3: 324,318 all OK
  raw_reads.56.raw_reads.213.C0: 332,440 all OK
  raw_reads.56.raw_reads.213.C1: Duplicate overlap (1573849 vs 6403244)
  raw_reads.56.raw_reads.213.C2: 325,068 all OK
  raw_reads.56.raw_reads.213.C3: 336,228 all OK

Based on our fc_run.cfg below, are there any suggestions? We have tried reducing -s to 100 and increasing the cutoff, but we still get the problem. We actually need to reduce the cutoff.

[General]
## config file for FALCON v1.8.6
#job_type = local
job_type = pbs
input_fofn = input.fofn
input_type = raw
length_cutoff = 5000
length_cutoff_pr = 5000
job_queue = lyra

sge_option = -l nodes=1:ppn=2,walltime=128:00:00,mem=20gb -W umask=0007
sge_option_da = -l nodes=1:ppn=4,walltime=96:00:00,mem=25gb -W umask=0007
sge_option_la = -l nodes=1:ppn=2,walltime=96:00:00,mem=24gb -W umask=0007
sge_option_pda = -l nodes=1:ppn=8,walltime=96:00:00,mem=25gb -W umask=0007
sge_option_pla = -l nodes=1:ppn=2,walltime=96:00:00,mem=26gb -W umask=0007
sge_option_fc = -l nodes=1:ppn=1,walltime=96:00:00,mem=25gb -W umask=0007
sge_option_cns = -l nodes=1:ppn=8,walltime=96:00:00,mem=24gb -W umask=0007

pa_concurrent_jobs = 8
cns_concurrent_jobs = 8
ovlp_concurrent_jobs = 8

pa_HPCdaligner_option =  -v -B128 -e.70 -l1000 -s1000 -M24
ovlp_HPCdaligner_option = -v -B128 -h60 -e.96 -l500 -s1000 -M24

pa_DBsplit_option = -x500 -s300
ovlp_DBsplit_option = -x500 -s300

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 0
overlap_filtering_setting = --max_diff 80 --max_cov 80 --min_cov 2 --bestn 10 --n_core 8

pb-cdunn says, "If this is on the latest DALIGNER, it is a serious problem. I suggest providing a test-case to thegenemyers, the owner of DALIGNER." What information can I provide to help resolve this, keeping in mind that our dataset is over 100Gb?

thegenemyers commented 7 years ago

I should be able to reproduce the error if I have .fasta files for the two read blocks. You can produce these as follows:

  DBshow raw_reads.56 > block.56.fasta
  DBshow raw_reads.213 > block.213.fasta

The files are likely to be 300-400Mbp each, depending on how you split the DB, so you'll need to make them available with something like Dropbox or a public FTP download.

Best, Gene

hodgett commented 7 years ago

Gene,

I am uploading these blocks instead, as they are from a more recent attempt:

  raw_reads.122.raw_reads.134.C2: Duplicate overlap (3591039 vs 3951857)
  raw_reads.134.raw_reads.122.C2: Duplicate overlap (3951857 vs 3591039)

https://cloudstor.aarnet.edu.au/plus/index.php/s/vPncp5WgHqbwQuC https://cloudstor.aarnet.edu.au/plus/index.php/s/MclohoOn7WRdjvB

Let me know if you have any trouble getting the files and I hope this helps.

Regards,

Matt

thegenemyers commented 7 years ago

Great, got both blocks. I'll let you know if I can reproduce the problem at my end soon. Cheers, G

hodgett commented 7 years ago

Excellent, thanks. I can provide more blocks if required, just let me know.

Matthew

thegenemyers commented 7 years ago

@hodgett @pb-cdunn So far I have not been able to reproduce the problem. Two questions:

(1) Most importantly, I note that y'all are not using the most recent version of daligner, as the thread files with extensions .C0, .C1, etc. are no longer produced. This change was made about 4 months ago, and I fixed a "duplicate overlap" problem about 5 months ago. This could simply be because the release of Falcon you are using does not include the latest version of daligner, which is why I'm including Chris Dunn in the discussion. Chris: does Falcon now use the version that produces a single .las for a block-vs-block comparison? I note that the repository pacbio/dazzler refers to a quite old version of daligner (from 11 months ago).

(2) I split the blocks with -x500 -s300 and ran daligner with the defaults save for -s1000 -l1000 and, most importantly, -M16, as I only have 16Gb on my Mac. When I get home from my vacation next week, I can run with -M24, and this may make the difference, but it seems a bit of a long shot. So I ask: have I got the parameters correct?

Cheers, Gene

pb-cdunn commented 7 years ago

Gene, Does it truly produce no .C0/.C1/...? Or does it produce those in /tmp and delete them? The latter is a problem for us, so we have not yet integrated the code using /tmp. We are up-to-date wrt

commit 8f179db7fb0bf59f34c4a073b85137a592336d89
Author: thegenemyers <gene.myers@gmail.com>
Date:   Sat Aug 6 06:25:10 2016

    Bug fix in "Entwine" manifesting as trace point errors

    About a month ago introduced a bug when improving handling of redundant
    LAs within daligner.  Fixed.

Matthew, you should be using use_tmpdir = true to eliminate file-system latency as a culprit, but that's a Falcon issue.

Are you able to reproduce this problem by running only the daligner job, from the command line using the generated bash script? You could post that bash script here. (Gene does not care about Falcon parameters or the Falcon workflow.) Maybe -M24 is somehow the cause.

What version of DALIGNER are you running? Run 'cd DALIGNER; git rev-parse HEAD'. (That will be in the PacBio repo, but we can still use that information.)
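
A rough sketch of that version check; the path to the DALIGNER checkout inside FALCON-integrate is an assumption, not taken from the actual run:

  # Report the daligner commit that FALCON-integrate actually built, and
  # confirm which binaries the cluster jobs will find first on PATH.
  cd FALCON-integrate/DALIGNER        # assumed location of the bundled checkout
  git rev-parse HEAD
  which daligner LAsort LAmerge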

hodgett commented 7 years ago

Gene,

I did try to make sure that the version was the latest. Although the dates shown on GitHub for FALCON-integrate 1.8.6 did appear old, I noted that the commits for daligner were more up to date. I followed the instructions for installing FALCON-integrate, which I can only assume would pull the latest and greatest daligner? Running 'git rev-parse HEAD' I get 7c09c812051983ac269d01840a1e8e8b982a06f6.

I can also confirm that I split the blocks with -x500 -s300 and ran daligner with the defaults save for -s1000 -l1000. I do use -M24; in some earlier trials I increased it to -M32 but noticed that performance appeared to suffer.

Matt

hodgett commented 7 years ago

I did try using use_tmpdir = true, but it made no improvement. I found it added another layer of complexity in our environment (sometimes the permissions on the storage local to one of the 250 nodes do something unpredictable), and it did not alter performance, so I've been leaving it out.

Running 'git rev-parse HEAD' I get 7c09c812051983ac269d01840a1e8e8b982a06f6, which doesn't correspond to any of Gene's DALIGNER commits, but it is the latest included with FALCON-integrate. Are there differences between your code and Gene's?

I will try to find some time over the next day or two to do some more testing, i.e. running the bash script for that individual job.

pb-cdunn commented 7 years ago

use_tmpdir = true

Let's discuss only the DALIGNER issue here. You can re-run daligner and LAsort/LAmerge with a version of DALIGNER from this repo. You do not need to run Falcon at all to debug this.
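
One way to do that, sketched under the assumption that the generated script calls daligner, LAsort, and LAmerge by name, so PATH order decides which copy runs; the clone location is arbitrary:

  # Build the upstream DALIGNER and put it ahead of the FALCON-bundled copy
  # for the duration of the test.
  git clone https://github.com/thegenemyers/DALIGNER.git daligner-upstream
  cd daligner-upstream && make && cd ..
  export PATH="$PWD/daligner-upstream:$PATH"
  which daligner LAsort LAmerge LAcheck   # should all resolve to daligner-upstream/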

hodgett commented 7 years ago

I have to admit that I can find nothing that helps me set up these processes individually. How do I go about achieving what you ask?

I have separately tried to incorporate this version of daligner into Falcon, but it simply does not work when compiled in with the toolchain (i.e. replacing the PacBio fork with this version).

pb-cdunn commented 7 years ago

Falcon generates a bash script for each task. You can run the bash script in the task directory for a given daligner job. You can also run the bash script in a merge-task directory. You should be able to reproduce your problem independent of Falcon. Then, Gene can help debug any problem you have with daligner/LAmerge.
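
For example, something along these lines; the task-directory path is hypothetical, and rj_0111.sh stands in for whichever rj_xxx.sh job is failing:

  # Re-run a single daligner task outside the workflow and keep the log, so the
  # exact commands and any "Duplicate overlap" message can be posted here.
  cd /path/to/run_dir/0-rawreads/job_0111       # hypothetical task directory
  bash rj_0111.sh 2>&1 | tee rj_0111.rerun.log
  grep -n "Duplicate overlap" rj_0111.rerun.log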

hodgett commented 7 years ago

I missed a script. From what I saw, there were three scripts: run.sh calls task.sh, and task.sh uses parameters from task.json. I had missed the rj_xxx.sh script. Thanks for that one; I'll try it and see if I still get the same duplicate overlap.

hodgett commented 7 years ago

OK, I have loaded Gene's version and run the rj_xxx.sh script. This time I'm getting a new error, on a different pair of files, that may or may not be related. I am re-running on a different node under less load to see if I get the same result. The original run was performed using these settings:

pa_HPCdaligner_option =  -v -B128 -e.70 -l1000 -s1000 -M24
ovlp_HPCdaligner_option = -v -B128 -h60 -e.96 -l500 -s1000 -M24

Output tail from running rj_xxx.sh

LAsort /tmp/raw_reads.213.raw_reads.38.[CN]*.las
LAmerge raw_reads.213.raw_reads.38.las /tmp/raw_reads.213.raw_reads.38.[CN]*.S.las
rm /tmp/raw_reads.213.raw_reads.38.[CN]*.las
LAsort /tmp/raw_reads.38.raw_reads.213.[CN]*.las
LAmerge raw_reads.38.raw_reads.213.las /tmp/raw_reads.38.raw_reads.213.[CN]*.S.las
rm /tmp/raw_reads.38.raw_reads.213.[CN]*.las

Building index for raw_reads.39

   Kmer count = 299,638,863
   Using 8.93Gb of space
   Index occupies 4.46Gb

Comparing raw_reads.213 to raw_reads.39

   Capping mutual k-mer matches over 168 (effectively -t12)
   Hit count = 621,877,828
   Highwater of 23.00Gb space
daligner: Out of memory (Allocating daligner hit vectors)

pb-cdunn commented 7 years ago

Try reducing -M24 well under your actual memory limit.

hodgett commented 7 years ago

-M24 is already well under the limit. Head nodes have 128Gb, nodes have 256Gb.

hodgett commented 7 years ago

I have re-run rj_0111.sh using the latest daligner from Gene, and I still have:

  raw_reads.56.raw_reads.213.C1: Duplicate overlap (1573849 vs 6403244)
  raw_reads.56.raw_reads.213.C2: 325,068 all OK
  raw_reads.56.raw_reads.213.C3: 336,228 all OK
  raw_reads.56.raw_reads.213: Duplicate overlap (1573849 vs 6403244)

I have temporarily uploaded the DBshow output of these blocks to:

https://nextcloud.qriscloud.org.au/index.php/s/Bkkd2Js1C45nPfG
https://nextcloud.qriscloud.org.au/index.php/s/k85itrErdK2jSRX

thegenemyers commented 7 years ago

I tried the latest blocks and ran with -M24, and I still could not reproduce the error. Could you please send me the script rj_0111.sh so I can be certain that I have duplicated all the flags/options exactly.

Alternatively, could you duplicate at your end what I did here and see whether you do or do not get a duplicate overlap error? Specifically, I started from your two files Block.56.fasta and Block.213.fasta and performed the following commands:

  fasta2DB Block56 Block.56.fasta
  fasta2DB Block213 Block.213.fasta
  DBsplit -s300 -x500 Block56
  DBsplit -s300 -x500 Block213
  daligner -v -M24 -e.7 -l1000 -s1000 Block213 Block56
  LAcheck -vS Block56 Block213 Block56.Block213.las
  LAcheck -vS Block213 Block56 Block213.Block56.las

Both calls to LAcheck reported OK. Note that, given just the data from the blocks, all I can do is create a DB from each and then run daligner between them, so the context is subtly different from calling daligner on blocks of a much larger DB. Depending on the nature of the problem, that may be just enough of a change in conditions that the problem does not occur. What I need, then, is a pair of blocks such that the sequence of calls above leads to the problem. If the above is problem-free at your end, then perhaps you could try generating, say, 20 block .fasta's and running pairs of them to see if you can trigger the problem, and then send me an offending pair.
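
A rough script for that pairwise search might look like the following; the block range is a placeholder, the flags mirror the commands above, and it assumes each exported block fits in a single DB block after splitting (as in the commands above):

  # Export ~20 blocks, rebuild a standalone DB for each, then align every pair
  # and run LAcheck on the results, logging any complaints.
  BLOCKS=$(seq 120 139)                 # placeholder block numbers

  for i in $BLOCKS; do
    DBshow raw_reads.$i > Block.$i.fasta
    fasta2DB Block$i Block.$i.fasta
    DBsplit -s300 -x500 Block$i
  done

  for i in $BLOCKS; do
    for j in $BLOCKS; do
      [ "$i" -lt "$j" ] || continue
      daligner -v -M24 -e.7 -l1000 -s1000 Block$i Block$j
      # Record any complaint (e.g. "Duplicate overlap") for this pair.
      LAcheck -vS Block$i Block$j Block$i.Block$j.las 2>&1 | tee -a lacheck.log
      LAcheck -vS Block$j Block$i Block$j.Block$i.las 2>&1 | tee -a lacheck.log
    done
  done

  grep -i "duplicate" lacheck.log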

Sorry this is not going easily. But keep in mind that we are talking about an event that occurs only once every billion overlaps or so :-)