marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Consensus jobs failed #1564

Closed Malabady closed 4 years ago

Malabady commented 4 years ago

Hi,

All consensus jobs completed successfully except for 3. Rerunning the Canu script didn't solve the problem. I tried to run those 3 jobs interactively from within the 5-consensus folder using the command: sh ./consensus.sh . The job runs for a while before failing with the following message:

...
...
10942    105409      80        76   23.19x        0    0.00x         4    2.19x
  10943     71488      42        38   18.76x        0    0.00x         4    3.18x
  10944     73170      86        84   21.86x        0    0.00x         2    1.85x
generateTemplateStitch()-- significant size difference, stopping.
utgcns: utgcns/libcns/unitigConsensus.C:557: char* generateTemplateStitch(abAbacus*, tgPosition*, uint32, double, bool): Assertion `(tiglen < 100000) || ((-50.0 <= pd) && (pd <= 50.0))' failed.

Failed with 'Aborted'; backtrace (libbacktrace):
utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
utgcns/libcns/unitigConsensus.C::557 in _Z22generateTemplateStitchP8abAbacusP10tgPositionjdb()
utgcns/libcns/unitigConsensus.C::786 in _ZN15unitigConsensus13generatePBDAGEP5tgTigcPSt3mapIjP6sqReadSt4lessIjESaISt4pairIKjS4_EEEPS2_IjP10sqReadDataS6_SaIS7_IS8_SE_EEE()
utgcns/utgcns.C::522 in main()
(null)::0 in (null)()
(null)::0 in (null)()
./consensus.sh: line 103: 37276 Aborted                 (core dumped) $bin/utgcns -S ../run.${tag}Store/partitionedReads.seqStore -T ../run.${tag}Store 1 $jobid -O ./${tag}cns/$jobid.cns.WORKING -maxcoverage 40 -e 0.05 -pbdagcon -edlib -threads 2-8

I tried two out of the three failed jobs and got a similar error. I reran them with 8 cores and 100 GB of memory.

Reading through the issues, this seems to have been a bug in earlier releases, but I am using Canu 1.8. Additionally, in a different assembly with a smaller part of this data, I didn't run into this issue and the assembly finished successfully.

Any suggestions on how to resolve this issue?

Thanks

skoren commented 4 years ago

This is not so much a bug as a check that was removed in subsequent versions; it still exists in 1.8.

Your first try should be to increase the error rate allowed for consensus: edit the -e 0.05 to -e 0.25 in consensus.sh and re-run the failing jobs by hand using sh consensus.sh <jobnum>. Try this on an interactive session/node with the same memory as the jobs were requesting (8 cores, 100 GB RAM). If that still doesn't work, you could comment out the check in the code and recompile, or update to 1.9 and re-run the assembly step from the trimmed reads (which will likely produce a better assembly).
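
For example, a minimal sketch based on the utgcns command in the log above (the exact line in your consensus.sh may differ):

    # in 5-consensus/consensus.sh, change the consensus error rate:
    #   ... -e 0.05 -pbdagcon -edlib -threads 2-8    # before
    #   ... -e 0.25 -pbdagcon -edlib -threads 2-8    # after
    cd 5-consensus
    sh ./consensus.sh <jobnum>    # one of the three failed job numbers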

Malabady commented 4 years ago

Increasing the error allowed for consensus solved this issue.

The canu-1.9 release page (https://github.com/marbl/canu/releases/tag/v1.9) says explicitly that canu 1.9 is NOT compatible with assemblies started with earlier versions. Based on your suggestion, I assume it is okay to use canu 1.9 on canu-1.8-error-corrected reads. Correct?

Is "Canu snapshot v2.0-development +281 changes (r9774 126b9c814200893bba3e0a517d484454a16fe869)" the same as canu-1.9?

The precompiled canu-1.9 that I downloaded from the above-mentioned page doesn't recognize gridEngineThreadsOption and gridEngineMemoryOption. Is this normal?

Thank you.

skoren commented 4 years ago
  1. Yes, it is incompatible with the binary assembly intermediates. If you run canu -assemble with the trimmed reads, that is a new assembly which will recompute overlaps etc., so there is no incompatibility (see the sketch after this list).

  2. No, the snapshot has changes post-release and has not been validated with regression tests the way a release has. Don't use the development version.

  3. Yes, these options were replaced by a single option, gridEngineResourceOption, to which you can provide both the threads and memory options you were using before (see the documentation here: https://canu.readthedocs.io/en/latest/parameter-reference.html#grid-engine-configuration, and the sketch below).
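
Illustrative sketches for (1) and (3); the output directory, read file name, and scheduler flags below are placeholders rather than values taken from your run:

    # (1) start a new assembly from the canu-1.8 trimmed reads; overlaps are recomputed:
    canu -assemble -p run -d new-assembly \
      genomeSize=3600000000 \
      -pacbio-corrected run.trimmedReads.fasta.gz

    # (3) a single resource option; Canu substitutes THREADS and MEMORY per job,
    # and the exact flags depend on your scheduler (a PBS/Torque-style guess shown here):
    gridEngineResourceOption="-l nodes=1:ppn=THREADS,mem=MEMORY"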

Malabady commented 4 years ago

Please take a look at the following segmentation fault. It is from the canu-1.8 job.

-- All 315 consensus jobs finished successfully.
-- Finished stage 'consensusCheck', reset canuIteration.
-- Using slow alignment for consensus (iteration '0').
-- Configured 181 contig and 134 unitig consensus jobs.
-- Using slow alignment for consensus (iteration '0').
-- Configured 181 contig and 134 unitig consensus jobs.
----------------------------------------
-- Starting command on Tue Nov 19 15:21:56 2019 with 486928.045 GB free disk space

    cd unitigging
    /usr/local/apps/eb/canu/1.8-Linux-amd64/bin/tgStoreLoad \
      -S ../run.seqStore \
      -T  ./run.ctgStore 2 \
      -L ./5-consensus/ctgcns.files \
    > ./5-consensus/ctgcns.files.ctgStoreLoad.err 2>&1

-- Finished on Tue Nov 19 15:28:20 2019 (384 seconds) with 486713.45 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Tue Nov 19 15:28:20 2019 with 486713.45 GB free disk space

    cd unitigging
    /usr/local/apps/eb/canu/1.8-Linux-amd64/bin/tgStoreLoad \
      -S ../run.seqStore \
      -T  ./run.utgStore 2 \
      -L ./5-consensus/utgcns.files \
    > ./5-consensus/utgcns.files.utgStoreLoad.err 2>&1

-- Finished on Tue Nov 19 15:29:44 2019 (84 seconds) with 486745.526 GB free disk space
----------------------------------------
-- Purging consensus output after loading to ctgStore and/or utgStore.
-- Purged 315 .cns outputs.
----------------------------------------
-- Starting command on Tue Nov 19 15:29:56 2019 with 487105.905 GB free disk space

    cd unitigging
    /usr/local/apps/eb/canu/1.8-Linux-amd64/bin/tgStoreDump \
      -S ../run.seqStore \
      -T ./run.ctgStore 2 \
      -sizes -s 3600000000 \
    > ./run.ctgStore/seqDB.v002.sizes.txt

Failed with 'Segmentation fault'; backtrace (libbacktrace):
utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
(null)::0 in (null)()
stores/tgTig.H::297 in _ZN5tgTig19mapGappedToUngappedEj()
stores/tgStoreDump.C::185 in _ZN8tgFilter14ignoreCoverageEP5tgTigb()
stores/tgStoreDump.C::141 in _ZN8tgFilter6ignoreEP5tgTigb()
stores/tgStoreDump.C::444 in _Z9dumpSizesP7sqStoreP7tgStoreR8tgFilterbm()
stores/tgStoreDump.C::1278 in main()
(null)::0 in (null)()
(null)::0 in (null)()
sh: line 4: 210649 Segmentation fault      (core dumped) /usr/local/apps/eb/canu/1.8-Linux-amd64/bin/tgStoreDump -S ../run.seqStore -T ./run.ctgStore 2 -sizes -s 3600000000 > ./run.ctgStore/seqDB.v002.sizes.txt

-- Finished on Tue Nov 19 15:32:17 2019 (141 seconds) with 487038.998 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 139.  (rc=35584)
ERROR:

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to generate unitig sizes.
ABORT:

I tried to run the failed command interactively with more memory (150 GB), but it failed with the same error. Does this have anything to do with the consensus error-rate increase that we made to rescue those three jobs?

skoren commented 4 years ago

No, I think what's happened is that your store became corrupted during your restart attempts; I'd guess some jobs ran at the same time (that is, one or more tigs don't have a valid consensus). There is no good way to recover from that other than backing up to an earlier assembly step. You could try getting just the assembly from the store to see if that works (run the command above but replace -sizes -s 3600000000 with -contigs -fasta). If this doesn't crash, you have your assembly fasta.
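
Based on the tgStoreDump command from your log, that would look roughly like this (the output file name is just an example):

    cd unitigging
    /usr/local/apps/eb/canu/1.8-Linux-amd64/bin/tgStoreDump \
      -S ../run.seqStore \
      -T ./run.ctgStore 2 \
      -contigs -fasta \
    > ./run.contigs.fasta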

If it does crash (and you don't have a backup of your ctgStore folder), you would need to remove the run.ctgStore, run.utgStore, 4-unitigger, and 5-consensus folders and re-launch Canu specifying cnsErrorRate=0.25 to avoid the initial error.
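
A sketch of the relaunch; keep whatever prefix, directory, and read options you used originally and just add the parameter (everything in angle brackets is a placeholder):

    canu -p run -d <assembly-dir> \
      genomeSize=3600000000 \
      cnsErrorRate=0.25 \
      <your original read options>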

Malabady commented 4 years ago

Isn't changing the cnsErrorRate from 0.05 to 0.25 a big jump? Since only three jobs failed for this reason, what do you think about using cnsErrorRate=0.10? Or, in other words, does it make any significant difference? I assume a high cnsErrorRate can lead to misassemblies.

skoren commented 4 years ago

Setting the high rate won't matter much; if a lower overlap identity is available, it will be used first. This is only an upper limit and won't affect mis-assemblies, as the contigs are already constructed; only their consensus is being computed.

Malabady commented 4 years ago

Thank you so much, Sergey. You mentioned earlier that canu 1.9 will produce a better assembly. Since my dataset is quite large, it will take several weeks to go through the trimming and assembly again. Do you think the improvement in the assembly is worth it? I will do it anyhow, but I want to get an idea of what kind of improvement to expect.

skoren commented 4 years ago

You don't need to re-run trimming, just assembly. I can't say exactly how much of an improvement, as it depends on your reads and genome, but we've seen better repeat resolution and higher-quality consensus.

Malabady commented 4 years ago

I have run into another issue that I haven't seen before. After removing the run.ctgStore, run.utgStore, 4-unitigger, and 5-consensus folders and relaunching canu, I got the following problem. I restarted the job multiple times, but it didn't work. I notice that there are no ctgcns or utgcns folders inside the 5-consensus folder.

09:25:12    $ tail -n 30 rosea4/canu.out
--
-- Generating assembly 'run' in '/scratch/malabady/PitcherGenome/PitchPacBio/canu_assembly/rosea4'
--
-- Parameters:
--
--  genomeSize        3600000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0500 (  5.00%)
--    utgOvlErrorRate 0.0500 (  5.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0500 (  5.00%)
--    utgErrorRate    0.0500 (  5.00%)
--    cnsErrorRate    0.0500 (  5.00%)
--
--
-- BEGIN ASSEMBLY
--
--
-- Graph alignment jobs failed, tried 2 times, giving up.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

skoren commented 4 years ago

That's good because it implies consensus in your re-run completed. You technically have the assembly, but we can also look at why the graph alignment failed. Can you post the output of unitigging/4-unitigger/alignGFA*?

Malabady commented 4 years ago

I found the problem. See the following two failed commands:

/var/spool/torque/mom_priv/jobs/1767894.sapelo2.SC: line 64: 56473 Killed                  $bin/alignGFA -T ../run.ctgStore 2 -i ./run.contigs.gfa -o ./run.contigs.aligned.gfa -t 32 > ./run.contigs.aligned.gfa.err 2>&1
/var/spool/torque/mom_priv/jobs/1767894.sapelo2.SC: line 76: 58057 Killed                  $bin/alignGFA -bed -T ../run.utgStore 2 -C ../run.ctgStore 2 -i ./run.unitigs.bed -o ./run.unitigs.aligned.bed -t 32 > ./run.unitigs.aligned.bed.err 2>&1

I ran them interactively and they completed successfully, as did the rest of the assembly. See the following stats:

sum = 6314580126, n = 27578, ave = 228971.65, largest = 34166068
N50 = 1512943, n = 699
N60 = 390511, n = 1624
N70 = 199849, n = 3991
N80 = 124717, n = 8058
N90 = 86780, n = 14177
N100 = 1012, n = 27578
N_count = 0
Gaps = 0

skoren commented 4 years ago

OK, so you've finished the assembly. The stats look OK, but the total is much larger than your specified genome size of 3.6 Gbp; how heterozygous is this genome? The histograms in the report file that are output along the way can be given to genomescope to estimate this. You'll probably need to run purge_dups to remove haplotypes from the assembly.
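
For reference, the purge_dups pipeline looks roughly like this sketch (assuming PacBio reads; asm.fasta and reads.fasta.gz are placeholders, and the exact flags should be checked against the purge_dups README):

    # map reads to the assembly and compute read-depth cutoffs
    minimap2 -x map-pb asm.fasta reads.fasta.gz | gzip -c > reads.paf.gz
    pbcstat reads.paf.gz
    calcuts PB.stat > cutoffs
    # self-align the split assembly and mark duplicated/haplotig regions
    split_fa asm.fasta > asm.split.fasta
    minimap2 -x asm5 -DP asm.split.fasta asm.split.fasta | gzip -c > asm.split.self.paf.gz
    purge_dups -2 -T cutoffs -c PB.base.cov asm.split.self.paf.gz > dups.bed
    # write purged.fa (retained primary contigs) and hap.fa (removed haplotigs)
    get_seqs -e dups.bed asm.fasta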

Malabady commented 4 years ago

Based on my genomescope analysis, heterozygosity is less than 1% for this genome. Given that I used corrected reads for the genomescope analysis, I am somewhat skeptical about that low heterozygosity. I have used purge_dups with an earlier canu assembly of only 50X of this data. Here are the results:

stats for Canu.contigs.fasta
sum = 6266922019, n = 23648, ave = 265008.54, largest = 27667202
N50 = 1490962, n = 725
N60 = 437538, n = 1571
N70 = 214621, n = 3720
N80 = 130690, n = 7519
N90 = 87977, n = 13427
N100 = 7781, n = 23648
N_count = 0
Gaps = 0

stats for purged.fa
sum = 3132131836, n = 7674, ave = 408148.53, largest = 27667202
N50 = 2900367, n = 275
N60 = 1972553, n = 404
N70 = 918986, n = 626
N80 = 247504, n = 1364
N90 = 115823, n = 3302
N100 = 7781, n = 7674
N_count = 11753
Gaps = 511

stats for hap.fa
sum = 3134803293, n = 16426, ave = 190843.98, largest = 27275432
N50 = 315409, n = 1162
N60 = 198902, n = 2439
N70 = 136815, n = 4353
N80 = 100925, n = 7051
N90 = 74213, n = 10648
N100 = 7829, n = 16426
N_count = 1357
Gaps = 59

What do you think?

skoren commented 4 years ago

Corrected reads are normally reliable for estimating heterozygosity, but I wouldn't have expected 1% to be separating so much of the genome. You could always compare to an Illumina dataset, if you have one, to get another estimate of zygosity (or align the hap.fa to the purged.fa).
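
For example, a minimal sketch of that comparison with minimap2 (file names as produced by purge_dups; the preset is a suggestion, not a requirement):

    # assembly-to-assembly alignment of removed haplotigs against the purged primary set
    minimap2 -x asm5 purged.fa hap.fa > hap_vs_purged.paf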

The purge_dups results look OK, but I don't think it should be adding gaps; I'm not sure how you ended up with Ns after it. It may be worth asking on the purge_dups repo to clarify.

I'll leave it up to you if you want to run 1.9; your assembly already seems pretty contiguous, so it may not be worth the computational time. I'm going to close the issue since the canu errors you encountered have been resolved and you were able to finish the assembly.