marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

heterozygous parameters and coverage #439

Closed WIAIW closed 7 years ago

WIAIW commented 7 years ago

Dear All,

I'm new to PacBio assembly and I have several questions about Canu, so I'm posting them here; any ideas are welcome. I have to assemble a heterozygous diploid genome, but the available coverage is not very high. Would it be sufficient to run this job?

    canu -p assembly-prefix -d assembly-directory \
      genomeSize=24.0m \
      corErrorRate=0.105 \
      minReadLength=500 \
      corOutCoverage=80 \
      [other-options] \
      -pacbio-raw *fastq

Would it be better to run the three top-level tasks by hand, or not?

Best

brianwalenz commented 7 years ago

It's always a good start to run with just the defaults. They do a remarkably good job for most assemblies.

If you aren't running the latest unreleased code from GitHub, upgrade to that. We're very close to releasing 1.5, so you'd essentially be running the next release.

How much coverage do you have? Run canu with just the defaults, and examine the *.report generated. It'll give a read length histogram, coverage, and much more later on in the process.

WIAIW commented 7 years ago

Thank you for your reply. I'm using canu-1.4. The coverage is more or less 20X; this is why I thought to set at least corErrorRate=0.105. Another question: the species is heterozygous. Do I have to set any parameters for this?

brianwalenz commented 7 years ago

20x is the minimum suggested for any assembly.

Read through http://canu.readthedocs.io/en/latest/faq.html, in particular the 'low coverage' and 'smash haplotypes together' sections. You'll want to increase the allowed error rates to try to correct as many reads as possible, and then to assemble the haplotypes together, since there isn't enough coverage to separate them: correctedErrorRate=0.075 corOutCoverage=200 ovlErrorRate=0.15 obtErrorRate=0.15.
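Putting those suggestions together, a low-coverage heterozygous run might look like the sketch below. The prefix, directory, and read file are placeholders, and the parameter spellings assume the 1.5-era code (1.4 names some of them differently):

```shell
# Sketch only: smash haplotypes together at ~20x coverage.
# correctedErrorRate requires the latest (1.5-era) code; see the FAQ.
canu -p asm -d asm-het \
  genomeSize=24m \
  correctedErrorRate=0.075 \
  corOutCoverage=200 \
  ovlErrorRate=0.15 obtErrorRate=0.15 \
  -pacbio-raw reads.fastq
```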

The top of http://canu.readthedocs.io/en/latest/parameter-reference.html might be helpful too, but just as background.

corErrorRate (not to be confused with correctedErrorRate) is needed only for old or especially noisy reads.

WIAIW commented 7 years ago

If I download the S. cerevisiae subset and run the assembler adding the sensitive parameters (correctedErrorRate=0.105) as shown in the tutorial, Canu doesn't work. It tells me Parameter 'correctedErrorRate' is not known. What's the problem?

skoren commented 7 years ago

That is a difference between 1.4 and the upcoming 1.5 release. If you update to the latest code you can use the correctedErrorRate parameter; otherwise, see the archived 1.4 documentation: http://canu.readthedocs.io/en/stable/

How heterozygous is your genome? There are some documented suggestions in the FAQ: http://canu.readthedocs.io/en/latest/faq.html#what-parameters-can-i-tweak. However, heterozygosity has the effect of reducing your coverage, so each of your haplotypes will be closer to 10X, not 20X (assuming diploid), or worse (assuming polyploid).

WIAIW commented 7 years ago

Hello again. I'm also having issues running Canu on our cluster; the error says:

id: cannot find name for group ID 17033484
-- Canu v1.4 (+0 commits) r7995 7b04cd09002d6b865ca05f4a3f53edb936b5c925.
-- Detected Java(TM) Runtime Environment '1.8.0_121' (from '/home/xdemo002/Downloads/jre1.8.0_121//bin/java').
-- Detected gnuplot version '5.0 patchlevel 6' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 63 gigabytes of memory.
critical error: can't resolve group
-- User supplied Grid Engine environment '-pe make THREADS'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY -l mem_free=MEMORY'.
critical error: can't resolve group
--
Undefined subroutine &canu::Configure::caExit called at /home/xdemo002/Downloads/canu-1.4/Linux-amd64/bin/lib/canu/Configure.pm line 192.

and line 192 says:

  if ($class eq "grid") {
        my @grid = split '\0', getGlobal("availableHosts");

        if (scalar(@grid) == 0) {
            caExit("invalid useGrid (" . getGlobal("useGrid") . ") and gridEngine (" . getGlobal("gridEngine") . "); found$
        }

I'm trying to run Canu using the ecoli dataset. The command is:

    canu -p ecoli -d ecoli-auto_2 genomeSize=4.8m \
      gridEngineMemoryOption="-l h_vmem=MEMORY -l mem_free=MEMORY" \
      gridEngineThreadsOption="-pe make THREADS" \
      gridOptions="-V -S /bin/sh" \
      useGrid=1 gridEngine="SGE" \
      -pacbio-raw p6.25x.fastq

How do I set up the assembler to run on our cluster?

skoren commented 7 years ago

It sounds like qhost and qconf are returning errors on your system; Canu uses them to figure out the available resources on your grid. Are you able to run those commands by hand (qhost or qconf -sconf)? If you cannot use them on your grid, you can work around it with the latest version in the repo (see issue #392).
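As a quick sanity check (a generic shell sketch, not part of Canu), you can confirm the SGE client tools are on the PATH before digging further:

```shell
#!/bin/sh
# Report whether the SGE client commands Canu shells out to are available.
for cmd in qhost qconf qsub; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
  else
    echo "missing: $cmd"
  fi
done
```

Anything reported missing on a node means Canu's resource probing (and job submission) will fail when it runs there.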

WIAIW commented 7 years ago

Yes, I am able to run both commands (qhost and qconf -sconf) by hand.

skoren commented 7 years ago

Does this happen right when you run the Canu command (on the submission node), or after Canu submits the first command to the grid? Can all nodes on your grid submit jobs and run qhost/qconf? If not, you will have issues running Canu.

Either way, you can use the workarounds in the issue I referenced.

WIAIW commented 7 years ago

This is what happened:

/Downloads/canu-1.4/Linux-amd64/bin$ canu -p ecoli -d ecoli-auto_2 genomeSize=4.8m gridEngineMemoryOption="-l h_vmem=MEMORY -l mem_free=MEMORY" gridEngineThreadsOption="-pe make THREADS" gridOptions="-V -S /bin/sh" useGrid=1 gridEngine="SGE" -pacbio-raw p6.25x.fastq
-- Canu v1.4 (+0 commits) r7995 7b04cd09002d6b865ca05f4a3f53edb936b5c925.
-- Detected Java(TM) Runtime Environment '1.8.0_121' (from '/home/xdemo002/Downloads/jre1.8.0_121//bin/java').
-- Detected gnuplot version '5.0 patchlevel 6' (from 'gnuplot') and image format 'png'.
-- Detected 8 CPUs and 47 gigabytes of memory.
-- User supplied Grid Engine environment '-pe make THREADS'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY -l mem_free=MEMORY'.
--
-- Found  10 hosts with  12 cores and   63 GB memory under Sun Grid Engine control.
-- Found   3 hosts with  12 cores and   31 GB memory under Sun Grid Engine control.
-- Found   7 hosts with   4 cores and    7 GB memory under Sun Grid Engine control.
-- Found   1 host  with   4 cores and   15 GB memory under Sun Grid Engine control.
-- Found   7 hosts with  16 cores and   62 GB memory under Sun Grid Engine control.
-- Found   1 host  with  12 cores and  126 GB memory under Sun Grid Engine control.
--
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    2 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    1 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   4 compute threads and   10 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    8 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   2 compute threads and    6 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    6 GB memory for stage 'minimap (overlapper)'.
----------------------------------------
-- Starting command on Thu Apr 13 16:14:02 2017 with 303.821 GB free disk space

    qsub \
      -l h_vmem=8g \
      -l mem_free=8g \
      -pe make 1 \
      -V \
      -S /bin/sh  \
      -cwd \
      -N "canu_ecoli" \
      -j y \
      -o /home/xdemo002/Downloads/canu-1.4/Linux-amd64/bin/ecoli-auto_2/canu-scripts/canu.01.out /home/xdemo002/Downloads/canu-1.4/Linux-amd64/bin/ecoli-auto_2/canu-scripts/canu.01.sh
Your job 178827 ("canu_ecoli") has been submitted

-- Finished on Thu Apr 13 16:14:02 2017 (lickety-split) with 303.821 GB free disk space

It actually finishes right away. Now I'm going to look at the workarounds in the issue you referenced.

skoren commented 7 years ago

This is correct: it submits the job to run on the grid and exits. Where was the error you posted before from, canu.out in the run directory? In that case, it seems your grid nodes cannot run qhost/qconf. Have you tried running qconf on one of the compute nodes by hand to see if it works? I would also check whether the compute nodes on your grid are allowed to run qsub. If they cannot, you won't be able to use Canu on your grid; you would need to run with useGrid=remote and manually submit jobs to the grid, or run on a single machine.
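If the compute nodes can't reach the scheduler, the remote workflow looks roughly like the sketch below (assuming the useGrid=remote spelling used by current Canu releases; the dataset names mirror the run above):

```shell
# Sketch: Canu prepares grid jobs but never calls qsub itself.
canu -p ecoli -d ecoli-remote genomeSize=4.8m \
  useGrid=remote \
  -pacbio-raw p6.25x.fastq
# Canu stops at each stage and prints the submit command it would have
# run; submit that by hand from a host that can run qsub, then re-run
# the same canu command to continue to the next stage.
```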

WIAIW commented 7 years ago

Yes, the error was in canu.out, inside the canu-scripts folder. And yes, I tried running qconf on one of the compute nodes by hand, and it worked.

skoren commented 7 years ago

I would check with your IT, then, whether anything special is required in the submit command or environment to allow you to use qsub/qconf/etc., since they work in your interactive session but not in the Canu run. Based on a quick search, the critical error: can't resolve group message is a bug in some versions of SGE.