marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Bogart failed #1141

Closed pomidorku closed 5 years ago

pomidorku commented 5 years ago

Hello,

I am running Canu on a High Performance Computing cluster. I used Canu successfully for a bacterial genome assembly (draft). I am now trying to assemble another genome, but the job has failed twice. The message I got was similar both times: "Bogart failed, tried 2 times, giving up."

I used the following commands:

module load Westmere
module load Canu/1.7-intel-2017A-Perl-5.24.0

canu useGrid=false -p moell -d /scratch/user/user/user2/Moellerella_4_E06/moell_out_out genomeSize=3.3m -pacbio-raw /scratch/user/user/user2/Moellerella_4_E06/m54092_180525_040741.subreadsFQ.fastq

The last lines of the failed job output are:

*****************************************************************************************************
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Tue Nov  6 13:26:10 2018 with 470454.578 GB free disk space (11 processes; 40 concurrently)

    cd unitigging/3-overlapErrorAdjustment
    ./oea.sh 1 > ./oea.000001.out 2>&1
    ./oea.sh 2 > ./oea.000002.out 2>&1
    ./oea.sh 3 > ./oea.000003.out 2>&1
    ./oea.sh 4 > ./oea.000004.out 2>&1
    ./oea.sh 5 > ./oea.000005.out 2>&1
    ./oea.sh 6 > ./oea.000006.out 2>&1
    ./oea.sh 7 > ./oea.000007.out 2>&1
    ./oea.sh 8 > ./oea.000008.out 2>&1
    ./oea.sh 9 > ./oea.000009.out 2>&1
    ./oea.sh 10 > ./oea.000010.out 2>&1
    ./oea.sh 11 > ./oea.000011.out 2>&1

-- Finished on Tue Nov  6 14:42:28 2018 (4578 seconds) with 470179.875 GB free disk space
----------------------------------------
-- Found 11 overlap error adjustment output files.
----------------------------------------
-- Starting command on Tue Nov  6 14:42:29 2018 with 470179.859 GB free disk space

    cd unitigging/3-overlapErrorAdjustment
    /general/software/x86_64/easybuild/Westmere/software/Canu/1.7-intel-2017A-Perl-5.24.0/Linux-amd64/bin/ovStoreBuild \
      -G ../moell.gkpStore \
      -O ../moell.ovlStore \
      -evalues \
      -L ./oea.files \
    > ./oea.apply.err 2>&1

-- Finished on Tue Nov  6 14:42:33 2018 (4 seconds) with 470178.921 GB free disk space
----------------------------------------
-- No report available.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'bat' concurrent execution on Tue Nov  6 14:42:33 2018 with 470178.875 GB free disk space (1 processes; 1 concurrently)

    cd unitigging/4-unitigger
    ./unitigger.sh 1 > ./unitigger.000001.out 2>&1

-- Finished on Tue Nov  6 14:42:33 2018 (lickety-split) with 470178.875 GB free disk space
----------------------------------------
--
-- Bogart failed, retry
--
--
-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'bat' concurrent execution on Tue Nov  6 14:42:33 2018 with 470178.859 GB free disk space (1 processes; 1 concurrently)

    cd unitigging/4-unitigger
    ./unitigger.sh 1 > ./unitigger.000001.out 2>&1

-- Finished on Tue Nov  6 14:42:33 2018 (lickety-split) with 470178.859 GB free disk space
----------------------------------------
--
-- Bogart failed, tried 2 times, giving up.
--

ABORT:
ABORT: Canu 1.7
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

The output "unitigging" folder does not contain a "4-unitigger" folder (which . It only contains two folders:

0-mercounts 1-overlapper and a link to "moell.gkpStore"

I can see in the lines previous to the error messages that Canu is trying to access a folder that does not exists: "cd unitigging/4-unitigger"

Please, can someone help me? Let me know if you need more information.

Regards,

Marcel

skoren commented 5 years ago

I'm pretty sure the folder is there, it is able to change to it and to run/find the shell script or run the previous step so I suspect you're not looking in the right location. Is it possible your run got cleaned up by the grid (since you're running in scratch)? The unitigger.err file will have more information, I suspect it's related to #1021 or similar. Canu 1.7 is also quite old so it may be a bug which has since been fixed in 1.7.1 or 1.8.

pomidorku commented 5 years ago

Mr Skoren,

Thank you for your answer.

Canu 1.7 did successfully complete a previous genome assembly job. I used the same script for both the successfully completed job and the failed one; I just changed some parameters relevant to the new job.

When looking into the output folder for the successfully completed job, I can see all the folders that are supposed to be there. The failed job is missing several folders (likely, those folders were not created because the job was not completed).

It is unlikely that the run got cleaned up by the grid, but I will ask.

One difference I see is that the fastq file for the new bacterium (the one that Canu fails to assemble) is much larger than the previous one. The assembly of the larger one may be much more complex and demand more resources than the smaller one did. Please see the resources allocated to the job:

    #BSUB -n 40                    # assigns 40 cores for execution
    #BSUB -R "span[ptile=40]"      # assigns 40 cores per node
    #BSUB -R "rusage[mem=24500]"   # reserves 24.5 GB memory per core
    #BSUB -M 24500                 # sets a 24.5 GB per-process enforceable memory limit (M * n)

The unitigger.err file reads:


==> PARAMETERS.

Resources: Memory 16 GB, Compute Threads 4 (command line)

Lengths: Minimum read 0 bases, Minimum overlap 500 bases

Overlap Error Rates: Graph 0.045 (4.500%), Max 0.045 (4.500%)

Deviations: Graph 6.000, Bubble 6.000, Repeat 3.000

Edge Confusion: Absolute 2100, Percent 200.0000

Unitig Construction: Minimum intersection 500 bases, Maximum placements 2 positions

Debugging Enabled: (none)

==> LOADING AND FILTERING OVERLAPS.

ReadInfo()-- Using 961103 reads, no minimum read length used.

OverlapCache()-- limited to 16384MB memory (user supplied).

OverlapCache()--      7MB for read data.
OverlapCache()--     36MB for best edges.
OverlapCache()--     95MB for tigs.
OverlapCache()--     25MB for tigs - read layouts.
OverlapCache()--     36MB for tigs - error profiles.
OverlapCache()--   4096MB for tigs - error profile overlaps.
OverlapCache()--      0MB for other processes.
OverlapCache()-- ---------
OverlapCache()--   4315MB for data structures (sum of above).
OverlapCache()-- ---------
OverlapCache()--     18MB for overlap store structure.
OverlapCache()--  12049MB for overlap data.
OverlapCache()-- ---------
OverlapCache()--  16384MB allowed.
OverlapCache()--
OverlapCache()-- Retain at least 1527 overlaps/read, based on 763.64x coverage.
OverlapCache()-- Initial guess at 821 overlaps/read.
OverlapCache()--
OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.


Does it mean Canu did not have enough resources to complete the job?

The content of unitigger.000001.out is:

Running job 1 based on command line options.
./unitigger.sh: line 82: ../moell.ctgStore/seqDB.v001.sizes.txt: No such file or directory

I cannot find the "moell.ctgStore" folder in the output folder.

Best regards,

Marcel

skoren commented 5 years ago

It looks like you were able to find the 4-unitigger folder after all, since you have the unitigger.err log? Yep, that error is identical to issue #1021: you've got almost 1000x coverage. You can either re-run unitigger.sh with a larger memory request (I would try at least 32 or maybe 48 GB) and then re-launch Canu, or use the subsampling options listed in #1021, which were added in the 1.8 release.

pomidorku commented 5 years ago

Yes, sir. The content of the unitigging folder is:

4-unitigger moell.ovlStore 3-overlapErrorAdjustment 1-overlapper 0-mercounts moell.gkpStore moell.ovlStore.summary moell.ovlStore.per-read.log

The folder 4-unitigger contains: unitigger.sh unitigger.err unitigger.000001.out

I will run Canu 1.8 and let you know if the problem gets solved.

Regards, Marcel

skoren commented 5 years ago

1.8 will not solve the issue on its own; you would need to specify the random downsampling parameters and re-start from scratch. For this one case, it's probably faster to modify the unitigger.sh script and increase the -M parameter (assuming you have a 32 or 48 GB machine available).
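
For example, a minimal sketch of that edit (the exact "-M 16" string is an assumption based on the "Memory 16 GB (command line)" line in unitigger.err; check your copy of the script and adjust the value to what your machine has):

    cd unitigging/4-unitigger
    # raise bogart's memory limit from 16 GB to 48 GB in the generated script
    sed -i 's/-M 16/-M 48/' unitigger.sh
    # re-run the failed job, then re-launch the original canu command so it resumes from here
    ./unitigger.sh 1 > ./unitigger.000001.out 2>&1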

pomidorku commented 5 years ago

All the information I sent today, Nov the 7th, comes from the second failed run.

My original post (from Nov the 6th) contained information from the first failed run. The two failures produced different outputs (e.g., no "4-unitigger" folder in the first run).

The "Bogart failed" message is common to both.

Regards,

skoren commented 5 years ago

I don't know any way you could get the bogart failed message and not have a 4-unitigger folder or a 3-overlapErrorAdjustment folder. Are the two runs re-tries of the same data, or are these two different genomes? If it's the same data, just use my suggested fix for the run that has the 4-unitigger folder.

pomidorku commented 5 years ago

Thank you very much. I will try your suggestions.

marcel

skoren commented 5 years ago

Duplicate of #1021

pomidorku commented 5 years ago

Canu 1.8 exits with no error message, but no assembly is produced.

For the previous attempt (using Canu 1.7) that failed due to insufficient memory, I used the following command:

module load Canu/1.7-intel-2017A-Perl-5.24.0
canu useGrid=false -p $prefix -d $assembly_directory genomeSize=$genome_size -pacbio-raw $pacbio_raw_reads

Canu has recently been updated to version 1.8 on our grid.

The people at the IT service recommended that I resubmit the full job using the following command:

canu useGrid=true gridOptions="-L /bin/bash -W 168:00 -R 'rusage[mem=14000]' -M 14000" \
preExec="module load Canu/1.8-foss-2017A-Perl-5.24.0-ppc64" java=$EBROOTJAVA/bin/java \
gnuplot=$EBROOTGNUPLOT/bin/gnuplot gnuplotImageFormat=png minThreads=16 \
utgmhapMemory=220 utgmmapMemory=220 utgovlMemory=220 \
-p moellcurie -d /scratch/user/XXX/XXXXXXX/YYYYYYYYYYYYYY_4_E06/moell_out_out genomeSize=3.3m -pacbio-raw /scratch/user/XXX/XXXXXXX/YYYYYYYYYYYYYY_4_E06/m54092_180525_040741.subreadsFQ.fastq

In the stderr file there are several warnings:

------------------------------------------------------------------------------------------------------------------
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_161' (from '/sw/eb/software/JDK/8.0-5.10-ppc64/bin/java') without -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command '/sw/eb/software/gnuplot/5.0.6-foss-2017A-Python-2.7.12/bin/gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 16 CPUs and 247 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /software/lsf/9.1/linux2.6-glibc2.3-ppc64/bin/bsub.
Argument "-" isn't numeric in int at /general/software/ppc64/easybuild/software/Canu/1.8-foss-2017A-Perl-5.24.0-ppc64/Linux-ppc64/bin/../lib/site_perl/canu/Grid_LSF.pm line 152, <F> line 42.
Argument "-" isn't numeric in int at /general/software/ppc64/easybuild/software/Canu/1.8-foss-2017A-Perl-5.24.0-ppc64/Linux-ppc64/bin/../lib/site_perl/canu/Grid_LSF.pm line 152, <F> line 84.
.
.
.
Argument "-" isn't numeric in int at /general/software/ppc64/easybuild/software/Canu/1.8-foss-2017A-Perl-5.24.0-ppc64/Linux-ppc64/bin/../lib/site_perl/canu/Grid_LSF.pm line 152, <F> line 935.
--------------------------------------------------------------------------------------------------------------------------

Another warning:

-- Finished on Mon Nov 12 15:37:05 2018 (76 seconds) with 464436.218 GB free disk space
----------------------------------------
--
-- WARNING: gnuplot failed.
--
----------------------------------------
--
---------------------------------------------------------------------------------------------------------------------------

The final lines of the stderr read:
---------------------------------------------------------------------------------------------------------------------------

--  For 961103 reads with 8409883891 bases, limit to 84 batches.
--  Will count kmers using 02 jobs, each using 10 GB and 16 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job ended(7875587) tasks 1-2.
--
----------------------------------------
-- Starting command on Mon Nov 12 15:37:20 2018 with 464436.203 GB free disk space

    cd /scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out_out
    bsub \
      -w "ended(7875587)"  \
      -L /bin/bash \
      -W 168:00 \
      -R 'rusage[mem=14000]' \
      -M 14000 \
      -M 4096 \
      -R span[hosts=1] \
      -n 1 \
      -J 'canu_moellcurie' \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh
Verifying job submission parameters...

-- Finished on Mon Nov 12 15:37:20 2018 (fast as lightning) with 464436.203 GB free disk space
----------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------

Thank you for your help

pomidorku commented 5 years ago

Just to make sure that I am clear, no contigs.fasta files were produced.

skoren commented 5 years ago

There's no error in your run; your IT group had you use useGrid=true instead of your previous useGrid=false, so Canu will keep submitting itself to the grid. As long as you have jobs running in the queue and you don't see an error in the canu.out file, it is fine.

I think the parameters they had you use are incorrect. I wouldn't specify any of the memory options you are currently using (utgmhapMemory/etc.) and would just let Canu use its defaults. I also would not put -R 'rusage[mem=14000]' -M 14000 in gridOptions, because that will always limit the memory Canu requests from the grid to 14 GB while you're telling it to use 220 GB with the other options. Instead you can use batMemory=220.

pomidorku commented 5 years ago

Thank you for your prompt answer:

I would probably modify the command to include "batMemory=220", like so:

canu useGrid=true gridOptions="-L /bin/bash -W 168:00" \
  preExec="module load Canu/1.8-foss-2017A-Perl-5.24.0-ppc64" java=$EBROOTJAVA/bin/java \
  gnuplot=$EBROOTGNUPLOT/bin/gnuplot gnuplotImageFormat=png minThreads=16 \
  -p moellcurie -d /scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out_out genomeSize=3.3m \
  batMemory=220 \
  -pacbio-raw /scratch/user/ivr/TARONE/Moellerella_4_E06/m54092_180525_040741.subreadsFQ.fastq

How about the gnuplot options? After all, they seem to be failing.

Regards

skoren commented 5 years ago

Your gnuplot is not running but that's optional, you just won't get any of the png images output from the run.

pomidorku commented 5 years ago

Thank you, I will remove that line from the command.

pomidorku commented 5 years ago

This is a follow-up on the job that I reported as failed yesterday, the 14th. It is somewhat confusing, because the stderr.out is dated 11/12/2018 and does not report any error, but the Canu script reporting that the job failed was produced yesterday, 11/14/2018.

It seems that the job failed. I am attaching the last script produced by Canu in the moell_out/canu-scripts folder, reporting that the job failed.

canu.14.txt

Could you please suggest how to get the Canu job to end successfully?

Regards,

pomidorku commented 5 years ago

I ask this because, although you suggested changes to the script, perhaps the Canu error message can help us better understand the reason for this failure.

Regards,

skoren commented 5 years ago

The error indicates the k-mer counting didn't run. There should be more information in the unitigging/0-mercounts folder, post those out files.

pomidorku commented 5 years ago

In moell_out/unitigging/0-mercounts/ there is a folder named moellcurie.01.meryl.WORKING, which I did not see in a finished job I ran before. Also, the moell_out/unitigging/0-mercounts/ folder contains a 1 GB core.38019 file, which suggests that this is the step that ran out of resources.

pomidorku commented 5 years ago

The files contained in the unitigging/0-mercounts folder are the following:

ignat.ms22.estMerThresh.err
ignat.ms22.estMerThresh.out
ignat.ms22.frequentMers.fasta
ignat.ms22.histogram
ignat.ms22.histogram.info
ignat.ms22.histogram.lg.gp
ignat.ms22.histogram.lg.png
ignat.ms22.histogram.sm.gp
ignat.ms22.histogram.sm.png
meryl.000001.out
meryl.sh
meryl.success

pomidorku commented 5 years ago

Sorry, the previous has the list of the successful job. Below is the list of the failed job:

moellcurie.01.meryl.WORKING
meryl-configure.sh
meryl-configure.err
moellcurie.ms22.config.01.out
moellcurie.ms22.config.02.out
moellcurie.ms22.config.04.out
moellcurie.ms22.config.06.out
moellcurie.ms22.config.08.out
moellcurie.ms22.config.12.out
moellcurie.ms22.config.16.out
moellcurie.ms22.config.20.out
meryl-count.sh
meryl-make-ignore.pl
meryl-process.sh
moellcurie.ms22.config.24.out
core.38019
meryl-count.jobSubmit-01.out
meryl-count.jobSubmit-01.sh
meryl-count.1.out

skoren commented 5 years ago

Can you post the content of the out files, particularly meryl-count.1.out

brianwalenz commented 5 years ago

Only need meryl-count.1.out and meryl-count.jobSubmit-01.sh (to see how much memory it requested from the grid).

pomidorku commented 5 years ago

I am posting one file from 11/12/2018 (things were fine) and all three files from 11/14/2018, when the job failed:

moellcurie.ms22.config.24_11_12_2018.txt

meryl-count.1.outcount.jobSubmit-01.sh_11_14_2018.txt

meryl-count.jobSubmit-01.outcount.jobSubmit-01.sh_11_14_2018.txt

meryl-count.jobSubmit-01.sh_11_14_2018.txt

pomidorku commented 5 years ago

The names got messed up when posting them, but the two files you requested are there.

pomidorku commented 5 years ago

I have about 250 GB free space in my scratch folder. Do I need more disk space to run canu?

brianwalenz commented 5 years ago

What OS and architecture is this on? The early runs with Canu 1.7 were on Intel, but the later runs with Canu 1.8 appear to be on IBM Power processors? I'm not sure we properly support that.

pomidorku commented 5 years ago

ppc64 IBM 8-core 4.2 GHz Power7+

pomidorku commented 5 years ago

I can still run Canu 1.8 on the Intel grid. The IT people advised me to run Canu on the other IBM cluster because it has more memory available.

brianwalenz commented 5 years ago

Except we (apparently) don't run well there. :-( And without access to such a machine, all my fixes will be wild guesses.

For now, stick with the Intel version. I'd advise restarting from scratch, too - the binary formats might be incompatible and who knows what other weirdness will show up.

Probably also better to use useGrid=false and run on a single node like you were with 1.7. This isn't a big enough genome to really benefit from a grid, and I think it's a bit easier to see what is going on and/or wrong with the run.

skoren commented 5 years ago

Also clean up old assembly folders that didn't finish/you're not running to save space. One of the errors in the logs you posted was a quota limit.

pomidorku commented 5 years ago

Thank you, I will do it.

pomidorku commented 5 years ago

Hello again,

I am running the assembly again; it has been going for about 5 days now. Canu is on its second attempt. The first attempt failed, but Canu automatically restarted from where it left off.

The failure for the first attempt read:

-- Kmer counting (meryl-count) jobs failed, retry.
-- job moellcurie.01.meryl FAILED.

PS: "Fail to open output file canu-scripts/canu.13.out: Disk quota exceeded. Output is stored in this mail."

There is now a "canu-scripts/canu.13" file on disk.

The IT people increased my disk space to 5TB (about 400GB are occupied now).

You mentioned that this genome has about 1000x coverage (how can you tell?). The IT people at the institution where I work told me that they do not know how Canu will perform at such high coverage.

Although the job is still running, the IT people advised me to kill it and reduce the data to 200x coverage.

With regard to coverage reduction, I have had different people advising me different criteria.

Some advised me that I should keep all the larger sequences and remove the ones that are less than 2,000bp. If it fails, remove the sequences smaller than 3,000bp and rerun canu. I should keep removing the smaller sequences (4,000bp, 5,000bp, and so on) until canu can complete the assembly successfully.

Others think that it is better to keep about 10-20% of the sequences in the fastq file, selected at random, and run the assembly on that subset.

There are about 1 million sequences in the file.

I am uploading a text file with the summary of the sequences sizes in the fastq file I am trying to use for the assembly.

seqsizes.txt.txt

Would you advise me to reduce the number of sequences in the fastq file? If so, which of the above criteria would you use?

Regards,

pomidorku commented 5 years ago

I forgot to mention that the step that is taking a while is the unitigging one. So far, there is only a 0-mercounts folder within the unitigging directory. In the 0-mercounts folder there is a sub-folder that keeps updating, named "moellcurie.01.meryl.WORKING". Also, within 0-mercounts, the file named "meryl-count.1.out" is constantly being updated.

Regards

skoren commented 5 years ago

You can tell the coverage from the unitigging log:

OverlapCache()-- Retain at least 1527 overlaps/read, based on 763.64x coverage.
OverlapCache()-- Initial guess at 821 overlaps/read.

or the report file which will give you the number of reads and bases along with the coverage.

The initial error was running out of space; it seems to be running fine now, and the WORKING folder means the job is still in progress. If you do subsample, I recommend sampling randomly, not by length: the longest 200x is not necessarily the highest-quality 200x. See http://canu.readthedocs.io/en/latest/parameter-reference.html#readsamplingcoverage for details, but you would want something like readSamplingCoverage=200.

pomidorku commented 5 years ago

Dr. Skoren,

Thank you so much for your help. Sorry for the delay in responding to your comment; the Thanksgiving holiday is taking up part of my time. I wish you a happy Thanksgiving.

Canu has about 72 hours left (of the 168 hours I requested for the job) and it is still working on moellcurie.01.meryl.WORKING.

meryl-count.1.out shows that it has been working on the 9245-kmer batch for more than a day now (last time it seems to have run out of resources at this step).

The list of kmer batches Canu has worked on so far looks like this (I deleted the many repeated lines that read "Memory full. Writing results to './moellcurie.01.meryl.WORKING', using 16 threads." and left only one as an example):

Used 48.744 GB out of 10.000 GB to store         7486 kmers.
Memory full.  Writing results to './moellcurie.01.meryl.WORKING', using 16 threads.
Used 48.362 GB out of 10.000 GB to store         2739 kmers.
Used 48.661 GB out of 10.000 GB to store         7253 kmers.
Used 48.652 GB out of 10.000 GB to store         8262 kmers.
Used 48.351 GB out of 10.000 GB to store        12160 kmers.
Used 48.587 GB out of 10.000 GB to store        11099 kmers.
Used 48.450 GB out of 10.000 GB to store         5598 kmers.
Used 48.449 GB out of 10.000 GB to store         7072 kmers.
Used 48.404 GB out of 10.000 GB to store        10317 kmers.
Used 48.342 GB out of 10.000 GB to store        10048 kmers.
Used 48.327 GB out of 10.000 GB to store         2759 kmers.
Used 48.501 GB out of 10.000 GB to store        10577 kmers.
Used 48.367 GB out of 10.000 GB to store        12582 kmers.
Used 48.330 GB out of 10.000 GB to store        12564 kmers.
Used 48.327 GB out of 10.000 GB to store         8148 kmers.
Used 48.492 GB out of 10.000 GB to store        11956 kmers.
Used 48.362 GB out of 10.000 GB to store         2690 kmers.
Used 48.356 GB out of 10.000 GB to store         2237 kmers.
Used 48.345 GB out of 10.000 GB to store         1986 kmers.
Used 48.339 GB out of 10.000 GB to store        13650 kmers.
Used 48.323 GB out of 10.000 GB to store        13654 kmers.
Used 48.488 GB out of 10.000 GB to store        15518 kmers.
Used 48.326 GB out of 10.000 GB to store         5647 kmers.
Used 48.323 GB out of 10.000 GB to store         1873 kmers.
Used 48.321 GB out of 10.000 GB to store         1916 kmers.
Used 48.320 GB out of 10.000 GB to store         1934 kmers.
Used 48.320 GB out of 10.000 GB to store        12270 kmers.
Used 48.320 GB out of 10.000 GB to store        12291 kmers.
Used 48.320 GB out of 10.000 GB to store         9245 kmers.

The last line above (the one about 9245 kmers) is followed (so far) by about 3500 lines of "Memory full. Writing results to './moellcurie.01.meryl.WORKING', using 16 threads".

The time left for this job makes me think that it may fail to complete once again.

Regards, and happy Thanksgiving again.

skoren commented 5 years ago

This looks the same as the previous output on PPC, did you switch back to an intel machine?

pomidorku commented 5 years ago

Hello,

I am back. Here is the story of this run. The institution I work for uses some type of "currency" to run jobs on the clusters. Until yesterday, Nov the 26th, I had no currency left to run any job on the INTEL cluster. However, the computer scientists who manage the clusters (I call them the "IT people"), seeing that my first attempt to assemble this genome on the INTEL cluster had failed due to insufficient memory, suggested that I run the job on the IBM cluster, where more memory is available.

Other researchers have successfully run assemblies on the IBM cluster, but they used Canu 1.7. Canu 1.8 was only recently built on the IBM cluster, and no tests had been done on how it behaves when assembling high-coverage genomes. Now we know that it does not run well at high coverage (about 1000x), even for a small genome (3.3 Mbp).

The IT people want me to run the assembly on the IBM cluster again. In their words, I could run assemblies "at 50x and run a test build. Then you could assemble at 100x and 200x+ and compare builds".

However, I now have currency to run the assembly on the INTEL cluster, where my first attempt failed due to insufficient memory. I plan to run the assembly job on the INTEL cluster using Canu 1.7. PacBio recommends 40-fold coverage for haploid genomes of 3.3 Mbp (like mine).

The question is: what n-fold coverage would you recommend? I know the IT people at my work suggested 200x, but I would like to know your opinion. I think I cannot run the full ~1000x coverage, given the previous failures.

Regards,

skoren commented 5 years ago

There were large changes from 1.7 to 1.8 in the meryl step, which is what is failing. The coverage doesn't matter; it is likely the IBM system reports memory differently than the Intel system, causing the issue.

You could run Canu 1.7 on the IBM cluster or 1.8 on the Intel machine. Either way, I'd recommend a random 200x coverage subset (not the longest reads), which, if you are running 1.7, you'll have to generate yourself.

pomidorku commented 5 years ago

Thank you for your answer.

I understand that, by default, Canu removes sequences shorter than 1000bp. For this reason I believe that when manually extracting a subset for Canu 1.7 I will have to first remove the sequences shorter than 1000bp.

As for estimating the number of sequences that will produce ~200x coverage, I could use algebra and estimate (961103 * 200) / 763.64 ≈ 252,000 sequences. This is for running Canu 1.7.

For Canu 1.8 I will use readSamplingCoverage=200.

For the INTEL cluster, will I still have to use batMemory=220 for the contigging step?

Regards,

skoren commented 5 years ago

Nope, you don't need to use batMemory=220; that is only needed if you don't do any downsampling.

The 1000 bp reads are not going to matter too much; they account for less than 10x of your data based on the sequence length histogram you posted. I'm not sure your math is accurate: based on that histogram, you have about 2500x coverage in reads 1 kb or bigger, and the average read length is 8.7 kb. That means that to get 200x you want about 75k reads (200 * 3300000 / 8700). Alternatively, since you want 200x out of 2500x, you could just keep 8% of the input data at random. The two strategies are essentially the same (75k out of your total 961k reads is about 8%).
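
For reference, one common way to take that random ~8% subset outside of Canu (a sketch, assuming seqtk is installed; the output file name is illustrative):

    # keep ~8% of the reads, chosen at random; the -s seed makes the sampling reproducible
    seqtk sample -s100 m54092_180525_040741.subreadsFQ.fastq 0.08 > moell.200x.subreads.fastq

The resulting fastq can then be given to Canu 1.7 in place of the full read set.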

pomidorku commented 5 years ago

Thank you. I will let you know how the assembly goes.

pomidorku commented 5 years ago

Hi,

I ran Canu 1.8 on the Intel machines, and it seems that the 200x-subsampled assembly took less than one hour to complete:

Starting command on Tue Nov 27 16:38:21
Finished on Tue Nov 27 17:35:35 2018

The canu command:

canu useGrid=false -p moell -d /scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out genomeSize=3.3m readSamplingCoverage=200 -pacbio-raw /scratch/user/ivr/TARONE/Moellerella_4_E06/m54092_180525_040741.subreadsFQ.fastq

The stderr and stdout files: stdout.7909542.txt stderr.7909542.txt

All the files and folders seem to be there in moell_out:

canu-logs
canu-scripts
correction
haplotype
moell.seqStore
trimming
unitigging
core.854
core.2466
core.16544
moell.contigs.fasta
moell.contigs.gfa
moell.contigs.layout
moell.contigs.layout.readToTig
moell.contigs.layout.tigInfo
moell.correctedReads.fasta.gz
moell.report
moell.seqStore.err
moell.seqStore.ssi
moell.trimmedReads.fasta.gz
moell.unassembled.fasta
moell.unitigs.bed
moell.unitigs.fasta
moell.unitigs.gfa
moell.unitigs.layout
moell.unitigs.layout.readToTig
moell.unitigs.layout.tigInfo

The moell_out folder occupies only 3.5GB out of my allocated 5TB.

The "4-unitigger/unitigger.err" file reads: OverlapCache()-- Retain at least 76 overlaps/read, based on 38.28x coverage (is this the final coverage for the assembly?)

Canu 1.8 is very fast on the Intel grid. The previous failure on the Intel grid occurred when running Canu 1.7 with the full set of raw sequences.

The PI for the project suggested that the full data be divided into 10 sets (or 5 sets), that an assembly be run for each set, and that a larger assembly then be generated from those 10 (or 5) assemblies. I believe this may not be such a good idea, because if those subsets are randomly chosen, they may produce similar results. Perhaps it is better to run one assembly with larger coverage, or to rerun the full data set in Canu 1.8, making sure that the unitigging step has enough memory (batMemory=220).

What is your opinion?

Regards,

skoren commented 5 years ago

Yes, it looks like it ran. I was concerned about the core files in your output folder, but those look like they were caused by gnuplot, so they wouldn't have affected the assembly.

You can see the info on the coverage/assembly stats in your stderr file. In particular:

-- In sequence store './moell.seqStore':
--   Found 8184 reads.
--   Found 126335702 bases (38.28 times coverage).
--
--   Read length histogram (one '*' equals 21.87 reads):
--     1000   1999    758 **********************************
--     2000   2999     64 **
--     3000   3999     21 
--     4000   4999      9 
--     5000   5999      9 
--     6000   6999     15 
--     7000   7999     10 
--     8000   8999     21 
--     9000   9999     21 
--    10000  10999     51 **
--    11000  11999     76 ***
--    12000  12999     68 ***
--    13000  13999    190 ********
--    14000  14999   1299 ***********************************************************
--    15000  15999   1531 **********************************************************************
--    16000  16999   1134 ***************************************************
--    17000  17999    810 *************************************
--    18000  18999    618 ****************************
--    19000  19999    434 *******************
--    20000  20999    289 *************
--    21000  21999    216 *********
--    22000  22999    155 *******
--    23000  23999    107 ****
--    24000  24999     85 ***
--    25000  25999     62 **
--    26000  26999     45 **
--    27000  27999     30 *
--    28000  28999     20 
--    29000  29999      6 
--    30000  30999      7 
--    31000  31999      6 
--    32000  32999      3 
--    33000  33999      7 
--    34000  34999      4 
--    35000  35999      2 
--    36000  36999      0 
--    37000  37999      0 
--    38000  38999      0 
--    39000  39999      0 
--    40000  40999      0 
--    41000  41999      0 
--    42000  42999      0 
--    43000  43999      0 
--    44000  44999      0 
--    45000  45999      0 
--    46000  46999      0 
--    47000  47999      1 
--
----------------------------------------
----------------------------------------
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2    503097 *********************************                                      0.1288 0.0083
--       3-     4    249692 ****************                                                       0.1720 0.0125
--       5-     7     97373 ******                                                                 0.2048 0.0171
--       8-    11     34362 **                                                                     0.2209 0.0205
--      12-    16     13372                                                                        0.2274 0.0227
--      17-    22     37447 **                                                                     0.2305 0.0241
--      23-    29    344509 **********************                                                 0.2448 0.0340
--      30-    37   1048826 *********************************************************************  0.3513 0.1288
--      38-    46   1059613 ********************************************************************** 0.6334 0.4439
--      47-    56    447623 *****************************                                          0.8881 0.7925
--      57-    67     55001 ***                                                                    0.9853 0.9517
--      68-    79      5100                                                                        0.9963 0.9734
--      80-    92       872                                                                        0.9973 0.9756
--      93-   106      1106                                                                        0.9975 0.9762
--     107-   121       477                                                                        0.9978 0.9772
--     122-   137       270                                                                        0.9979 0.9776
--     138-   154       419                                                                        0.9980 0.9779
--     155-   172       144                                                                        0.9981 0.9784
--     173-   191       312                                                                        0.9981 0.9785
--     192-   211       139                                                                        0.9982 0.9790
--     212-   232       140                                                                        0.9982 0.9792
--     233-   254       122                                                                        0.9983 0.9795
--     255-   277       222                                                                        0.9983 0.9797
--     278-   301       276                                                                        0.9983 0.9802
--     302-   326       344                                                                        0.9984 0.9809
--     327-   352       508                                                                        0.9985 0.9818
--     353-   379      3689                                                                        0.9986 0.9832
--     380-   407       953                                                                        0.9996 0.9950
--     408-   436       305                                                                        0.9998 0.9975
--     437-   466       229                                                                        0.9999 0.9985
--     467-   497       125                                                                        1.0000 0.9994
--     498-   529        10                                                                        1.0000 0.9999
--     530-   562         0                                                                        0.0000 0.0000
--     563-   596         1                                                                        1.0000 0.9999
--     597-   631         2                                                                        1.0000 0.9999
--     632-   667        13                                                                        1.0000 0.9999
--     668-   704         0                                                                        0.0000 0.0000
--     705-   742         3                                                                        1.0000 1.0000
--     743-   781         1                                                                        1.0000 1.0000
--
--           0 (max occurrences)
--   121547408 (total mers, non-unique)
--     3906697 (distinct mers, non-unique)
--           0 (unique mers)
-- Finished stage 'meryl-process', reset canuIteration.

You can see Canu selected the best 38x coverage of your data, which was mostly the longer sequences, and the k-mer histogram has a peak in the 38-46x coverage range, consistent with your genome size estimate.

-- Found, in version 2, after consensus generation:
--   contigs:      2 sequences, total length 3129801 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  905 sequences, total length 3679235 bp.
--
-- Contig sizes based on genome size 3.3mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     3082147             1     3082147
--     20     3082147             1     3082147
--     30     3082147             1     3082147
--     40     3082147             1     3082147
--     50     3082147             1     3082147
--     60     3082147             1     3082147
--     70     3082147             1     3082147
--     80     3082147             1     3082147
--     90     3082147             1     3082147

You've got a 2-contig 3.1 Mb assembly: one 3.08 Mb contig and one ~50 kb contig.

As for multiple random sampling or running the full assembly, it's unlikely to help anything. Canu will save short reads if they originate from a plasmid so you should be recovering any plasmids despite the downsampling. The issue is this rescue at very high coverages tends to save too many reads (hence your original error). Once you've got reads long enough to resolve the repeats, adding coverage doesn't really help anything.

You can try running multiple sampling assemblies and compare the assemblies to each other. I expect they will align over the full length, the only "merging" I would see doing is if one contig exists in one sample but has no alignments to the other samplings. I wouldn't expect this to happen though.

pomidorku commented 5 years ago

Thank you for your advice.

This is my first time using supercomputers and assembling genomes (as you can tell from my messages). Now I will be moving on to polishing. The IT people advised me to use Arrow. I will let you know how things go. Is there any tutorial on Arrow you would recommend?

Regards,

pomidorku commented 5 years ago

I can see in the stderr file the core dumps related to gnuplot. There are several core files in other folders. Since you brought up the core files, I want to show you the other core files I found in moell_out:

/scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out/unitigging/0-mercounts core.2862

/scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out/trimming/3-overlapbasedtrimming core.2418 core.2416 core.2414 core.2412 core.2410 core.2408 core.2406 core.2404 core.2390

/scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out/trimming/0-mercounts core.1518

/scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out/correction/0-mercounts core.17182

/scratch/user/ivr/TARONE/Moellerella_4_E06/moell_out core.2466 core.854 core.16544

pomidorku commented 5 years ago

I am still puzzled how you can tell the sizes of the two contigs that form the assembly. I was able to see their sizes in MEGAX for Windows, but not in the stderr file.

skoren commented 5 years ago

Yeah, I think all those core files were gnuplot related (you can search for "core dump" in the stderr file).

The snippet I posted above is from the stderr file (and also the report file), which tells you that there are 2 contigs and the total assembly size. Since it also gives the first contig's size (the NG stats), I can infer the second contig's size (3,129,801 total bp - 3,082,147 bp ≈ 48 kb).

As for arrow, I typically use my own wrapper around SMRTlink utilities (https://github.com/skoren/ArrowGrid) but you can also try the UI version assuming you have that available.

pomidorku commented 5 years ago

Running the ArrowGrid wrapper will cost me a lot of currency (much more than what I have right now).

The website for the ArrowGrid version I have available reads: "ArrowGrid_HPRC currently only runs on Terra (SLURM). It requires 65,000 'currency units' to submit a job that has 29 subreads.bam files. To estimate the number of required 'currency units', multiply the number of subreads.bam files by 2000 and add 7000 'currency units': (number_of_subreads.bam_files * 2000) + 7000"

I do not have that much currency at the moment (I have about 5000 currency units; even my single subreads.bam file would need (1 * 2000) + 7000 = 9000 units).

In addition to that, the website reads: "Running arrow on a single node may complete successfully for a bacterial genome with 50x coverage. Polishing large genomes with arrow can take weeks if run on a single compute node. It is recommended to use ArrowGrid to do genome polishing on large genomes starting with your unaligned PacBio subreads.bam files that you receive from the sequencing center."

I understand that for me, ArrowGrid is too expensive (out of reach). Also, ArrowGrid is recommended for large genomes (at least at my institution).

One more thing: the "29 subreads.bam files" referred to above means 29 different files, right? In my case I have a single unaligned .bam file ("m54092_180525_040741.subreads.bam") that contains about 1,000,000 individual sequences. I will need to sub-sample the file before running Arrow or ArrowGrid. I read that you recommend using the readNames.txt file in the xxx.seqStore to tell Arrow which sequences to use. Am I correct?

skoren commented 5 years ago

Yes, 29 means you have 29 cells, i.e. 29 bam files; that would probably be the size of a mammalian genome dataset. You can run on one file and it will probably not be terribly expensive, but again, you don't really need 2500x, so yes, you can use the readNames.txt file to restrict polishing to the reads you gave the assembly (your 200x subsample).
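
A sketch of how that subsetting might look (assumptions: a samtools version of 1.12 or newer, which supports the -N read-name filter; a readNames.txt layout with the read name in the second column, which may differ by Canu version; and illustrative output file names):

    # pull the read names Canu actually used into a plain list
    awk '{print $2}' moell.seqStore/readNames.txt > used_read_names.txt
    # keep only those reads from the original unaligned subreads.bam
    samtools view -b -N used_read_names.txt m54092_180525_040741.subreads.bam > moell.200x.subreads.bam

The subset bam (re-indexed with pbindex if needed) can then be used as the input to Arrow.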