marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Can Canu 2.2 assemble unknown viral genomes? #2079

Closed katievigil closed 2 years ago

katievigil commented 2 years ago

Hi, can I assemble genomes from metagenomic viral Nanopore sequences when I do not know the exact genome sizes?

Thanks!

brianwalenz commented 2 years ago

Sure can! The genome size isn't a critical parameter. It's used for summary statistics and for deciding how much of the raw data should be corrected and used for assembly. After correction, approximately genomeSize * corOutCoverage bases will be used for assembly.
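To make that product concrete, here is a rough sketch of the arithmetic using the values suggested in this thread (genomeSize=20m, corOutCoverage=10000). The comparison value of 40 for corOutCoverage is an assumption based on the default I recall from the Canu parameter reference, and the variable names are only illustrative.

```shell
# Approximate number of bases kept after correction: genomeSize * corOutCoverage.
genome_size=20000000        # genomeSize=20m
cor_out_coverage=10000      # corOutCoverage=10000, as suggested in this thread
default_coverage=40         # assumption: Canu's documented default for corOutCoverage

target_bases=$(( genome_size * cor_out_coverage ))
default_bases=$(( genome_size * default_coverage ))

echo "with corOutCoverage=${cor_out_coverage}: up to about ${target_bases} bases"
echo "with corOutCoverage=${default_coverage}: up to about ${default_bases} bases"
```

With a cap this large, the limit is effectively never reached for a typical sample, so all of the input data is corrected and used.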

For metagenomics, you typically want to use all the data, so setting genomeSize to the expected total genome size in the sample would work.

Something like:

genomeSize=20m
maxInputCoverage=10000 corOutCoverage=10000
corMhapSensitivity=high
corMinCoverage=0
redMemory=32 oeaMemory=32 batMemory=64

One reason for correcting and assembling all the data is that we correct only the longest reads, and it is quite possible that the virus will be present only in the shortest reads in the set.

This came from the bottom of https://canu.readthedocs.io/en/latest/faq.html#what-parameters-can-i-tweak.
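The options above could be put together into a single command line along these lines. This is only a sketch: the prefix, output directory, and read file name are hypothetical placeholders, and it assumes raw (uncorrected) Nanopore reads passed with -nanopore. The command is printed rather than run so it can be checked first.

```shell
# Hypothetical names: replace with your own prefix, output directory, and reads.
prefix="viral_meta"
outdir="canu_viral_meta"
reads="reads.fastq.gz"

# Options taken from the suggestion in this thread.
opts="genomeSize=20m"
opts="$opts maxInputCoverage=10000 corOutCoverage=10000"
opts="$opts corMhapSensitivity=high"
opts="$opts corMinCoverage=0"
opts="$opts redMemory=32 oeaMemory=32 batMemory=64"

# Assemble the full command line; -p is the file prefix, -d the output directory.
cmd="canu -p $prefix -d $outdir $opts -nanopore $reads"

# Print for review; run the printed line yourself once it looks right.
echo "$cmd"
```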

katievigil commented 2 years ago

Hi, I get an error when using your suggested settings with Canu v2.2: "failed to partition the reads".

Found perl:
   /lustre/project/taw/share/conda-envs/ONRviral/bin/perl
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
   This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi

Found java:
   /lustre/project/taw/share/conda-envs/ONRviral/bin/java
   openjdk version "10.0.2" 2018-07-17

Found canu:
   /lustre/project/taw/share/conda-envs/ONRviral/bin/canu
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
   canu 2.2

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = (unset),
    LC_ALL = (unset),
    LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
-- canu 2.2
--
-- CITATIONS
--
-- For 'standard' assemblies of PacBio or Nanopore reads:
--   Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
--   Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
--   Genome Res. 2017 May;27(5):722-736.
--   http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction and consensus use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '10.0.2' (from '/lustre/project/taw/share/conda-envs/ONRviral/bin/java') without -d64 support.
-- Detected gnuplot version '5.4 patchlevel 3   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 1 CPUs and 4096 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /cm/shared/apps/slurm/14.03.0/bin/sinfo.
-- Detected Slurm with task IDs up to 1000 allowed.
-- 
-- Slurm support detected.  Resources available:
--     33 hosts with  20 cores and   62 GB memory.
--     41 hosts with  20 cores and  249 GB memory.
--     47 hosts with  20 cores and  124 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     12.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8.000 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6.000 GB   10 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4.000 GB    5 CPUs  (overlap detection)
-- Grid:  utgovl     4.000 GB    5 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       32.000 GB    4 CPUs  (read error detection)
-- Grid:  oea       32.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64.000 GB    4 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    4 CPUs  (consensus)
--
-- Found Nanopore reads in 'barcode01canu.seqStore':
--   Libraries:
--     Nanopore:              1
--   Reads:
--     Corrected:             24928929
--     Corrected and Trimmed: 24928929
--
--
-- Generating assembly 'barcode01canu' in '/lustre/project/taw/ONR012021/canu/barcode01':
--   genomeSize:
--     2000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.3200 ( 32.00%)
--     obtOvlErrorRate 0.1200 ( 12.00%)
--     utgOvlErrorRate 0.1200 ( 12.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.3000 ( 30.00%)
--     obtErrorRate    0.1200 ( 12.00%)
--     utgErrorRate    0.1200 ( 12.00%)
--     cnsErrorRate    0.2000 ( 20.00%)
--
--   Stages to run:
--     assemble corrected and trimmed reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--
-- BEGIN ASSEMBLY
--
-- Loading read lengths.
-- Loading number of overlaps per read.
--
-- WARNING:
-- WARNING: Found no overlaps.  Disabling Overlap Error Adjustment.
-- WARNING:
--
-- Unitigger finished successfully.
-- Finished stage 'unitigCheck', reset canuIteration.
----------------------------------------
-- Starting command on Fri Feb  4 13:49:54 2022 with 283848.661 GB free disk space

    cd unitigging
    /lustre/project/taw/share/conda-envs/ONRviral/bin/utgcns \
      -S ../barcode01canu.seqStore \
      -T  ./barcode01canu.ctgStore 1 \
      -partition 0.8 1 0.1 \
    > ./barcode01canu.ctgStore/partitioning.log 2>&1
sh: line 1: ./barcode01canu.ctgStore/partitioning.log: No such file or directory

-- Finished on Fri Feb  4 13:49:54 2022 (furiously fast) with 283848.661 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 1.  (rc=256)
ERROR:

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to partition the reads.
ABORT:

skoren commented 2 years ago

Duplicate of #2080.