marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

input file Format #1243

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hi, I'm new to sequencing, so I'm trying to run an assembly with Canu on a file I created, to see how assembly really works. I created a file with some sequences and added overlaps between them. -firstLine secondLine ____ ............ I saved it as .fasta and ran Canu on it, but I get some errors. The first is: "Found Nanopore uncorrected reads in the input files." Is my file structure wrong, or did I miss something?

overlap.txt

mastermindchr commented 5 years ago

I think FASTA files should have a header line starting with “>” followed by some title, for example:

>contig1
AAAAAACCCCCGGGACACAG

Just saving the file as .fasta doesn’t change anything; it’s still a simple text file.
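To make the header rule concrete, here is a minimal sketch of a FASTA check in Python. The function name `parse_fasta` is hypothetical and is not part of Canu; it only illustrates that every sequence must be preceded by a ">" header line.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) records.

    Raises ValueError if sequence data appears before any '>' header,
    which is what happens when a plain text file is merely renamed.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is None:
            raise ValueError("sequence found before any '>' header line")
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = ">contig1\nAAAAAACCCCCGGGACACAG\n"
print(parse_fasta(example))
```

A file without headers, such as a single line reading "contig1 AAAAAACCCCCGGGACACAG", makes this sketch raise a ValueError instead of returning records.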

ghost commented 5 years ago

I added headers but the result is the same. Or maybe I need a larger amount of data? Here is the overlapped file:

>id-8593-0
CGCAGCGACTGACTCGCGTCTATGTCGCACGAGTATAGTGTGTAGTCACGCAGCGTGACACTACTACGCACGTATGTCGTGATACTACTCGTCACGTATG
>id-8593-25
AGCACGACTACGTCTGTAGCATACACATACGTGACGAGTAGTATCACGACATACGTGCGTAGTAGTGTCACGCTGCGTGACTACACACTATACTCGTGCG
>id-8593-50
CAGCGTGACACTACTACGCACGTATGTCGTGATACTACTCGTCACGTATGTGTATGCTACAGACGTAGTCGTGCTCGACACACGACTGTATCAGTCGCAC

skoren commented 5 years ago

@belee2 The message you report is not an error; there should be more output than that. Can you post both your Canu command and the full output of Canu? All your reads are very short (100 bp), and by default Canu will discard any reads shorter than 1000 bp.

When I run canu with your data I get the following:

canu genomeSize=1k -p asm -d test -nanopore-raw overlap.txt
-- Canu snapshot v1.8
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https//doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '10.0.1' (from 'java') without -d64 support.
-- Detected gnuplot version '5.2 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 8 CPUs and 16 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl      8 GB    4 CPUs x   2 jobs    16 GB    8 CPUs  (k-mer counting)
-- Local: hap        8 GB    4 CPUs x   2 jobs    16 GB    8 CPUs  (read-to-haplotype assignment)
-- Local: cormhap    6 GB    8 CPUs x   1 job      6 GB    8 CPUs  (overlap detection with mhap)
-- Local: obtovl     4 GB    8 CPUs x   1 job      4 GB    8 CPUs  (overlap detection)
-- Local: utgovl     4 GB    8 CPUs x   1 job      4 GB    8 CPUs  (overlap detection)
-- Local: ovb        4 GB    1 CPU  x   4 jobs    16 GB    4 CPUs  (overlap store bucketizer)
-- Local: ovs        8 GB    1 CPU  x   2 jobs    16 GB    2 CPUs  (overlap store sorting)
-- Local: red        8 GB    4 CPUs x   2 jobs    16 GB    8 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x   4 jobs    16 GB    4 CPUs  (overlap error adjustment)
-- Local: bat       16 GB    4 CPUs x   1 job     16 GB    4 CPUs  (contig construction with bogart)
-- Local: gfa        8 GB    4 CPUs x   1 job      8 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'asm' in 'test'
--
-- Parameters:
--
--  genomeSize        1000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Thu Feb 14 11:33:16 2019 with 143.804 GB free disk space

    cd .
    canu/Darwin-amd64/bin/sqStoreCreate \
      -o ./asm.seqStore.BUILDING \
      -minlength 1000 \
      ./asm.seqStore.ssi \
    > ./asm.seqStore.BUILDING.err 2>&1

-- Finished on Thu Feb 14 11:33:16 2019 (fast as lightning) with 143.803 GB free disk space
----------------------------------------
--
-- WARNING:  Potential problems with your input reads were detected.
-- WARNING:
-- WARNING:  Please review the logging in files:
-- WARNING:    test/asm.seqStore.BUILDING.err
-- WARNING:    test/asm.seqStore.BUILDING/errorLog
-- 
-- Proceeding with assembly because stopOnReadQuality=false.
No objects to dump; reversed ranges make no sense: bgn=1 end=0??
--
-- ERROR:  Read coverage (0) is too low to be useful.
-- ERROR:
-- ERROR:  This could be caused by an incorrect genomeSize or poor quality reads that could not
-- ERROR:  be sufficiently corrected.
-- ERROR:
-- ERROR:  You can force Canu to continue by decreasing parameter stopOnLowCoverage=10,
-- ERROR:  however, the quality of corrected reads and/or contiguity of contigs will be poor.
-- 

ghost commented 5 years ago

@skoren these are not real reads, and I don't know how to segment my bases into different reads (concretely, what is a read?). Also, I can't find the result of the assembly. When I run Canu for the first time, I get this output:

belee@stud:~/DNA$ canu  -p test -d test-oxford  genomeSize=0.8m -nanopore-raw overlap.txt 
-- Canu snapshot v1.8 +106 changes (r9316 490b517a05233e031c60260d709a2442bc7995a2)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https//doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_191' (from 'java') with -d64 support.
-- Detected gnuplot version '5.0 patchlevel 3   ' (from 'gnuplot') and image format 'png'.
-- Detected 4 CPUs and 4 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl      4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (k-mer counting)
-- Local: hap        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (read-to-haplotype assignment)
-- Local: cormhap    4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (overlap detection with mhap)
-- Local: obtovl     4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (overlap detection)
-- Local: utgovl     4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (overlap detection)
-- Local: cor      --- GB    4 CPUs x --- jobs   --- GB  --- CPUs  (read correction)
-- Local: ovb        4 GB    1 CPU  x   1 job      4 GB    1 CPU   (overlap store bucketizer)
-- Local: ovs        4 GB    1 CPU  x   1 job      4 GB    1 CPU   (overlap store sorting)
-- Local: red        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x   1 job      4 GB    1 CPU   (overlap error adjustment)
-- Local: bat        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (contig construction with bogart)
-- Local: cns      --- GB    4 CPUs x --- jobs   --- GB  --- CPUs  (consensus)
-- Local: gfa        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'test' in '/DNA/test-oxford'
--
-- Parameters:
--
--  genomeSize        800000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Thu Feb 14 18:04:10 2019 with 2.478 GB free disk space

    cd .
    /home/belee/DNA/canu/Linux-amd64/bin/sqStoreCreate \
      -o ./test.seqStore.BUILDING \
      -minlength 1000 \
      ./test.seqStore.ssi \
    > ./test.seqStore.BUILDING.err 2>&1

-- Finished on Thu Feb 14 18:04:10 2019 (fast as lightning) with 2.476 GB free disk space  !!! WARNING !!!
----------------------------------------
--
-- WARNING:  Potential problems with your input reads were detected.
-- WARNING:
-- WARNING:  Please review the logging in files:
-- WARNING:    /home/belee/DNA/test-oxford/test.seqStore.BUILDING.err
-- WARNING:    /home/belee/DNA/test-oxford/test.seqStore.BUILDING/errorLog
-- 
-- Proceeding with assembly because stopOnReadQuality=false.
No objects to dump; reversed ranges make no sense: bgn=1 end=0??
--
-- ERROR:  Read coverage (0) is too low to be useful.
-- ERROR:
-- ERROR:  This could be caused by an incorrect genomeSize or poor quality reads that could not
-- ERROR:  be sufficiently corrected.
-- ERROR:
-- ERROR:  You can force Canu to continue by decreasing parameter stopOnLowCoverage=10,
-- ERROR:  however, the quality of corrected reads and/or contiguity of contigs will be poor.
-- 

ABORT:
ABORT: Canu snapshot v1.8 +106 changes (r9316 490b517a05233e031c60260d709a2442bc7995a2)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

The second time I get this:

belee@stud:~/DNA$ canu  -p test -d test-oxford  genomeSize=0.8m -nanopore-raw overlap.txt 
-- Canu snapshot v1.8 +106 changes (r9316 490b517a05233e031c60260d709a2442bc7995a2)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https//doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_191' (from 'java') with -d64 support.
-- Detected gnuplot version '5.0 patchlevel 3   ' (from 'gnuplot') and image format 'png'.
-- Detected 4 CPUs and 4 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl      4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (k-mer counting)
-- Local: hap        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (read-to-haplotype assignment)
-- Local: cormhap    4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (overlap detection with mhap)
-- Local: obtovl     4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (overlap detection)
-- Local: utgovl     4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (overlap detection)
-- Local: cor      --- GB    4 CPUs x --- jobs   --- GB  --- CPUs  (read correction)
-- Local: ovb        4 GB    1 CPU  x   1 job      4 GB    1 CPU   (overlap store bucketizer)
-- Local: ovs        4 GB    1 CPU  x   1 job      4 GB    1 CPU   (overlap store sorting)
-- Local: red        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x   1 job      4 GB    1 CPU   (overlap error adjustment)
-- Local: bat        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (contig construction with bogart)
-- Local: cns      --- GB    4 CPUs x --- jobs   --- GB  --- CPUs  (consensus)
-- Local: gfa        4 GB    4 CPUs x   1 job      4 GB    4 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'test' in '/home/belee/DNA/test-oxford'
--
-- Parameters:
--
--  genomeSize        800000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
--
-- WARNING:
-- WARNING:  No raw reads detected.  Cannot proceed; empty outputs generated.
-- WARNING:
--
-- Finished stage 'generateOutputs', reset canuIteration.
--
-- Assembly 'test' finished in '/home/belee/DNA/test-oxford'.
--
-- Summary saved in 'test.report'.
--
-- Sequences saved:
--   Contigs       -> 'test.contigs.fasta'
--   Unassembled   -> 'test.unassembled.fasta'
--   Unitigs       -> 'test.unitigs.fasta'
--
-- Read layouts saved:
--   Contigs       -> 'test.contigs.layout'.
--   Unitigs       -> 'test.unitigs.layout'.
--
-- Graphs saved:
--   Contigs       -> 'test.contigs.gfa'.
--   Unitigs       -> 'test.unitigs.gfa'.
--
--
-- BEGIN TRIMMING
--
--
-- WARNING:
-- WARNING:  No corrected reads detected.  Cannot proceed; empty outputs generated.
-- WARNING:
--
-- Finished stage 'generateOutputs', reset canuIteration.
--
-- Assembly 'test' finished in '/home/belee/DNA/test-oxford'.
--
-- Summary saved in 'test.report'.
--
-- Sequences saved:
--   Contigs       -> 'test.contigs.fasta'
--   Unassembled   -> 'test.unassembled.fasta'
--   Unitigs       -> 'test.unitigs.fasta'
--
-- Read layouts saved:
--   Contigs       -> 'test.contigs.layout'.
--   Unitigs       -> 'test.unitigs.layout'.
--
-- Graphs saved:
--   Contigs       -> 'test.contigs.gfa'.
--   Unitigs       -> 'test.unitigs.gfa'.
--
-- Bye.

skoren commented 5 years ago

Yeah, so it is exactly what I said: Canu is saying you ended up with 0 coverage for input. The detailed logs it points you to will say that all the reads were eliminated because they're too short.
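As a rough illustration of why the reported coverage is 0, the arithmetic can be sketched as below. This is not Canu's code: the function name is hypothetical, and treating coverage as retained bases divided by genomeSize (with reads of exactly the minimum length kept) is an assumption. The 1000 bp cutoff is the `-minlength 1000` value shown in the logs above.

```python
MIN_READ_LENGTH = 1000  # value passed as -minlength in the sqStoreCreate command above


def coverage_after_filter(read_lengths, genome_size, min_length=MIN_READ_LENGTH):
    """Approximate coverage: bases in retained reads divided by genome size.

    Assumption for illustration: reads shorter than min_length are dropped
    before coverage is computed.
    """
    kept = [n for n in read_lengths if n >= min_length]
    return sum(kept) / genome_size


# Three 100 bp reads, like the ones posted, against genomeSize=0.8m
print(coverage_after_filter([100, 100, 100], 800_000))  # 0.0
```

Every read falls under the cutoff, so nothing is left to count, which is consistent with the "Read coverage (0) is too low" message.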

If you want to test some example reads, the Canu documentation page provides some datasets. You can also start by reading the papers on assembly to get a better idea of what a read is and how assembly works. The ones Canu lists when it starts up are a good start. In general, there are lots of read simulators available which will generate read sequences from a "genome" you want to assemble. So you probably want to make a synthetic genome first, as a single FASTA entry, and then use a simulator (which mimics the sequencing process) to make the reads.
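A toy version of that workflow might look like the sketch below. It is not one of the real simulators: the function names, genome length, read length, and coverage are all made-up values for illustration, and the reads are error-free and forward-strand only, unlike real sequencing data. It only shows the idea of cutting many overlapping reads, each longer than 1000 bp, out of a single synthetic genome.

```python
import random


def random_genome(length, seed=0):
    """Build a synthetic genome as one random string over A, C, G, T."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def simulate_reads(genome, read_length, coverage, seed=1):
    """Sample fixed-length, error-free reads at uniformly random positions.

    The number of reads is chosen so that total read bases are roughly
    coverage * genome length.
    """
    rng = random.Random(seed)
    n_reads = (coverage * len(genome)) // read_length
    reads = []
    for i in range(n_reads):
        start = rng.randrange(0, len(genome) - read_length + 1)
        reads.append((f"read-{i}", genome[start:start + read_length]))
    return reads


def format_fasta(records):
    """Render (name, sequence) pairs as FASTA text with '>' header lines."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


genome = random_genome(50_000)
reads = simulate_reads(genome, read_length=5_000, coverage=30)
fasta_text = format_fasta(reads)
print(len(reads), "reads,", sum(len(s) for _, s in reads), "bases")
```

The resulting text can be saved to a file and used as input in place of the 100 bp sequences. A real simulator would additionally model read-length distributions, both strands, and sequencing errors.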