marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

tgStoreLoad throws error because ./5-consensus/utgcns.files provided with -L is an empty file #1396

Closed hwalinga closed 5 years ago

hwalinga commented 5 years ago

Hello, I have an error and am not sure how to prevent it, or whether this is an issue with the software itself.

Using canu version 1.8 on Ubuntu 16.04.6 LTS (no virtual machine).

$ canu --version
Canu snapshot v1.8 +261 changes (r9471 093a51f41fcb7029a81dfd7663ee5aa4e44e2110)

Running command:

canu -p S1 -d {S1-genomeSize=40k -executiveThreads=24 -executiveMemory=60G -overlapper=mhap -utgReAlign=True -nanopore-raw filtlong_50/S1.part.fastq

I get the following error:

ERROR: no input tig files supplied on command line or via -L option.

Full error output: https://termbin.com/eg7h

I checked the error log: the file passed with the option -L ./5-consensus/utgcns.files does exist but is empty. The usage description of this command (tgStoreLoad) states that the program should succeed even if the file provided with the -L option is empty. So I am not sure whether the empty utgcns.files is itself a problem that I need to fix, or whether tgStoreLoad should simply run anyway.

brianwalenz commented 5 years ago

Can you share this data? It assembled to one contig but no unitigs....which is a little bit odd.

I think that if you modify ./5-consensus/utgcns.files to include the full path to an empty file it will be happy. Something like:

cd 5-consensus
touch empty-file
echo $PWD/empty-file > utgcns.files

You can test this easily without restarting canu, by running the tgStoreLoad command by hand. Once it runs, check the S1.utgStore directory for presence of *002* files.
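The check for *002* files can be scripted. A minimal sketch, assuming the store directory is passed as an argument; the function name has_002_files is hypothetical and not part of Canu:

```shell
# Hypothetical helper: succeed if DIR contains at least one entry whose
# name matches *002*, fail otherwise.
has_002_files() {
    dir=$1
    for f in "$dir"/*002*; do
        # If nothing matches, the glob stays literal and -e fails.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Usage: only runs when the store directory is actually present.
if [ -d S1.utgStore ]; then
    if has_002_files S1.utgStore; then
        echo "found *002* files in S1.utgStore"
    else
        echo "no *002* files in S1.utgStore"
    fi
fi
```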

If Canu has other issues finishing, you can manually extract the contig with

tgStoreDump \
  -S S1.seqStore \
  -T unitigging/S1.ctgStore 2 \
  -consensus -contigs -fasta > S1.contigs.fasta

(and, actually, if you only care about the contig sequence, I think that last command will work right now)

hwalinga commented 5 years ago

If I add that empty-file to the utgcns.files I get the following error:

ABORT:   unknown consensus job name '/linuxhome/tmp/hielke/canu/S1/unitigging/5-consensus/empty-file'

If I run tgStoreDump manually I just get an empty file. It could also be that canu cannot make any assembly out of the data, but in that case I would at least expect it to tell me so.


I will ask my supervisor if we can share the data. I am afraid the answer is no, but we shall see.

brianwalenz commented 5 years ago

What files exist in unitigging/S1.ctgStore?

What do

tgStoreDump -S S1.seqStore -T unitigging/S1.ctgStore 1 -tigs
tgStoreDump -S S1.seqStore -T unitigging/S1.ctgStore 2 -tigs

report? (they should both report a few lines of tabular data)

hwalinga commented 5 years ago

tgStoreDump -S S1.seqStore -T unitigging/S1.ctgStore 1 -tigs

Reports 8305 lines like (https://termbin.com/d6t7):

8324    24118   layout  1.00    1.00    unassm  no      no      1

tgStoreDump -S S1.seqStore -T unitigging/S1.ctgStore 2 -tigs

Reports 8305 lines like (https://termbin.com/4uqy):

28      14763   ungapped        1.00    1.00    unassm  no      no      1

brianwalenz commented 5 years ago

Ah! That's also a big clue why there are no unitigs.

To get the sequences, remove '-contigs' from the tgStoreDump command. Canu thinks these are all 'unassembled' crud; it's tuned for larger "genomes".

Browse through the FAQ (https://canu.readthedocs.io/en/latest/faq.html) for some advice, in particular, the 'contigFilter'.

You might also want to experiment with down sampling your reads. https://canu.readthedocs.io/en/latest/parameter-reference.html#readsamplingcoverage
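To see how the tigs are classified, the -tigs listing can be tallied by its class column. A minimal sketch, assuming the nine-column layout of the lines quoted earlier in this thread, where the sixth field holds the class (unassm in those lines); the function name tally_tig_classes is hypothetical:

```shell
# Hypothetical helper: count tigs per class from `tgStoreDump ... -tigs`
# output read on stdin. Assumes the class is the 6th whitespace-separated
# field, as in the lines quoted in this thread, and skips '#' header lines.
tally_tig_classes() {
    awk '!/^#/ && NF >= 6 { n[$6]++ } END { for (c in n) print c, n[c] }' | sort
}

# Demonstration on the two lines quoted in this thread
printf '%s\n' \
  '8324    24118   layout  1.00    1.00    unassm  no      no      1' \
  '28      14763   ungapped        1.00    1.00    unassm  no      no      1' |
  tally_tig_classes
# prints: unassm 2
```

Against the real store, the same function could be fed with `tgStoreDump -S S1.seqStore -T unitigging/S1.ctgStore 2 -tigs | tally_tig_classes`.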

skoren commented 5 years ago

Any updates, did you get an assembly?

hwalinga commented 5 years ago

Hello @skoren

When using the following (so without `-contigs`) we did indeed get some assembled contigs.

tgStoreDump \
  -S S1.seqStore \
  -T unitigging/S1.ctgStore 2 \
  -consensus -fasta > S1.contigs.fasta

However, we weren't that happy with the assemblies from canu after all, and tried flye instead, which had far fewer problems with the assembly. That said, I have the impression that canu produces more output than flye.

Our problem is that we are trying to assemble chimeric viruses. So there are multiple different species in our samples, but they share a lot of common sequences. We hope to assemble the different species and then see what differences they have (what regions are chimeric). So far, it seems that this chimeric feature of our samples only confuses the assemblers.

If we compare canu and flye, we see that canu produces a lot of small contigs with low coverage, and some big ones, and flye only produces the large contigs.

skoren commented 5 years ago

The reason you have more output is that this dump, without the -contigs flag, also outputs single reads, which have low coverage and would not normally be considered "assembled". You can filter those out of the fasta file manually if you want.
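One way to do that filtering is to drop records built from too few reads. A minimal sketch, assuming each FASTA header carries a reads=N field giving the number of reads in the tig (check the headers in your own file first); the function name filter_by_reads and the threshold are hypothetical:

```shell
# Hypothetical helper: keep FASTA records whose header has reads=N with
# N >= MIN. Assumes headers look like '>name len=... reads=12 ...';
# records without a reads= field are dropped.
filter_by_reads() {
    min=$1
    awk -v min="$min" '
        /^>/ {
            keep = 0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^reads=[0-9]+$/) {
                    split($i, kv, "=")
                    keep = (kv[2] + 0 >= min + 0)
                }
        }
        keep { print }
    '
}

# Demonstration on made-up records (not real Canu output)
printf '%s\n' \
  '>tigA len=8 reads=1' 'ACGTACGT' \
  '>tigB len=8 reads=5' 'TTGGCCAA' |
  filter_by_reads 2
```

On the real file this would be run as `filter_by_reads 2 < S1.contigs.fasta > S1.filtered.fasta`, where the output file name is only an example.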

Given the information about your sample, this is really a metagenome, and in fact the worst case of one, since you have lots of very closely related strains with large SVs in a few places. The canu command you're using is not ideal for that: use the metagenomic parameters from the FAQ rather than the fast option you're using now, to get more accurate overlaps, which may help separate some of the strains. You will likely have to look at the assembly graphs (the unitigs.gfa in canu) to try to resolve some of these species.

skoren commented 5 years ago

Idle