marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
646 stars 178 forks source link

failed to read estimated mer threshold #486

Closed jordur closed 7 years ago

jordur commented 7 years ago

Dear, I was trying to assembly 274 reads from a nanopore minion run with canu using the following command line:

canu -p Canu_assembly -d Canu_assembly corMinCoverage=0 errorRate=0.035 contigFilter="2 1000 1.0 1.0 2" genomeSize=2m -nanopore-raw all_unmapped.fastq.gz

We are not sure about the sample size since it is from a non-model organism. However it comes from 8 BAC genomic clones, 300Kb length each so we estimated a 2Mb genome fragment size. Finally I got the following log file:

canu failed with 'failed to read estimated mer threshold from 'unitigging/0-mercounts/Canu_assembly_Bjararaca.ms22.estMerThresh.out''.

What does actually the message mean? I guess a low overlapping between reads but it would be great to test different options on the command line. Could you help me with this issue' Regards

brianwalenz commented 7 years ago

Counting occurrences of kmers failed. Are there any errors in unitigging/0-mercounts/meryl.out (or similarly named file)? What does 'ls -l unitigging/0-mercounts/' show?

Does the read length histogram reported in the canu screen output look reasonable?

jordur commented 7 years ago

Hi @brianwalenz. Here it is:

$ ls -l unitigging/0-mercounts/
total 3104
-rw-r--r-- 1 jdurban ibv    1097  9 mai 18:10 Canu_assembly_Bjararaca.ms22.estMerThresh.err
-rw-r--r-- 1 jdurban ibv       0  9 mai 18:10 Canu_assembly_Bjararaca.ms22.estMerThresh.out.WORKING
-rw-r--r-- 1 jdurban ibv     455  9 mai 18:10 Canu_assembly_Bjararaca.ms22.histogram
-rw-r--r-- 1 jdurban ibv      97  9 mai 18:10 Canu_assembly_Bjararaca.ms22.histogram.info
-rw-r--r-- 1 jdurban ibv 2096936  9 mai 18:10 Canu_assembly_Bjararaca.ms22.mcdat
-rw-r--r-- 1 jdurban ibv 1048608  9 mai 18:10 Canu_assembly_Bjararaca.ms22.mcidx
-rw-r--r-- 1 jdurban ibv    2932  9 mai 18:10 meryl.124536_1.out
-rw-r--r-- 1 jdurban ibv      27  9 mai 18:09 meryl.jobSubmit.out
-rwxr-xr-x 1 jdurban ibv     186  9 mai 18:09 meryl.jobSubmit.sh
-rwxr-xr-x 1 jdurban ibv    1854  9 mai 18:09 meryl.sh
-rw-r--r-- 1 jdurban ibv       0  9 mai 18:11 meryl.success

And yes!! There is a segmentation fault:

$tail unitigging/0-mercounts/meryl.124536_1.out
/var/spool/slurmd/job124536/slurm_script: line 63: 10985 **Violació de segment **  $bin/estimate-mer-threshold -h ./Canu_assembly_Bjararaca.ms22.histogram -c 0.2384795 > ./Canu_assembly_Bjararaca.ms22.estMerThresh.out.WORKING 2> ./Canu_assembly_Bjararaca.ms22.estMerThresh.err

what means segmentation fault in catalan ;) So, it seems to be a matter of resources? Should I increase the usage memory? Thanks

brianwalenz commented 7 years ago

Can you upload the *.histogram file? The mcdat and mcidx files would be better, but, WARNING, these contain all the kmers for your assembly (they're small enough to email to me though).

jordur commented 7 years ago

Hi @brianwalenz Attached you will find these files. Thank you for your time! files.zip

brianwalenz commented 7 years ago

Thanks. I can't reproduce the crash. What version of canu do you have?

However, in the segmentation fault error, the command line is reported. It's got a coverage estimate of 0.24. Looking at the histogram, most of the kmers are single copy. I don't think you have enough coverage in input reads for any assembly. To get 20x coverage of 2Mbp with 274 reads - just barely enough for assembly - you'd need an average read length of 150k.

jordur commented 7 years ago

I am using Canu v1.4 (+126 commits). Ok, that's exactly what I suspected. I will set a lower coverage in order obtain some results. In fact the sequencing process was wrong since I introduced a bubble on the sensor array but some reads were produced and I though that some assembly could be performed. In fact I don't think I need a 20X coverage since it is a new genomic region from a non-model organism and a lower coverage could be also valuable. Thank you for your help!

brianwalenz commented 7 years ago

You might as well update to the latest code. I can't point to any specific improvements that would help, but you are a couple months out of date. I don't think it'll help, there just isn't enough data to generate corrected reads, so only repeats are getting output.

jordur commented 7 years ago

Ok, thank you so much! Finally, how you got the estimate of 0.24 coverage? Was it from the histogram file?

brianwalenz commented 7 years ago

It's the -c parameter to estimate-mer-coverage in:

$tail unitigging/0-mercounts/meryl.124536_1.out
/var/spool/slurmd/job124536/slurm_script: line 63: 10985 **Violació de segment **  $bin/estimate-mer-threshold -h ./Canu_assembly_Bjararaca.ms22.histogram -c 0.2384795 > ./Canu_assembly_Bjararaca.ms22.estMerThresh.out.WORKING 2> ./Canu_assembly_Bjararaca.ms22.estMerThresh.err

This is just bases in reads (after correction and trimming) divided by your genome size estimate (the genomeSize parameter). I don't think it's actually used by this program, but still passed in.

The start of the canu logging (to the screen) has a histogram of the read lengths. This also reports coverage.

jordur commented 7 years ago

Great! Thank you so much!