Closed: oneillkza closed this issue 5 years ago.
This is probably the same cause as #1080, but that one went away with a version update so we couldn't reproduce/fix it. If you can package the data (@brianwalenz can you give the commands to package), we can try to reproduce/fix it.
Thanks -- I'll check whether I'm allowed to share the data. It's non-human, so should be OK ethics-wise, but also unpublished.
@brianwalenz How much / which parts of the data do you need?
I need just the one contig that fails, contig number 678, of length 389218 bp (the last line in the table output).
From within the unitigging/ directory, generate a dump of the data for this contig with:
utgcns \
-S Marmota_flaviventris.ctgStore/partitionedReads.seqStore \
-T Marmota_flaviventris.ctgStore 1 1 \
-tig 678 \
-export contig678.export
test that it's still failing:
utgcns -e 0.2 -import contig678.export
then upload to the FTP listed in the FAQ.
Thanks -- yes, that reproduces the core dump.
I'm still confirming from the project leads whether it's OK to share the data. I'm sure it will be, considering how small it is, but I have to go through the formality.
OK -- unfortunately we're prohibited from running non-secure ftp over here, and sftp with username anonymous / password blank didn't work, but the file is pretty small, so I've just gzipped and attached it here.
Thanks! I removed the link from the comment. I'll try to get this fixed tomorrow.
Perfect! Thank you!
Well, not so perfect. There's little I can do to fix this. The first "read" in this contig has 10% N's, and unfortunately, about 18,000 of those are on the end of the read. These are preventing consensus from finding an alignment between the first and second read.
The read in question is Canu ID 462617, of length 292928 bp. Look in *.seqStore/readNames.txt for a mapping to the original read name. I suspect that your Chromium contigs are actually scaffolds!
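For example, something like this should pull out the mapping (readNames.txt normally has the Canu read ID in the first column; adjust the store path for your run):
grep -w 462617 *.seqStore/readNames.txt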
Looking at the contig size and coverage in the logging above, I'm concerned about the sanity of the result -- the coverage in contigs is all over the place. Some are ~3x, some are ~10x, some are up to 30x. Nothing seems to have really assembled.
How were the nanopore corrected reads generated? How much coverage are we talking about in these? I assume the Chromium contigs are about 1x coverage. There might not be enough coverage to get good corrected reads. Then to get any assembly in unique regions, allowed error in overlaps would need to be high, which would make a mess of the repeats, which would break the assembly.
With some effort, we can probably bash this assembly through to completion by just ignoring the bad contig. I'd have to figure out how to (easily) do that though.
I've managed to fix the problem, but you'll need to install the unreleased 'clone or download' version from github.
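If it helps, building the tip-of-tree version is roughly this (a sketch, assuming git, a C++ compiler, and GNU make are available; pick a -j value to suit your machine):
git clone https://github.com/marbl/canu.git
cd canu/src
make -j 8
# the binaries end up in canu/<OS>-<arch>/bin, e.g. canu/Linux-amd64/bin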
Thanks @brianwalenz
Yeah, on my cursory browse through that file I did wonder whether all the Ns might be the problem.
The nanopore corrected reads were from a prior run of Canu. I also ran them through porechop prior to that, although I have a hunch Canu does adapter detection itself so that may not have been necessary? Originally they were from a single MinION run. We're not sure of the size of the genome we're assembling, but going from an estimate of 2-3Gbp, at best we have ~4X coverage from the nanopore reads.
The 10X contigs were generated by Supernova from raw 10X reads with ~100X coverage. As such, they're likely to be of reasonably high quality. My understanding of the wider project is that it is primarily one of building genomes using 10X data, but there was a desire to mix in some MinION data for a handful of species to see if it could improve scaffolding.
For this run, I was trying to trick Canu into using the contigs (in FASTA form) as "nanopore corrected reads". Unfortunately, as I've since realised, it isn't possible to turn off the trimming stage. I haven't checked, but am fairly certain that large parts of the genome will only have coverage from the 10X data, which Canu would be seeing as being 1X (since it's contigs).
My one thought was to try duplicating all the 10X contigs (i.e. just cat the FASTA onto the end of itself) to try and trick Canu into thinking they have 2X coverage.
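Something like this is what I have in mind (file names are placeholders; the sed renames the headers of the second copy in case Canu objects to duplicate sequence names):
cat supernova_contigs.fasta > contigs_2x.fasta
sed '/^>/ s/$/_dup/' supernova_contigs.fasta >> contigs_2x.fasta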
My other thought was that this whole project is somewhat outside of what Canu was built for! The alternate plan, which we are also trying, is to take the trimmed and corrected reads from running Canu on the nanopore data alone, and pass these to LINKS along with the 10X draft assembly.
Anyway, thanks so much for your time and input! I think we should be able to continue from here. (But I'm also happy to brainstorm further.)
You can definitely turn off trimming, just specify canu -assemble
but you might not want to turn off trimming for the nanopore data. As for the 1x issue, there is a long history of making faux long reads from an assembly (see for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1544072/), so you could take that strategy and generate 10 kb reads from your assembly, but as you said, this isn't really what Canu was designed for.
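For example, an assemble-only run might look something like this (a sketch only: option names are the Canu 1.x style, file names and the genome size are placeholders, and I believe the corrected-read option can be repeated to mix inputs):
canu -assemble \
  -p marmot -d marmot_assembly \
  genomeSize=2.5g \
  -nanopore-corrected corrected_nanopore_reads.fasta.gz \
  -nanopore-corrected supernova_contigs.fasta
And if you wanted to try the faux-long-read route, a rough sketch of shredding the Supernova contigs into 10 kb pieces with awk (names and file paths are placeholders; the header is taken from the first word of each FASTA record):
awk -v W=10000 '
  /^>/ { if (seq != "") emit(); name = substr($1, 2); seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "") emit() }
  function emit(   i, n) {
    for (i = 1; i <= length(seq); i += W) {
      n++
      print ">" name "." n
      print substr(seq, i, W)
    }
  }
' supernova_contigs.fasta > faux_reads_10kb.fasta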
Given you only have 4x of nanopore data, you aren't going to get much corrected/trimmed data, so you could instead try using it to gap-fill the 10x assembly.
Thanks! That's useful to know. And yes, my plan had been to run Canu once on the nanopore reads, take those post-trimming/correction, and then feed them into a second assembly-only run alongside the 10X.
But yeah, we're almost certainly going to be taking the other approach of just using it to gapfill the 10x. (We'll run this with corrected reads and without and see if it makes much difference.)
Hi
I got a core dump at the consensus stage of unitigging. I'm running Canu from within Conda, on a single large (144-CPU / 1.5 TB RAM) machine running CentOS 6.7.
I'm also kinda doing a hacky thing trying to combine a low-coverage nanopore set with some contigs generated from high-coverage 10X Chromium data (but pretending that the contigs are corrected nanopore reads). I realise this is beyond the design parameters of Canu, but still thought you might want to try and find/catch the bug.
Only the first job in the consensus stage seems to have failed, so I've included the output of that log file.
Canu command:
Conda environment:
marmot_assembly_chromium/unitigging/5-consensus/consensus.000001.out