marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

CANU is failing - bogart issue #1323

Closed anja999 closed 5 years ago

anja999 commented 5 years ago

Dear all, I am trying to run Canu on my data and it keeps failing before the assembly step (bogart failed), so the only output I have is the corrected and trimmed data. I have direct RNA data and am taking a metagenomics approach to see what is in a sample of plant material infected with viruses/viroids. The transcripts and genomes should not be longer than 20k. The command I ran was:

/canu -d run1 -p run1 genomeSize=20k -nanopore-raw /DATA/run1.fa overlapper=mhap utgReAlign=true corOutCoverage=10000 corMhapSensitivity=high minReadLength=100 minOverlapLength=100 corMinCoverage=0 minMemory=100 maxMemory=200 maxThreads=24

I also tried some other options that I found by reading the various "bogart failed" issues, but without success. Also, the assembly takes 10 days (we have a server with 36 threads and 250 GB of memory). Is this normal?

Many thanks for any help!

skoren commented 5 years ago

Can you provide more details on how it is failing and post the unitigger.err log? I would guess it is similar to #1281: because the genome size is so low, bogart tries to load a very large number of overlaps and runs out of memory (note that the metagenomic parameters in the FAQ explicitly increase bogart memory to avoid this). If it is the same issue, you can edit unitigger.sh in the same way (increase the genome size, increase the -M option) and resume canu.

As for total runtime, it depends on how much data you have. You're also dropping the minimum overlap and read lengths far below the defaults, which will add to runtime.

anja999 commented 5 years ago

Many thanks. I will try changing genomeSize and minOverlapLength. The thing is that the genome sizes really are around 10k. If I increase the value, could the assembly come out wrong? As you assumed, unitigger.err says there is not enough memory. Do you have any idea how much memory would be enough? I could use 500.

unitigger.err

==> PARAMETERS.

Resources:
  Memory                16 GB
  Compute Threads       4 (command line)

Lengths:
  Minimum read          0 bases
  Minimum overlap       500 bases

Overlap Error Rates:
  Graph                 0.120 (12.000%)
  Max                   0.120 (12.000%)

Deviations:
  Graph                 6.000
  Bubble                6.000
  Repeat                3.000

Edge Confusion:
  Absolute              2100
  Percent               200.0000

Unitig Construction:
  Minimum intersection  500 bases
  Maxiumum placements   2 positions

Debugging Enabled:
  (none)

==> LOADING AND FILTERING OVERLAPS.

ReadInfo()-- Using 140318 reads, no minimum read length used.

OverlapCache()-- limited to 16384MB memory (user supplied).

OverlapCache()-- 1MB for read data.
OverlapCache()-- 5MB for best edges.
OverlapCache()-- 13MB for tigs.
OverlapCache()-- 3MB for tigs - read layouts.
OverlapCache()-- 5MB for tigs - error profiles.
OverlapCache()-- 4096MB for tigs - error profile overlaps.
OverlapCache()-- 0MB for other processes.
OverlapCache()-- ---------
OverlapCache()-- 4128MB for data structures (sum of above).
OverlapCache()-- ---------
OverlapCache()-- 2MB for overlap store structure.
OverlapCache()-- 12253MB for overlap data.
OverlapCache()-- ---------
OverlapCache()-- 16384MB allowed.
OverlapCache()--
OverlapCache()-- Retain at least 10012 overlaps/read, based on 5006.05x coverage.
OverlapCache()-- Initial guess at 5722 overlaps/read.
OverlapCache()--
OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.
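
The figures in this log allow a rough, back-of-the-envelope estimate of how much memory the minimum overlap count would need. The sketch below is only an illustration: it assumes the overlap budget scales linearly with overlaps per read, and the per-overlap cost is inferred from the log (12253MB budgeted for the initial guess of 5722 overlaps/read), not taken from the Canu source. The function names are hypothetical.

```python
# Rough memory estimate from the unitigger.err figures above.
# Assumption: bogart's overlap budget scales linearly with overlaps per read;
# the per-overlap cost is inferred from the log, not from Canu's source code.

READS = 140_318                   # "Using 140318 reads"
DATA_STRUCTURES_MB = 4_128        # "4128MB for data structures (sum of above)"
STORE_MB = 2                      # "2MB for overlap store structure"
OVERLAP_DATA_MB = 12_253          # "12253MB for overlap data"
GUESS_OVERLAPS_PER_READ = 5_722   # "Initial guess at 5722 overlaps/read"
MIN_OVERLAPS_PER_READ = 10_012    # "Retain at least 10012 overlaps/read"

MB = 1024 * 1024


def bytes_per_overlap() -> float:
    """Per-overlap cost implied by the log's budget and initial guess."""
    return OVERLAP_DATA_MB * MB / (GUESS_OVERLAPS_PER_READ * READS)


def estimated_total_mb(overlaps_per_read: int) -> float:
    """Fixed structures plus overlap data for the given overlaps per read."""
    overlap_mb = overlaps_per_read * READS * bytes_per_overlap() / MB
    return DATA_STRUCTURES_MB + STORE_MB + overlap_mb


if __name__ == "__main__":
    print(f"implied cost: {bytes_per_overlap():.1f} bytes per overlap")
    needed = estimated_total_mb(MIN_OVERLAPS_PER_READ)
    print(f"estimated need: {needed / 1024:.1f} GB (limit was 16 GB)")
```

Under these assumptions the minimum works out to roughly 25 GB, above the 16 GB limit in the log, which is consistent with the "increase -M" message. Treat it as a lower bound for orientation only, since the log says "at least" 10012 overlaps per read.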

skoren commented 5 years ago

The genome size won't make the assembly wrong; it's only used to compute some statistics and to estimate the coverage in your dataset.

You can of course increase the memory; I expect 500 will be enough. However, increasing the genome size and increasing the memory should give much the same result. You don't really need 5000 overlaps per read to assemble the amplicon. I would increase the genome size to 1mb and see if it runs in the current memory.
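
To see why a larger genome size lowers the overlap requirement, the arithmetic can be sketched as below. It assumes coverage is estimated as total read bases divided by genome size, and that the minimum overlaps per read is about twice the coverage. The second relation is only inferred from the unitigger.err log (10012 overlaps/read at 5006.05x coverage), not from the Canu documentation, and the function names are hypothetical.

```python
# Assumption-laden sketch: how the genomeSize setting changes the estimated
# coverage and, with it, the minimum overlaps per read reported by bogart.

LOG_COVERAGE = 5006.05      # "based on 5006.05x coverage" in unitigger.err
LOG_GENOME_SIZE = 20_000    # genomeSize=20k from the original command


def total_bases(coverage: float, genome_size: int) -> float:
    """Total read bases implied by a coverage estimate (coverage = bases / size)."""
    return coverage * genome_size


def coverage_for(genome_size: int) -> float:
    """Coverage the same reads would appear to have at another genome size."""
    return total_bases(LOG_COVERAGE, LOG_GENOME_SIZE) / genome_size


def min_overlaps_per_read(coverage: float) -> int:
    """Assumed rule (inferred from the log): about 2 overlaps per unit of coverage."""
    return round(2 * coverage)


if __name__ == "__main__":
    for size in (20_000, 1_000_000):
        cov = coverage_for(size)
        print(f"genomeSize={size}: ~{cov:.0f}x, "
              f">= {min_overlaps_per_read(cov)} overlaps/read")
```

With a genome size of 1 Mb the same reads look like roughly 100x coverage, so under these assumptions the minimum drops to about 200 overlaps per read, far below the 5722 per read that already fit in the 16 GB budget.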

anja999 commented 5 years ago

Dear Sergey!

I tried different things and Canu was still failing in the last stages. Could I share some data with you so that you could perhaps see what is wrong with it?

I would be really happy if you could help me. Trying different things is very time-consuming, since it takes days before the run fails again.

Sorry for bothering you and many thanks!

Anja


Anja Pecman
Mlada raziskovalka / PhD Student
Nacionalni inštitut za biologijo (http://www.nib.si/) / National Institute of Biology (http://www.nib.si/eng/)
Oddelek za biotehnologijo in sistemsko biologijo (http://www.nib.si/oddelki/oddelek-za-biotehnologijo-in-sistemsko-biologijo) / Department of Biotechnology and Systems Biology (http://www.nib.si/eng/index.php/departments/department-of-biotechnology-and-systems-biology)
Večna pot 111, SI-1000 Ljubljana, Slovenia

Phone: + 386 (0)59 232 823
Fax: + 386 (0)1 257 38 47
E-mail: anja.pecman@nib.si


(Sent by email in reply to Sergey Koren's comment of Thursday, April 11, 2019, 3:29 PM.)


skoren commented 5 years ago

Sure, you can upload your full run directory, or just the reads, following the instructions in the FAQ. It shouldn't take days to re-run the bogart step, and that is all you need to re-run to test a change in memory or genome size.

skoren commented 5 years ago

Did you ever upload any data? I don't see anything on the FTP site.

anja999 commented 5 years ago

Dear Sergey,

I have just uploaded the file testAP.fastq. It took so long because I was basecalling the data again.

I would like to take a metagenomics approach, and when running Canu I tried these options:

overlapper=mhap utgReAlign=true corOutCoverage=10000 corMhapSensitivity=high minReadLength=100 minOverlapLength=100 corMinCoverage=0 (found in canu manual)

or these:

obtOverlapper=mhap obtReAlign=raw utgOverlapper=mhap utgReAlign=raw corOutCoverage=10000 corMhapSensitivity=high minReadLength=100 minOverlapLength=100 corMinCoverage=0 (found on github).

Do you think that this could work?

Normally bogart fails.

Many thanks for any help!

Best regards,

Anja



(Sent by email in reply to Sergey Koren's comment of Thursday, June 13, 2019, 4:22 PM.)


skoren commented 5 years ago

The docs are going to be more up to date, since some GitHub issues refer to canu versions that no longer exist. Your second command seems reasonable, though the metagenomic options also increase the bogart memory (batMemory), which you've omitted (obtOverlapper=mhap obtReAlign=raw utgOverlapper=mhap utgReAlign=raw corOutCoverage=10000 corMhapSensitivity=high minReadLength=100 minOverlapLength=100 corMinCoverage=0 'redMemory=32' 'oeaMemory=32' 'batMemory=200' ). Using the options above with the genome size set to 20k, I was able to assemble your data without error. I do see a few contigs in the 17-20kb range.

However, I am not sure what assembling direct RNA means. Aren't these already full-length transcripts, so is there anything to assemble? Are you trying to see if you can assemble RNA viruses?
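
For reference, the full option set quoted in this reply can be assembled into a single command line. The sketch below only builds and prints the command and does not run anything. The run directory, prefix and reads path are taken from the original post (reading its `-nanopore-raw/DATA/run1.fa` as the flag followed by the path `/DATA/run1.fa`). The assumption that `canu` is on the PATH and the helper name `build_command` are illustrative, not part of Canu.

```python
# Sketch: assemble the canu command line from options quoted in this thread.
# Nothing is executed; the command is only built and printed for review.
import shlex

# Values from the original post; adjust them for your own run.
CANU = "canu"            # assumption: the canu executable is on PATH
RUN_DIR = "run1"
PREFIX = "run1"
READS = "/DATA/run1.fa"

# Options as given in the reply above, including the memory settings.
METAGENOMIC_OPTIONS = [
    "obtOverlapper=mhap", "obtReAlign=raw",
    "utgOverlapper=mhap", "utgReAlign=raw",
    "corOutCoverage=10000", "corMhapSensitivity=high",
    "minReadLength=100", "minOverlapLength=100", "corMinCoverage=0",
    "redMemory=32", "oeaMemory=32", "batMemory=200",
]


def build_command(genome_size: str = "20k") -> list[str]:
    """Return the argv list for a canu run with the options above."""
    return [CANU, "-d", RUN_DIR, "-p", PREFIX, f"genomeSize={genome_size}",
            *METAGENOMIC_OPTIONS, "-nanopore-raw", READS]


if __name__ == "__main__":
    print(shlex.join(build_command()))
```

Building the argument list in one place makes it harder to drop the three memory options by accident, which is what happened with the second command tried earlier in the thread.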