HTaekOppa commented 6 years ago

Dear SMARTdenovo,

Hope this email finds you well. While testing the program on a PacBio dataset (genome size 2.5 Gb) in a PBS Pro environment, I keep hitting the same error: "[SMDS849.dmo.ovl] Bus error (core dumped)". The bus error occurs while generating SMDS849.dmo.ovl. FYI, please see the output file below.

Looking forward to your reply!

Regards,

Taek

SMD_SRX849_Output.txt

ruanjue commented 6 years ago

Dear Taek,

Bus errors are really hard to debug. I hope we can find the cause without debugging. Please check the following:
1. Try -k 17. I saw the message 'average kmer depth = 6202', which is too high.
2. Check the total RAM on the machine running the job. 'Mem usage : 225804984kb' suggests it may be running out of RAM.
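To check the RAM point, it can help to put the reported memory usage and the node's RAM in the same unit. The sketch below assumes a Linux node (it reads /proc/meminfo) and assumes the "kb" in the log means 1024 bytes; kb_to_gib and mem_total_kb are illustrative helper names, not part of SMARTdenovo.

```shell
# Convert a kilobyte count to GiB (assumption: 1 kb = 1024 bytes).
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.1f\n", kb / 1048576 }'
}

# Total physical RAM of this node in kB, as reported by the Linux kernel.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Peak usage reported in the log, in GiB.
kb_to_gib 225804984

# RAM of the node running the job, in GiB (Linux only).
if [ -r /proc/meminfo ]; then
    kb_to_gib "$(mem_total_kb)"
fi
```

The reported usage works out to roughly 215 GiB, so the job needs a node (and a scheduler memory request) comfortably above that.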

If it is still a bus error, let's run it in GDB to see what's wrong.

Jue

HTaekOppa commented 6 years ago

Hi Jue,

Unfortunately, no luck. After testing with -k 17, I still get the same "bus error".

Regarding memory usage: although I requested plenty of RAM (more than 400 GB), the SMARTdenovo command keeps failing to generate SMD.dmo.ovl. I talked to our university HPC team, and they said: "Looks like his program is designed to work well on smaller assembly jobs, and he hasn't thought too hard about minimising the memory footprint. I don't think there is anything we can do to make it work other than what the developer says."

I think we need to fix the fundamental problems to cope with large genomes and highly repetitive sequence.

Any idea?

Regards,

Taek

ruanjue commented 6 years ago

Dear Taek,

SMARTdenovo has assembled several bigger genomes, so I think the bus error might be caused by a bug I haven't met.

To solve this problem, I need your help to debug it. Please recompile wtzmo by running make DEBUG=1 wtzmo, then run the newly compiled wtzmo on your data under gdb. When the error occurs, record the messages, then type backtrace and info locals in gdb. Your help will be very important for me to find the bug and kill it.

BTW, please also try https://github.com/ruanjue/wtdbg-1.2.8 . It is my preferred tool for assembling long noisy reads.
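The debugging steps above could be run non-interactively along these lines. This is only a sketch: run_in_gdb is a hypothetical wrapper name, the GDB variable is an assumption that lets you point at a specific gdb binary, and the wtzmo arguments in the usage comment are placeholders for whatever your failing command line was. The gdb options used (-batch, -ex, --args) are standard.

```shell
# Rebuild wtzmo with debug symbols first, from the SMARTdenovo source directory:
#   make DEBUG=1 wtzmo

# Run a command under gdb in batch mode; when it crashes, print the call
# stack and the local variables of the crashing frame, then exit.
run_in_gdb() {
    "${GDB:-gdb}" -batch \
        -ex run \
        -ex backtrace \
        -ex 'info locals' \
        --args "$@"
}

# Usage (placeholder arguments; reuse your failing wtzmo command line):
#   run_in_gdb ./wtzmo -t 64 -i reads.fa.gz -fo SMDS849.dmo.ovl -k 17 2>&1 | tee wtzmo.gdb.log
```

Batch mode avoids having to keep an interactive session open inside a PBS job, and the log file can simply be attached to the issue.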

Best, Jue

HTaekOppa commented 6 years ago

Hi Jue,

I am testing it now and will let you know soon. Regarding WTDBG, I have already tested it on 6 different plant/crop PacBio datasets. The assemblies are OK, but the overall accuracy and completeness (BUSCO) were a bit questionable.

Cheers,

Taek

HTaekOppa commented 6 years ago

Hi Jue,

I have tried as you suggested. Unfortunately, no luck. Please see the attached file. The gdb approach was not as easy as it looks. So, if that is OK, could you please test it on your end? You can retrieve the same PacBio data from https://www.ncbi.nlm.nih.gov/sra/?term=SRX1472849. Your cooperation will be really important for my ongoing data analysis.

Cheers,

Taek

SMD_Mz.txt

ruanjue commented 6 years ago

Hi,

By chance, we have already tested wtdbg-1.2.8 on the B73 PacBio dataset, and it performed well in both contiguity and accuracy.

$> wtdbg-1.2.8 -t 0 -i B73.fa -fo B73-dbg --tidy-reads 5000 -S 1 -p 21 -k 0 --aln-dovetail 512 --aln-noskip --edge-min 3 --rescue-low-cov-edges
#Result TOT 2147062016, CNT 6055, AVG 354594, MAX 8213760, N50 1242624, L50 522, N90 300544, L90 1843, MIN 5120
$> wtdbg-cns -t 0 -i B73-dbg.ctg.lay -fo ctg.fa
#Then call pilon
# skips commands for NGS bam files
$> java -Xmx512G -jar pilon.jar --genome ctg.fasta --frags ctg.sort.bam --fix bases --changes --output pilon
#Result TOT 2093705327, CNT 6055, AVG 345781.23, MAX 7937056, N50 1212401, L50 524, N90 290241, L90 1849, MIN 2543

Based on the polished assemblies, BUSCO gave C:95%[D:22%], F:1.1%, M:3.4%, n:956.

I really don't think the assembly is questionable, please try wtdbg-1.2.8 again.

We will test SMARTdenovo on B73 too, but it will certainly take a relatively long time and huge computing resources. If you think our B73 assembly using wtdbg-1.2.8 is OK, I will cancel the SMARTdenovo test.

Thanks, Jue

HTaekOppa commented 6 years ago

Hi Jue,

Thank you for the email.

Re SMARTdenovo, could you please test it from your end? I do want to see the outcome. Re WTDBG-1.2.8, please see the attached file (run.sh script). B73_run_sh.txt

I am happy to try it again on my end, but could you please check the parameters? It looks like a few parameters differ in your run. In my attempt (using the attached run.sh), the BUSCO result was C:17.0%[S:16.8%,D:0.2%],F:2.7%,M:80.3%,n:1440. The poor BUSCO was consistent across the other 6 PacBio datasets and assemblies. Another difference is that I used unpolished contigs for BUSCO. Do you think polished vs. unpolished contigs would make a big difference?

Cheers,

Taek

ruanjue commented 6 years ago

Ok, I will test SMARTdenovo and give you the result, keep in touch.

The contigs from wtdbg-cns are only roughly polished, as a starting point for further polishing. The error rate is still about 1%, but short reads can be mapped to them. So you need to polish them using Quiver and/or Pilon.

Best, Jue

HTaekOppa commented 6 years ago

Hi Jue,

I have tried again with your parameters. Please see the attached file “run.sh”, and please tell me if I have missed something.

After this, two FASTA files were generated: “dbg.ctg.lay.fa” and “dbg.map.fa”. You explained that the contigs from wtdbg-cns are only roughly polished, with an error rate still around 1%, that short reads can be mapped to them, and that they need polishing with Quiver and/or Pilon. I can see that further polishing with PacBio reads (Quiver and Arrow) and Illumina reads (Pilon) will reduce the error rate. However, what I am trying to find out is whether the contigs from “wtdbg-cns” (I assume this is “dbg.map.fa”) are good enough to assess overall accuracy and completeness (BUSCO). So I compared the final assemblies from five different de novo assemblers without any further polishing. While the overall stats (e.g. N50) of WTDBG are really good compared to the other assemblers (and it is easy to use, fast, and uses less memory), its completeness (BUSCO) was still a bit odd. Please see below:

dbg.ctg.lay.fa: C:18.4%[S:18.3%,D:0.1%],F:3.7%,M:77.9%,n:1440
dbg.map.fa: C:47.0%[S:46.9%,D:0.1%],F:8.1%,M:44.9%,n:1440
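To compare such summaries side by side, the complete (C) percentage can be pulled out of each line. A minimal sketch, assuming the one-line BUSCO summary notation used in this thread; busco_complete is an illustrative helper name, not part of BUSCO.

```shell
# Print the "C:" (complete) percentage from a one-line BUSCO summary.
busco_complete() {
    printf '%s\n' "$1" | sed -n 's/.*C:\([0-9.]*\)%.*/\1/p'
}

busco_complete 'C:18.4%[S:18.3%,D:0.1%],F:3.7%,M:77.9%,n:1440'   # dbg.ctg.lay.fa
busco_complete 'C:47.0%[S:46.9%,D:0.1%],F:8.1%,M:44.9%,n:1440'   # dbg.map.fa
```

Note that the n: values matter as well: results computed against different lineage sets (n:1440 vs. n:956) are not directly comparable.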

After polishing, I might see a result similar to yours (C:95%[D:22%], F:1.1%, M:3.4%, n:956). However, I am wondering why there is such a huge difference between the contigs from wtdbg-cns and the polished ones. Could you please tell me how to interpret this? Do you think the 1% error rate is the main factor here? Or is WTDBG designed so that only the final polished assembly should be used?

Many thanks all the time!

Taek

run.sh.txt

ruanjue commented 6 years ago

Both dbg.ctg.lay.fa and dbg.map.fa are OK for further polishing. The difference is that dbg.map.fa uses more long noisy reads during the quick polishing in wtdbg-cns.

I am glad that you got similar BUSCO evaluations. wtdbg-cns is designed to quickly polish the assemblies ahead of further polishing. It implements a DBG-based error correction module that uses only long noisy reads. It often leaves about 1% of small errors that it cannot correct, but the result is good enough to be polished by other tools. Because Quiver and Pilon can produce very good results, I have no plan to develop my own polisher again.

Best, Jue

HTaekOppa commented 6 years ago

Clarification: I have not got a good BUSCO result yet because I have not polished yet. Polishing: I can see from your script that you used Illumina data and Pilon for B73.

# skips commands for NGS bam files
$> java -Xmx512G -jar pilon.jar --genome ctg.fasta --frags ctg.sort.bam --fix bases --changes --output pilon
#Result TOT 2093705327, CNT 6055, AVG 345781.23, MAX 7937056, N50 1212401, L50 524, N90 290241, L90 1849, MIN 2543

However, Illumina data is not always available for every dataset. So, what if I want to polish the final contigs using only PacBio reads (fasta or fastq) with Quiver and Arrow? Have you tried Quiver and/or Arrow with fastq reads? Or can I use Racon on the final contigs to improve the consensus? At this stage, it looks like WTDBG requires a polishing step (with either PacBio or Illumina reads) before BUSCO completeness can be assessed.

Cheers,

Taek

ruanjue commented 6 years ago

Quiver/Arrow is also OK for polishing the assemblies, but you need to download the huge H5 files.

Once you have a good enough assembly, polishing it using NGS or TGS data is not a hard job. The de novo assembly of long noisy reads depends heavily on contig construction. So wtdbg focuses on building contigs quickly.

Best, Jue

biowackysci commented 5 years ago

Hello all, I am trying to run a SMARTdenovo assembly on data generated with PacBio and Nanopore MinION and PromethION. I got up to the .cns files and am not sure what is meant to happen from there. I was hoping a summary file with the number of contigs, the N50, and so on would be generated, but I could not find one. Can someone help me find it, please?

Thanks Saila

ruanjue commented 5 years ago

The .cns file contains the contig sequences.
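To get the kind of summary asked about (number of contigs, total size, N50), the statistics can be computed directly from the .cns FASTA. The awk sketch below is one way to do that under the assumption that the file is plain (uncompressed) FASTA; fasta_stats is a hypothetical helper name, and seq_n50.pl, used elsewhere in this thread, serves the same purpose.

```shell
# Print contig count, total length and N50 for a FASTA file.
fasta_stats() {
    # 1) one sequence length per line, 2) sort longest first,
    # 3) walk down the list until half of the total length is covered.
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; tot += $1 }
         END {
             half = tot / 2
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum >= half) { n50 = lens[i]; break }
             }
             printf "CNT %d TOT %.0f N50 %d\n", NR, tot, n50
         }'
}

# Usage (placeholder file name):
#   fasta_stats wtasm.dmo.cns
```

N50 here is the length of the contig at which the running total, counted from the longest contig down, first reaches half of the assembly size.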

Best, Jue

HTaekOppa commented 5 years ago

Hi Jue,

I am not sure whether you had a chance to test SMARTdenovo for me (please see the comment on June 5). In the meantime, I have tried again without success. It might be due to a walltime issue. I am happy to resubmit the job, but I am wondering whether you have had any luck.

Cheers,

Taek

ruanjue commented 5 years ago

Sorry for the delay.

My colleague Shigang WU tested SMARTdenovo on the B73 PacBio dataset. Here are the commands.

fq2fa.pl all.fastq > all.fasta
wtpre -J 5000 all.fasta | gzip -c -1 > wtasm.fa.gz
wtzmo -t 64 -i wtasm.fa.gz -fo wtasm.dmo.ovl -k 16 -z 10 -Z 16 -U -1 -m 0.1 -A 1000
wtclp -i wtasm.dmo.ovl -fo wtasm.dmo.obt -d 3 -k 300 -m 0.1 -FT
wtlay -i wtasm.fa.gz -b wtasm.dmo.obt -j wtasm.dmo.ovl -fo wtasm.dmo.lay -w 300 -s 200 -m 0.1 -r 0.95 -c 1
wtcns -t 64 wtasm.dmo.lay > wtasm.dmo.cns 2> wtasm.dmo.cns.log
seq_n50.pl wtasm.dmo.cns

He got an assembly with a total size of 2.12G and a contig N50 of 150k. That is not as good as wtdbg-1.2.8, so he deleted the assembly.

Hope it helps.

Best, Jue

biowackysci commented 5 years ago

Hello all, I was wondering whether SMARTdenovo does hybrid assemblies of PacBio and Nanopore data, or whether it is just either PacBio or Nanopore. Can someone please help me with this? Thanks heaps in advance. Saila

ruanjue commented 5 years ago

SMARTdenovo works well on both kinds of dataset with default settings, but I am not sure about combined data.

ruanjue / smartdenovo

[SMDS849.dmo.ovl] Bus error (core dumped) #16
