marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
655 stars 179 forks source link

Bogart failed new #1141

Closed pomidorku closed 5 years ago

pomidorku commented 5 years ago

Hello,

I am running Canu in a High Performance Computing cluster. I used Canu for a bacterial genome assembly (draft) successfully. I am trying to assemble another genome, but the job has failed twice. The message I got was similar for both times the assembly failed: "Bogart failed, tried 2 times, giving up."

I used the following commands:

module load Westmere
module load Canu/1.7-intel-2017A-Perl-5.24.0

canu useGrid=false -p moell -d /scratch/user/user/user2/Moellerella_4_E06/moell_out_out genomeSize=3.3m -pacbio-raw /scratch/user/user/user2/Moellerella_4_E06/m54092_180525_040741.subreadsFQ.fastq

The last lines of the failed job output are:

*****************************************************************************************************
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Tue Nov  6 13:26:10 2018 with 470454.578 GB free disk space (11 processes; 40 concurrently)

    cd unitigging/3-overlapErrorAdjustment
    ./oea.sh 1 > ./oea.000001.out 2>&1
    ./oea.sh 2 > ./oea.000002.out 2>&1
    ./oea.sh 3 > ./oea.000003.out 2>&1
    ./oea.sh 4 > ./oea.000004.out 2>&1
    ./oea.sh 5 > ./oea.000005.out 2>&1
    ./oea.sh 6 > ./oea.000006.out 2>&1
    ./oea.sh 7 > ./oea.000007.out 2>&1
    ./oea.sh 8 > ./oea.000008.out 2>&1
    ./oea.sh 9 > ./oea.000009.out 2>&1
    ./oea.sh 10 > ./oea.000010.out 2>&1
    ./oea.sh 11 > ./oea.000011.out 2>&1

-- Finished on Tue Nov  6 14:42:28 2018 (4578 seconds) with 470179.875 GB free disk space
----------------------------------------
-- Found 11 overlap error adjustment output files.
----------------------------------------
-- Starting command on Tue Nov  6 14:42:29 2018 with 470179.859 GB free disk space

    cd unitigging/3-overlapErrorAdjustment
    /general/software/x86_64/easybuild/Westmere/software/Canu/1.7-intel-2017A-Perl-5.24.0/Linux-amd64/bin/ovStoreBuild \
      -G ../moell.gkpStore \
      -O ../moell.ovlStore \
      -evalues \
      -L ./oea.files \
    > ./oea.apply.err 2>&1

-- Finished on Tue Nov  6 14:42:33 2018 (4 seconds) with 470178.921 GB free disk space
----------------------------------------
-- No report available.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'bat' concurrent execution on Tue Nov  6 14:42:33 2018 with 470178.875 GB free disk space (1 processes; 1 concurrently)

    cd unitigging/4-unitigger
    ./unitigger.sh 1 > ./unitigger.000001.out 2>&1

-- Finished on Tue Nov  6 14:42:33 2018 (lickety-split) with 470178.875 GB free disk space
----------------------------------------
--
-- Bogart failed, retry
--
--
-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'bat' concurrent execution on Tue Nov  6 14:42:33 2018 with 470178.859 GB free disk space (1 processes; 1 concurrently)

    cd unitigging/4-unitigger
    ./unitigger.sh 1 > ./unitigger.000001.out 2>&1

-- Finished on Tue Nov  6 14:42:33 2018 (lickety-split) with 470178.859 GB free disk space
----------------------------------------
--
-- Bogart failed, tried 2 times, giving up.
--

ABORT:
ABORT: Canu 1.7
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

The output "unitigging" folder does not contain a "4-unitigger" folder (which . It only contains two folders:

0-mercounts 1-overlapper and a link to "moell.gkpStore"

I can see in the lines previous to the error messages that Canu is trying to access a folder that does not exists: "cd unitigging/4-unitigger"

Please, can someone help me? Let me know if you need more information.

Regards,

Marcel

pomidorku commented 5 years ago

Thank you very much. I will start figuring out how to run arrow. Regards, I. Vilchez

pomidorku commented 5 years ago

Dr. Skoren,

Two draft assemblies for bacterial genomes generated by Canu show lengths that are very different from the reference sequences published in GenBank.

Ignat ~2.45 Mb (GenBank RefSeq) ~240x (Final coverage) 3729922 bp (length of draft assembly) Moell ~3.3 Mb (GenBank RefSeq) ~38x (Final coverage) 3129801 bp (length of draft assembly)

Some of my colleagues experienced in assembling bacterial genomes think that polishing and purging haplotigs may not help much to bring the sizes of the assemblies close to that of the RefSeq.

Their advise is that I run canu assemblies again for the genomes. It seems that I should use more stringent conditions for the overlaps. They suggested to modify the default correctedErrorRate from 0.045 to a higher value.

Do you think that modification may help to get better assemblies? What is you opinion on this matter?

Regards,

pomidorku commented 5 years ago

One more thing. Just to make sure I am doing the things the right way. When I received the output files from the sequencing facility. There were several files for each genome: For example:

m54092_180525_040741.transferdone m54092_180525_040741.baz2bam_1.log m54092_180525_040741.scraps.bam.pbi m54092_180525_040741.subreads.bam.pbi m54092_180525_040741.scraps.bam m54092_180525_040741.subreadset.xml m54092_180525_040741.subreads.bam m54092_180525_040741.sts.xml m54092_180525_040741.adapters.fasta hidden_m54092_180525_040741.run.metadata.xml hidden_m54092_180525_040741.metadata.xml tmp-file-26006104-1736-4b07-9eb5-78524643d771.txt tmp-file-2602dbb1-0f13-41f7-a22b-2318ceddddb6.txt

I am using the files XXXX_XXXXXXX_XXXXXX.subreads.bam to assemble the genomes. That is because I asked to the people that provided me the files and I was told that the adapters have already been removed form the subreads.bam files.

Regards,

skoren commented 5 years ago

The subreads.bam is what you want to use yes.

Circular chromosome/molecules are always slightly longer from the assembly because they have self-overlap. Canu will tell you in the header if the contig looks circular and the GFA will tell you how to trim it. You could also follow this pacbio guide https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Circularizing-and-trimming or use circlator.

Your Moell assembly seems in line with the reference, 3.3 vs 3.1 is not terribly different. The other one, if it is a single contig would indicate your refseq genome is probably not a good representative. If its more than one contig, you could have contaminant in the sample. You should try blasting that assembly to see what it matches.

pomidorku commented 5 years ago

Hello,

I am here again, after a long detour.

I have two draft assemblies for two genomes:

1-what I was told it is Moellerella (2 contigs, the largest one 3082147bp) 2-what I was told it is Ignatzschieniria (13 contigs, the largest one 3464258bp)

Below is what happened wit the first genome after I run a blast search (my draft genome against the NCBI genomes).

It turns out that the contigs of the genome I was told it was Moellerella, actually matched a different bacteria, Kurthia: Contig1 (3082147bp) (7229 reads) (covStat=2586.27) 91% query coverage 99% identity to Kurthia Contig2 (47654bp) (50 reads) (covStat=45.39) 20% query coverage 96% identity to Kurthia

I mapped the pacbio raw sequences (.bam files) using minimap2 against NCBI downloaded reference sequences for Moellerella and Kurthia, and 87.09% (1044443) out 1199201 reads of the presumed Moellerella aligned to Kurthia, while only the 1.55% did so to Moellerella.

I did a blast search of the presumed Moellerella draft sequence to the 16s rRNA databasse of NCBI. Only the largest contig contained a match to 16s rRNAs. The best match was Kurthia and the tree produced by blast grouped the presumed Moellerella with Kurthia.

So, I am quite confident that the draft genome sequence of what I was told it is Moellerella, it is actually Kurthia.

However:

1) only 20% of the shorter contig matches Kurthia. Is that something I have to be concerned about? I ask this because this contig may contain DNA from other bacteria. Minimap2 shows that the reads in the .bam file align to other bacteria (i.e., 1.55% align to Moellerella)

2) Also, I run Canu with an estimated genome size of 3.3m, while the genome of Kurthia is 2.8m.

3) Finally, I used Canu with a downsized 200x raw coverage, down from about a 2500x raw coverage.

Are the above points of concern for obtaining final reliable genome assemblies? We plan to use the genomes to compare the genes among several bacteria to assess if there is a loss of genes in ones that are matched with a gain of genes in others.

Best regards,

To be continued ... (following post)

skoren commented 5 years ago
  1. It might be a plasmid not present in the reference you chose or the current references (depending on how well samples those bacteria are). It is also quite possible to have contamination given that you had incorrect information on what was sequenced. I'd suggest classifying all your reads using either a classifier like Centrifuge or Kraken to see what else may possibly be in the sample.

  2. That small a difference shouldn't matter.

  3. If you have contamination that might eliminate some of the contaminating sequence if it's at lower abundance. If you have contamination and its the dominant organism in the sample, the random downsampling might only give you the contaminant sequences instead. You could try running the suggested metagenomic parameters from the FAQ to see if you get more genomes assembled. You could wait for results from classification in 1 to do this since if those method don't find anything then there probably isn't anything else to assemble.

pomidorku commented 5 years ago

Thank you for your answer.

Allow me to continue my story.

In my previous post I mentioned that I have 2 draft Canu assembled genomes. The second one is what I was told it is Ignatzschieniria3 (13 contigs, the largest one 3464258bp)

For this draft genome, when running blast against representative genomes from NCBI, Moellerella is the best match for all contigs, and Providencia the second best match. For example, for the longest Ignatzschineria3 contig (3464258bp), the first best match is Moellerella wisconsesis with 5% query coverage and 98% identity and a total score of 3.365e+05; the second-best match is Providencia with 27% query coverage and 87% identity and a total score of 1.140e+06.

The alignment of Ignatzschineria3 using minimap2 against various reference sequences produced the following:

It aligns to Moellerella wisconsensis with 80.90% (599270) out of 740728 fragments It aligns to Providencia retteri with 20.33% (133975) out of 659123 fragments It aligns to Providencia alcalifaciens with 19.46% (129215) out of 664124 fragments It aligns to Ignatzschineria larvae with 1.36% (8303) out of 609223 fragments It aligns to Ignatzschineria ureiclastica with 2.46% (15101) out of 613190 fragments

The blast search of the presumed Ignatzschineria3 draft sequence against the 16s rRNA database of NCBI shows that only the largest contig contained matches to 16s rRNAs. Te best match for the 16s rRNAs was Moellerella and the tree produced by blast grouped the presumed Ignatzschineria3 with Moellerella.

It seems to me that the sequenced Ignatzschieniria, may in fact be Moellerella, but with heavy Providencia contamination and little presence of Ignatzschieniria DNA.

I will proceed as you recommend above.

We have a job script for kraken available for our cluster. But, I assume either one (Centrifuge or Kraken) is fine for this purpose. Just one question, both Centrifuge and Kraken were designed for metagenomics (I understand that type of work involves working with small size sequences). Would that be a problem for pacbio long reads?

You suggest I try running metagenomic parameters from the FAQ section. That sounds like a good strategy to deal with contamination.

What do you think about this other alternative?

  1. Use minimap2 to align the sequences in the .bam file to a reference sequence.
  2. Extract the sequences that match the reference genome (the one I want to assemble).
  3. Use parameters for large genomes assembly to assemble my contaminated genomes (this may help to reduce the "contigging" of reads that are not from the same species). The parameters are: corMhapSensitivity=high 'corOutCoverage=100' 'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50' 'corMinCoverage=0' 'corErrorRate=0.5'

What do you think about it?

Regards,

skoren commented 5 years ago

Having 5% query coverage seems quite low so if it is Moellerella, it seems relatively diverged. Both Kraken and Centrifuge should work relatively well with long reads, Centrifuge is used by Oxford Nanopore in their analysis pipeline for metagenomic data.

The issue with mapping to the reference is you'll blind yourself to anything different. If your genome has a novel plasmid or a large insertion then those reads just won't map and you won't assemble it. Since you're trying to look for gene absence/presence downstream then this will be an issue. In general, it's going to be very hard to confirm that a gene is present in a given bacteria given the sample mixing. You could go back and confirm the genes are in the same read but that won't be very long range info.

pomidorku commented 5 years ago

Hum,

This look like a very messy situation, specially the Ignatzschineria3 genome. I wonder if it is worth the effort to try to assemble this seemingly highly contaminated set of reads. Or it may be best to sequence again the genome (because of the downstream application we plan to use it for).

Let me just ask this one more time and I will take some time to think about it.

In you opinion, is it reasonable If I just use:

3) Use parameters for large genomes assembly to assemble my contaminated genomes (this may help to reduce the "contigging" of reads that are not from the same species). The parameters are: corMhapSensitivity=high 'corOutCoverage=100' 'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50' 'corMinCoverage=0' 'corErrorRate=0.5'

I ask this because a computer scientist colleague of mine thinks that may be a good way to go.

Regards,

pomidorku commented 5 years ago

Is there any way to post blast graphics? I would like to show you the blast picture of the Ignatzschineria3 blast results restricted to the genome of Moellerella.

pomidorku commented 5 years ago

Dear Sir,

In your experience, is there a kraken or minimap threshold that would allow to diagnose that a contaminant is present in the raw pacbio sequences?

Can you take a look at the following files? - (attached at the end of this message)

1_Minimap2_pacbio_raw_genomes. This table shows the mapping results of raw pacbio sequences to NCBI genome sequences. Minimap2 was used for those mappings. The thin bottom borders are used to group closely related reference sequences.

I do not have experience in the interpretation of these results, therefore, I think I am playing it safe: I am "assigning" the raw reads to a genus if they map at about 80% to the reference sequence and to "contaminants" to those reads that match at least at a 20% level to the reference sequence, given that a genus can be "assigned" (according to my criterion above)?.

Is my guess reasonable?

2_Kraken2_pacbio_raw_genomes. This table shows the classification of pacbio and their equivalent illumina raw sequences. The reference sequences are grouped according to the NCBI taxonomy.

This table only shows the reference sequences to which the raw sequences classify to above the 1% level (the exception is the 0.97% highlighted in gold color).

Notice that none of the raw sequences are classified at the 100% level. Also, there are many other classification hits besides the highest percentage highlighted in green.

What hits can be called contaminants in this table?

Let me know if you can help me with this issue.

Regards,

Attachments: 1_Minimap2_pacbio_raw_genomes.xlsx 2_Kraken2_pacbio_raw_genomes.xlsx

skoren commented 5 years ago

I don't have the expertise in those tools to set cutoffs, it's is non-trivial to confidently detect contaminant, especially if the contaminant isn't present in the databases. There can also be false mappings due to similarity between your contaminant and target genome sequence. I'm not sure how similar the references you're looking at are.

I'm also not sure how to exactly interpret your criteria is for the 80% or the 20% (I assume it is fraction of read covered but not sure). As for Kraken, it has a classification filter to remove low-confidence assignments but again DB plays a big role in the accuracy of the classification. So I'd suggest asking the developers of the tools instead for guidance.

pomidorku commented 5 years ago

Thank you very much fro your help

pomidorku commented 5 years ago

Dr. Koren,

Thank you for all your help last year, when I had so many problems with the assembling of PacBio-sequenced genomes. After checking the statistics of all of those genomes, I am leaning to believe that someone at the sequencing facility messed up the assignment of the adapters to the respective genomes (they were multiplexed). Anyway, it was decided at our lab that those genomes are not usable for research purposes due to the uncertainty of their status (mislabeled? contaminated?).

The lab has sequenced other genomes by using Illumina and PacBio. I run SPAdes to assemble a couple of the illumina-sequenced genomes and canu for a couple of PacBio-sequenced ones.

I want your advise on the following:

  1. For the Pacbio sequences I plan to polish them by using Arrow and then, circularize them by using circlator, in that order. Is this the way you would advise to proceed with such genomes?

  2. For the illumina genomes, I run a test assembly by using the pipeline named "Comprehensive Genome Analysis" of PATRIC (https://www.patricbrc.org/), which does the following for the assembly step: Runs BayesHammer on reads Assembles with Velvet, IDBA and SPAdes Sorts assemblies by ARAST quality score

For those same genomes, I run also SPAdes in a super-computer.

The Illumina assemblies from PATRIC and super-computer are different in number of bases, number of contigs (scaffolds), as well as their mapping stats to the reference sequences.

I want to do the best alignment possible with the super-computer.

What step would you recommend next (e.g., using pilon to polish the sequences and then circularize them y using the appropriate tool?), or just pick ONE of the assemblies (PATRIC or super-computer) and move on to annotation (PATRIC already provides the annotation) and perform the other analyses we plan for those genomes.

Any advise will be very much appreciated.

Regards,

skoren commented 5 years ago
  1. I think that is reasonably, you probably want to polish after circulator since it will add/trim bases and you want to make sure those get polished properly too.

  2. I would pick one of the assemblies and proceed yes.

pomidorku commented 5 years ago

Dr. Koren, Thank you for your prompt response. How refreshing to get some straight advise.

Some experts do a preamble on how "you need to know what to do with your genome" and so on before not really giving any advise.

Thank you very much. I really appreciate your help.

Best regards,

pomidorku commented 5 years ago

Dr. Koren, I wonder if SMRTLink tools can be installed in a laptop. It seems that the compute requirements are too large for a laptop. If that is the case, is there any "smaller" version of it so I could install it in a laptop 1TB HDD, 16-32GB RAM, i7-8750H 2.20GHZ 6 core processor? Regards,

skoren commented 5 years ago

Have never tried installing it on a laptop so not sure, it likely won't work on windows or OS X so unless the laptop is linux I wouldn't try. You can contact PacBio to ask about SMRTlink since they develop it.

pomidorku commented 5 years ago

thank you. I run Ubuntu 18.04 of my laptop (dual boot, win 10 and ubuntu).

pomidorku commented 4 years ago

Dr. Koren,

What strategy would you advise for a hybrid (illumina + pacbio) assemble a bacterial genome ~5 million bp in size? I have the pacbio subreads.fastq and the illumina fastq reads (~100 bp and 120 bp). SPAdes 13 can perform hybrid analysis and I understand that it takes CSS and CLR pacbio data as input. I only have the subreads. Is it necessary for me to get the CSS? If so, how do I go about getting them? Just pointing me in the right direction would be enough.

Do you think there is a better option than SPAdes for hybrid genome assembly?

Regards,