marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
659 stars 179 forks source link

Assembly Improvement #1161

Closed calandryll closed 5 years ago

calandryll commented 5 years ago

I am working on assembling the genome of a phytoplankton. We have estimated the genome size to be approximately 150 Mbp. In addition to the PacBio RSII sequences, I have single-end Illumina RNAseq reads from a previous experiment, as well as access to a consensus assembly of the transcriptome from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). From my RNAseq experiment, I created a de novo assembly using Trinity to produce an assembly with ~14,000 contigs. Several contings from the RNAseq assembly were selected, primers created and differential expression tested via qPCR.

With that background knowledge, prior to assembling the genome, I removed reads that mapped to the chloroplast and mitochondria in hopes of getting a better assembly, in addition removed any potential reads that were mapped to bacterial sequences since this was a non-axenic culture. All commands were run using canu 1.7.1. And after reading through several suggestions (#254, #221, and FAQ), I ran through 5 correction iterations using the following command, modifying the raw option for the previous corrected reads:

canu \
    -correct \
    -d ha-150-r1 \
    -p heterosigma \
    -pacbio-raw /media/science/heterosigma/originals/heterosigma_final.fasta \
    corMinCoverage=0 \
    corMhapSensitivity=high \
    corOutCoverage=500 \
    genomeSize=150m

The final report of the 5 correction round can be found here.

After the iterations of correction, I went through trimming and assembly:

canu \
    -trim-assemble \
    -d ha-150-t2 \
    -p ha-e105 \
    -pacbio-corrected /media/science/heterosigma/assemblies/ha-150-r5/heterosigma.correctedReads.fasta.gz \
    genomeSize=150m \
    minReadLength=500 \
    correctedErrorRate=0.105 \
    corMhapSensitivity=high \
    corMinCoverage=0 \
    corOutCoverage=500

Report here

The resulting assembly gives a similar number of contigs that I saw in the Trinity assembly. Using the reads from my previous transcriptome experiment and data from the MMETSP, alignments rates of the reads to the canu assembly were ~30% for both datasets. Additionally, taking the primers that were verified by qPCR and searching the canu assembly find only a handful within the assembled file. They do appear within the unassembled files but coverage is low, i.e. 1 or so.

Are there any suggestions on improving the assembly? Once concern I have is there does not appear to be a distinct peak in the kmers in the unitigging. Is this a potential size of heterozygosity or poor sequencing data? Any help is greatly appreciated. If more information is needed within the directories, etc. can be found here.

skoren commented 5 years ago

Your report isn't visible, you can upload them directly to the issue instead. You've got most of the suggested parameters, except the heterozygous ones but those wouldn't help improve the k-mer histogram.

calandryll commented 5 years ago

I had hoped that I could use my server to show the reports.

After 5 iterations of correction: heterosigma.txt

Trim and Assembly: ha-e105.txt

skoren commented 5 years ago

It seems from the reports most of the reads have overlaps but they aren't leading to very good correction. You are losing 50% of your coverage in trimming even at 10% error and after five rounds of correction. The pacbio reads should be over 98% identity after a single round for most cases so this is quite suspicious. Your assembly is also already 300mb which is double the expected size.

How confident are you in the genome size estimate? There is maybe a peak at 3-4x coverage in the corrected reads (you could put the corrected read histogram file into genome scope to estimate genome size/read accuracy though it might not work with that histogram). That would imply closer to a 1gb genome size. This is also consistent with the low-coverage overlaps in the report:

--   low-coverage        83543   15.35     3838.53 +- 3132.07          3.29 +- 1.43       (easy to assemble, potential for lower quality consensus)

and the fact that your assembly of 300mb has only 30% of the transcripts mapping. A large fraction of your reads (7%) also have no overlap off one end, the rates of reads with bad middle junctions is significantly lower than this so this again seems to imply a coverage issue:

--   no-5-prime          20007    3.68     8984.75 +- 5133.38        162.62 +- 486.69     (bad trimming)
--   no-3-prime          19060    3.50     8946.76 +- 5219.98        173.77 +- 505.99     (bad trimming)
calandryll commented 5 years ago

Thanks for the helpful information. I am not very confident in the genome size estimation. When we first started working on the assembly there was some discussion about genome size. One researcher in the field suggested this algae has an estimated genome size of half the human genome ~1.5 Gbp. One of the first iterations that I ran was based on that information but before I delved further into optimizing the anaylsis, i.e. multiple run corrections, etc.

canu \
    -d ha-1500-cor80 \
    -p heterosigma \
    -pacbio-raw /media/science/heterosigma/originals/heterosigma_wout_bac_organelles.fasta \
    genomeSize=1500m \
    corOutCoverage=80 \
    corMhapSensitivity=normal \
    minReadLength=500

1500 Gbp Report

However, I moved away from this size after looking into related algal species and their assembled or estimated genome sizes were closer to 150 Mbp.

I will try using a 1 Gbp setting. I'll try going through the multiple iterations of corrections then trim and assembly.

skoren commented 5 years ago

The genome size setting isn't going to change the assembly. If the genome really is 1gb or larger, you just don't have enough coverage to get an assembly. We recommend at least 20x for Canu.

If you have Illumina data from your organism I'd try to use that to estimate the genome size rather than relying on related species. You could also try using genomescope on the histogram file produced by canu (or post the unitigging/0-mercounts/*.histogram file here and I can try it).

calandryll commented 5 years ago

Histrogram Thanks, it seems that genomescope does like when I try uploading. I unfortunately was not part of the original extraction and sequencing, and I figured we would have an issue with coverage. I'll have to talk with the other researchers about looking into Illumina sequencing or getting another round of PacBio sequencing.

skoren commented 5 years ago

Nothing useful from genome scope here, the peak is too low and not clear enough.

It is possible the genome is smaller and you have very high heterozygosity in the pooled sequencing. However, I wouldn't expect that to give you only 30% of the transcriptome. I'd check the transcriptome mapping to confirm the identities are good (implying the data corrected ok) and check for duplicate genes (which should be common if the genome is 150mb and the asm is 300mb). If the identities are good and there isn't duplication, I'd expect the genome is larger than you expect and I'd try to get a better estimate. You can try hybrid assembly with 5x or fill gaps in an illumina assembly, you just can't generate pacbio-only assemblies at such low coverage.

calandryll commented 5 years ago

Thanks for all the help. I'll talk with the others in the group about what our next steps will be.