I think your data is lower quality (or at least in terms of matching between reads). This could be due to the PCR amplification beforehand. There's essentially no overlaps and nothing to reconstruct at typical HiFi error rates. That is why you had so many consensus jobs, there were lots of very short nodes composed of just a few (or 1) reads.
I'd suggest turning on trimming and increasing the error rate (correctedErrorRate=0.025 -untrimmed). You could try several error rates; the default is 0.01, so I'd try the above plus maybe 0.015, 0.03, and 0.05.
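A minimal sketch of such a sweep (the read file name, output prefixes, and the -pacbio-hifi read-type flag are placeholders and assumptions, not taken from this thread):

```bash
# Hypothetical sweep over several corrected error rates with trimming enabled.
# Adjust genomeSize, the coverage cap, and the read file to your data.
for rate in 0.015 0.025 0.03 0.05; do
    canu -p asm_e${rate} -d asm_e${rate} \
         genomeSize=1.265g maxInputCoverage=100 \
         correctedErrorRate=${rate} -untrimmed \
         -pacbio-hifi hifi_reads.fastq.gz
done
```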
Hello @skoren,
thanks one more time for the prompt answer. I will wait for my hifiasm results and compile all the information I have before contacting the sequencing facility to try to pinpoint where the issue is. Meanwhile I will give your suggested canu parameters a go. I will close this issue for now and update you with the results afterwards. Maybe it will be helpful to someone facing the same difficulties I am facing right now.
Best, André
Hello @skoren,
I did some digging and your comment about the data being lower in quality, compared to a standard HiFi library, is correct. When I increased the error rate to 2.5% and added the -untrimmed parameter I got an assembly. A really fragmented one (154k contigs and an N50 of 35k), but still, some things were reconstructed. Please see the trimming logs when a 0.025 error rate was specified:
trimmed (2.5% error rate):

```
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 16833 0.16 6930.20 +- 2279.10 463.30 +- 606.55 (bad trimming)
-- middle-hump 5610 0.05 4084.91 +- 1920.38 1046.51 +- 1237.10 (bad trimming)
-- no-5-prime 36595 0.34 5419.54 +- 2467.18 333.23 +- 558.40 (bad trimming)
-- no-3-prime 17919 0.17 5829.36 +- 2414.60 403.62 +- 635.87 (bad trimming)
--
-- low-coverage 343167 3.22 4246.45 +- 2115.61 10.35 +- 4.81 (easy to assemble, potential for lower quality consensus)
-- unique 4256967 39.97 5115.65 +- 2215.88 53.59 +- 15.86 (easy to assemble, perfect, yay)
-- repeat-cont 1149297 10.79 4226.55 +- 2004.19 1530.70 +- 1562.14 (potential for consensus errors, no impact on assembly)
-- repeat-dove 6431 0.06 9358.10 +- 2267.42 773.91 +- 787.84 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 1295225 12.16 6656.00 +- 2157.41 2170.91 +- 1862.89 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 2609816 24.51 5479.84 +- 1860.06 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 705781 6.63 8237.57 +- 1966.66 (will end contigs, potential to misassemble)
-- uniq-anchor 198710 1.87 6701.58 +- 2120.72 2449.79 +- 1960.48 (repeat read, with unique section, probable bad read)
```
As you can notice, trimming greatly decreases the middle-missing, middle-hump, no-5-prime, and no-3-prime percentages, while increasing the low-coverage and unique ones. So this conclusively shows (at least to me) that higher-than-expected sequencing error rates are present in my data.

To check how high those error rates are, I aligned the reads with pbmm2 against a high-quality draft reference I have in-house and calculated the error rate using the gap-compressed sequence identity. If you consider my reference as being "nearly" perfect (after polishing with arrow at 70X CLR coverage), the estimated error rate in my HiFi libraries is ~3.35%. I also checked the error rate in my trimmed canu fasta file and it is about ~3.32%, which is still quite high even after trimming. Is this normal?
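For what it's worth, a rough way to get the same kind of number without pbmm2 is to read minimap2's PAF output directly, since its de tag is the gap-compressed per-base divergence (a sketch only; file names are placeholders):

```bash
# Rough estimate of the gap-compressed error rate against a draft reference.
# The "de:f:" PAF tag from minimap2 is the gap-compressed per-base divergence,
# so identity ~= 1 - de. Secondary alignments are excluded.
minimap2 -x map-hifi --secondary=no draft_reference.fasta hifi_reads.fastq.gz \
  | awk '{
      for (i = 13; i <= NF; i++)
        if ($i ~ /^de:f:/) { sum += substr($i, 6); n++ }
    } END { if (n) printf "mean gap-compressed error rate: %.4f\n", sum / n }'
```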
Now that I have a clearer picture of what is happening in my data, I will set up another canu run with the parameters correctedErrorRate=0.045 -untrimmed maxInputCoverage=100. Do you recommend any other parameters or a higher error rate?
Best regards, André
Trimming won't improve the read quality, it just removes bad sequence on the ends that doesn't have support from other reads. If the whole read is at a uniform error rate of 3.5%, it will stay that way after trimming. I think you'd need the corrected error rate to be higher than 4.5%; you want to approximately double your read error rate, so 6.5% is probably more reasonable. You could set the trimming one to 4.5 and the final one to 6.5 (correctedErrorRate=0.045 utgOvlErrorRate=0.065) or just set corrected to 6.5 (correctedErrorRate=0.065).
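Spelled out as full commands, the two options would look roughly like this (a sketch only; the read file, output prefixes, and the -pacbio-hifi flag are placeholders carried over from earlier in the thread):

```bash
# Option 1: trim at 4.5% error and assemble (unitig) at 6.5% error.
canu -p asm_045_065 -d asm_045_065 genomeSize=1.265g maxInputCoverage=100 \
     correctedErrorRate=0.045 utgOvlErrorRate=0.065 \
     -untrimmed -pacbio-hifi hifi_reads.fastq.gz

# Option 2: use 6.5% for both trimming and assembly.
canu -p asm_065 -d asm_065 genomeSize=1.265g maxInputCoverage=100 \
     correctedErrorRate=0.065 \
     -untrimmed -pacbio-hifi hifi_reads.fastq.gz
```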
Hello @skoren,
thanks one more time for the feedback. I will run canu with your parameters and come back later to inform you about the outcome. I will close this issue now.
Dear @skoren,
After months of troubleshooting and many assembly runs, I identified, with the help of PacBio bioinformaticians, what was causing the issues with the assemblies: the use of the ultra-low input HiFi protocol for sequencing my target species. The genome is relatively large and full of repeats and, due to the PCR-based nature of the protocol, the libraries did not cover the whole breadth of the genome, causing the massive fragmentation and suboptimal results.
We have now sequenced five different individuals using the low-input protocol, and the individual results are much better in terms of completeness and contiguity. However, since each low-input SMRT Cell yields around 10-13x coverage, the draft assemblies are a bit fragmented, and I would like to pool all the libraries together to obtain a more contiguous genomic reference. My attempts so far at assembling 2, 3, 4, and all 5 pooled libraries with hifiasm have not been very successful. The pooled genome assembly is highly fragmented and the reconstructed genome size is way bigger than expected. I am assuming that this might be due to the heterozygosity and the generation of a multitude of haplotigs during the sequence reconstruction. I am currently playing around with hifiasm and adjusting the duplicate-purging parameters to see if I get any improvements.
Additionally, I would like to try HiCanu as well, but I wonder if I will face the same issue I am having with hifiasm. Is there a way to generate a "squashed" haploid assembly that ignores the different haplotigs and the heterozygosity by fine-tuning the HiCanu parameters? Or must the haplotig purging be done separately with third-party software such as purge_dups (a sketch of that workflow is shown below)?
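For reference, the typical purge_dups workflow looks roughly like the sketch below (command names follow the purge_dups README as I understand it; file names, thread counts, and minimap2 presets are placeholders):

```bash
# 1. Map HiFi reads back to the draft assembly and compute coverage statistics.
minimap2 -x map-hifi -t 16 draft_asm.fasta hifi_reads.fastq.gz | gzip -c > reads.paf.gz
pbcstat reads.paf.gz                       # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs

# 2. Self-align the split assembly to find duplicated haplotigs.
split_fa draft_asm.fasta > draft_asm.split.fasta
minimap2 -x asm5 -DP -t 16 draft_asm.split.fasta draft_asm.split.fasta | gzip -c > self.paf.gz

# 3. Flag and remove duplicated regions / haplotigs.
purge_dups -2 -T cutoffs -c PB.base.cov self.paf.gz > dups.bed
get_seqs -e dups.bed draft_asm.fasta       # produces purged.fa and hap.fa
```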
Below you can find the GenomeScope plots using the five pooled libraries:
Best regards and thanks again for the help, André
You'd likely face similar issues. You can increase the error rate in HiCanu, as you did previously, but that loses one of the main advantages of HiFi data (accuracy). You'd also end up with an assembly mixing different individuals throughout the contigs which is probably not what you want. Your best option is probably to run the samples separately and then compare the assemblies and take the best representation of a genomic region from all samples. Random luck or coverage variation might mean one region of a genome is better assembled in one sample than another. That is, chr1 might come from sample 1 while chr2 is from sample 2. This would still result in a final assembly mixing individuals but at least those switches would be outside the contigs (the contigs would only mix between the two haplotypes within an individual).
Hello @skoren,
Thanks for answering so fast and for the advice. I agree that I might have a better chance of increasing the contiguity of my reconstructions by post-processing the individual assemblies rather than trying to assemble all five pooled samples together. Do you have any suggestion of a tool that would recognize these homologous genomic regions and extract the best representative? I am aware of RagTag (https://github.com/malonge/RagTag); however, I am unsure whether I can provide multiple assemblies at once, so most likely I need to do this in a pairwise, stepwise manner.
I am not aware of a tool to do the multi-way analysis, no. I would suggest picking one assembly (the most continuous/accurate by QV) and aligning the others against it.
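A minimal sketch of that pairwise comparison (assembly file names and the thread count are placeholders; minimap2 with the asm5 preset is one possible aligner choice, not the only one):

```bash
# Align each remaining assembly against the chosen "best" assembly.
BEST=sample1_asm.fasta
for other in sample2_asm.fasta sample3_asm.fasta sample4_asm.fasta sample5_asm.fasta; do
    minimap2 -x asm5 -t 16 "$BEST" "$other" > "${other%.fasta}.vs_best.paf"
done
# The resulting PAF files can then be inspected (e.g. as dotplots) to spot
# regions that are more contiguous or complete in one of the other samples.
```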
Hello @skoren, okay. I will try to implement your suggestions and let you know as soon as I have an outcome. I am closing this issue now. Cheers!
Dear all,
I have recently acquired several HiFi libraries from a single metazoan specimen using the ultra-low input protocol (which includes a PCR amplification step). I followed the recommended pre-processing steps and performed everything in house, starting from the native reads.bam file (filtering on the read-quality tag with "rq":">=0.99" and "rq":">=0.999" to obtain QV>=20 and QV>=30 read sets, respectively). For convenience, the HiFi reads with QV>=30 I will call ultra-HiFi (uHiFi) from this point onwards.
Before running the assembly, I did some sanity checks with the data. Namely, I mapped the uHiFi reads with minimap2 against an in-house draft reference (assembled with Canu v2.2 using the recommended Sequel parameters, then polished, purged, and cleaned of any contaminants with blobtools). The mapping rate was >94%, confirming that the HiFi data correspond to the organism I am interested in and that there is likely no contamination in the data. To summarise some basic info about the genome: the estimated genome size is ~1.265 Gb and the repeat content is extremely high.
With this previous information I ran Canu v2.2 with the following command line:

```
canu -p uhifi-100x -d uhifi-100x genomeSize=1.265g maxInputCoverage=100 uHiFi_libraries.*gz
```
Everything ran normally (and the histogram plots agree with the GenomeScope prediction - see below) until the computation of the cns jobs.
Thousands and thousands of cns instances were submitted to the cluster, and I could see that after the bogart assembler ran, no reconstructed contigs were generated. See the log below from the canu.report file:
From the report above, it is pretty clear that almost all of the HiFi reads could not be assembled, which is making me pretty worried about the quality of the data and extremely confused. From the Unitigging/overlap categories (below), I can see that Canu could identify most reads for the genome reconstruction and, although the proportion of unique reads is quite low (2.50%), the low-coverage and other read categories could still be used for assembly.
Because I am using uHiFi reads, the error rates should not be high, and the error rate log (if I understood it correctly) tells me that the error rate is not an issue.
The best edge filtering log tells me that only a small percentage of reads have at least one edge (0.06%), indicating that virtually all my HiFi reads have no useful overlap. This is scary, and I wonder why.
Finally, since almost no overlap was identified, Canu did not produce any output, yet for some reason it launched thousands of cns jobs, which I killed after more than 50,237 *out files were written in the unitigging/5-consensus/ folder. Furthermore, I could see that in the uhifi-full.ctgStore/ folder there is a total of 87,022 partitions. Why would cns jobs be submitted without any reconstructed contigs?
I investigated the unitigger.err log file, but I couldn't properly interpret the results. I will attach it here in case it is useful.
Does anyone have an idea what is happening? My guess would be that playing around with bogart parameters (e.g., overlap generation limits and overlap processing limits) could actually help me produce an assembly (well, it cannot get any worse, since I had no output). Additionally, I am now running many instances of hifiasm to see how that software performs with the data. I imagine that if the hifiasm results (if any) are also not great, something must be wrong with the data.
Furthermore, does anyone have any idea what could have gone wrong with the library preparation? Since the quality of the libraries is good (in terms of QV) and my HiFi read length N50 is not bad (~10 kb), the source of the issue must come from the ultra-low input library prep (I guess??). The extremely high repeat content of the genome I am working on can also cause assembly issues, but, then again, researchers can assemble through telomeric sequences using HiFi, and that does not prevent them from obtaining at least some assembled sequences.
I appreciate the help, Best, André