Closed: hp2048 closed this issue 4 years ago
Hello hardip,
I have been working quite a lot with hybrid corrected reads and, in my experience, Flye has no issues assembling reads that contain stretches of corrected and uncorrected regions. I have always used the --ont-corr option, which worked just fine for me, but according to this Twitter thread, the --pacbio-hifi option also works if the correction is good enough.
Shameless plug: What I stated above is based on my experience with hybrid corrected reads from Ratatosk, a tool I have designed. Code is here and preprint is available here. Happy to help if you want to give it a shot.
Hi @GuillaumeHolley: Apart from the errors in running Flye, I noticed a change in assembly quality. Please have a look at the stats for the PacBio raw data assembly and the PacBio FMLRC corrected reads assembly using the --pacbio-corr flag. Do you notice this kind of difference in assembly stats between the --ont-raw and --ont-corr flags? Thank you for suggesting Ratatosk. I have recently read the preprint (though not in depth) and was planning to use it. However, I first need to understand Flye a little better, because the assembly stats are drastically different between the two modes.
Cheers
Hardip
Hey @hp2048,
It could be a Flye issue. But from the look of it, I would say that your FMLRC corrected reads are not "correct" enough to be used with the --pacbio-corr flag, which assumes an error rate below 3%. The correction was probably borderline okay for your first data set, the one that assembled successfully but with low quality, and not good enough for the second data set. If you assemble the FMLRC corrected PacBio reads from that first data set using --pacbio-raw rather than --pacbio-corr, you will probably notice a quality improvement over the raw assembly. Or you can just compute the error rate of your PacBio data sets before and after correction to see where you stand.
Cheers, Guillaume
@GuillaumeHolley : Thank you for your kind help. I have only done spot checks on the error rates after correction and reads do align to the reference genome at >99% identity when and where they align. However, as you suggest, it would be a good idea to understand the error rates of FMLRC corrected reads.
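For anyone wanting to go beyond spot checks, here is a minimal Python sketch of one way to estimate the error rate from alignments of the reads to a reference. It assumes the alignments are available as SAM text lines that carry the standard NM:i: edit-distance tag. The function name is made up for illustration, and nothing here is part of Flye or FMLRC. The error rate is taken as the total edit distance divided by the total number of alignment columns (matches, mismatches, insertions and deletions) over primary alignments.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def mean_error_rate(sam_lines):
    """Edit distance per alignment column, pooled over primary alignments.

    Assumption: each alignment record carries an NM:i: tag.
    Records without one are skipped.
    """
    total_nm = 0
    total_columns = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        cigar = fields[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x904 or cigar == "*":
            continue
        nm = next((int(f[5:]) for f in fields[11:] if f.startswith("NM:i:")), None)
        if nm is None:
            continue
        # Alignment columns: M, =, X, I and D. Clipped bases are not counted.
        columns = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "M=XID")
        total_nm += nm
        total_columns += columns
    if total_columns == 0:
        raise ValueError("no usable alignments found")
    return total_nm / total_columns
```

Note that clipped and unaligned portions of reads are ignored, so this only measures identity "when and where they align". Reads that are partly uncorrected can look better by this measure than they really are, so it is worth also checking what fraction of each read is clipped.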
Hi @hp2048,
Flye expects <3% error rate in -corr mode, and <1% in hifi mode. If these assumptions do not hold, we can't guarantee anything will be assembled.
Most likely something is wrong with the reads corrected by FMLRC. Since I am not the author of that tool, I don't know what to suggest. I would try to assemble using raw reads (which is the recommended option anyway). For your second sample, raw reads mode clearly produced better results.
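The thresholds above can be written down as a small decision rule. This is only a sketch in Python: the function name is hypothetical and not part of Flye, the cutoffs are the <1% and <3% figures quoted in this comment, and the error rate is assumed to have been measured separately (for example from alignments to a reference).

```python
def suggest_flye_pacbio_mode(error_rate):
    """Map a measured read error rate (a fraction, e.g. 0.02) to a PacBio flag.

    Cutoffs follow the figures quoted in this thread: below 1% for hifi mode,
    below 3% for -corr mode, otherwise fall back to raw mode.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    if error_rate < 0.01:
        return "--pacbio-hifi"
    if error_rate < 0.03:
        return "--pacbio-corr"
    return "--pacbio-raw"
```

If the corrected reads include uncorrected reads or uncorrected stretches, it is probably safer to apply this rule to the worst reads in the set than to the mean.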
Hi @fenderglass: I have performed error correction on PacBio RS II data using FMLRC and Illumina short reads. One of the features of FMLRC is that the output may contain uncorrected reads, and that a corrected read may still contain uncorrected stretches of bases.
Regardless, I used these error corrected reads for the assembly. In the first set, I used the --pacbio-corr flag for running assemblies. Out of two samples, one ran fine and produced an assembly, but the other failed with the following error message.
Then I ran the sample that had not failed using the --pacbio-hifi flag. The command failed with the following error logs.
I have two questions regarding these trials.
The sample that ran successfully using the --pacbio-corr method had the following assembly stats, compared with the assembly from standard raw data without error correction.
PacBio raw data assembly
PacBio data error corrected using FMLRC used for assembly
From the looks of it, assembly quality is definitely lower, at least in terms of contiguity. Also, the total length of the sequence is significantly lower than expected, as this is a human sample. I was wondering if you could provide some insight into the correct usage of Flye for FMLRC corrected reads? Is it even possible? The main motivation is to reduce run times and to assess whether better, more contiguous assemblies are produced when error correction is performed using Illumina data.
Looking forward to hearing from you. Cheers Hardip
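For reference, the contiguity and total-length comparison described above comes down to a few summary numbers per assembly. Below is a minimal Python sketch of how they can be computed from a list of contig lengths. The function name is made up for illustration, and the lengths in the usage line are toy values, not data from these assemblies.

```python
def assembly_stats(contig_lengths):
    """Return contig count, total length, largest contig and N50.

    N50 is the length of the contig at which the running total, taken from
    the longest contig downwards, first reaches half of the assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("no contigs with non-zero length")
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return {
                "contigs": len(lengths),
                "total_length": total,
                "largest": lengths[0],
                "n50": length,
            }

# Toy example with made-up lengths:
stats = assembly_stats([100, 80, 60, 40, 20])
```

A lower N50 points to reduced contiguity, while a total length well below the expected genome size suggests that part of the genome was not assembled at all.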