mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Use of error corrected reads #289

Closed hp2048 closed 4 years ago

hp2048 commented 4 years ago

Hi @fenderglass : I have performed error correction on PacBio RS II data using FMLRC and Illumina short reads. One characteristic of FMLRC output is that it may contain uncorrected reads, and that a corrected read may still contain uncorrected stretches of bases.

Regardless, I used these error-corrected reads for the assembly. In the first set of runs, I used the --pacbio-corr flag. Out of two samples, one ran fine and produced an assembly, but the other failed with the following error message.

[2020-07-23 11:04:24] INFO: Starting Flye 2.7.1-b1672
[2020-07-23 11:04:46] INFO: >>>STAGE: configure
[2020-07-23 11:04:46] INFO: Configuring run
[2020-07-23 11:52:49] INFO: Total read length: 167345033693
[2020-07-23 11:52:49] INFO: Input genome size: 3000000000
[2020-07-23 11:52:49] INFO: Estimated coverage: 55
[2020-07-23 11:52:49] INFO: Reads N50/N90: 16508 / 5689
[2020-07-23 11:52:49] INFO: Minimum overlap set to 5000
[2020-07-23 11:52:49] INFO: Selected k-mer size: 17
[2020-07-23 11:52:49] INFO: >>>STAGE: assembly
[2020-07-23 11:52:49] INFO: Assembling disjointigs
[2020-07-23 11:52:51] INFO: Reading sequences
[2020-07-23 12:18:22] INFO: Building minimizer index
[2020-07-23 12:18:23] INFO: Pre-calculating index storage
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2020-07-23 12:54:44] INFO: Filling index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2020-07-23 13:33:47] INFO: Extending reads
[2020-07-23 13:38:30] INFO: Overlap-based coverage: 0
[2020-07-23 13:38:30] INFO: Median overlap divergence: 0.0837843
0% 100% 
[2020-07-24 07:14:54] INFO: Assembled 0 disjointigs
[2020-07-24 07:17:26] INFO: Generating sequence
[2020-07-24 07:18:31] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct
[2020-07-24 07:18:31] ERROR: Pipeline aborted

Then I ran the sample that had not failed using the --pacbio-hifi flag. The command failed with the following error logs.

[2020-07-27 07:16:43] INFO: Starting Flye 2.7.1-b1672
[2020-07-27 07:16:49] INFO: >>>STAGE: configure
[2020-07-27 07:16:49] INFO: Configuring run
[2020-07-27 07:48:07] INFO: Total read length: 108017684134
[2020-07-27 07:48:07] INFO: Input genome size: 3000000000
[2020-07-27 07:48:07] INFO: Estimated coverage: 36
[2020-07-27 07:48:07] INFO: Reads N50/N90: 17787 / 6551
[2020-07-27 07:48:07] INFO: Minimum overlap set to 5000
[2020-07-27 07:48:07] INFO: Selected k-mer size: 17
[2020-07-27 07:48:07] INFO: >>>STAGE: assembly
[2020-07-27 07:48:07] INFO: Assembling disjointigs
[2020-07-27 07:48:07] INFO: Reading sequences
[2020-07-27 08:04:30] INFO: Building minimizer index
[2020-07-27 08:04:30] INFO: Pre-calculating index storage
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2020-07-27 08:15:52] INFO: Filling index
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2020-07-27 08:25:52] INFO: Extending reads
[2020-07-27 08:28:24] INFO: Overlap-based coverage: 0
[2020-07-27 08:28:24] INFO: Median overlap divergence: 0.0631229
0% 100% 
[2020-07-27 14:22:58] INFO: Assembled 0 disjointigs
[2020-07-27 14:23:51] INFO: Generating sequence
[2020-07-27 14:24:16] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct
[2020-07-27 14:24:16] ERROR: Pipeline aborted
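
For reference, the "Estimated coverage" line in both logs is consistent with dividing the total read length by the input genome size and rounding down. The sketch below reproduces that arithmetic. The function name is hypothetical, and the rounding rule is inferred from the logged numbers rather than taken from Flye's source. Both runs therefore start with ample input depth, even though the overlap-based coverage reported later is 0.

```python
def estimated_coverage(total_read_length: int, genome_size: int) -> int:
    """Return read depth as total bases divided by genome size, rounded down.

    Assumption: floor division matches the logged values; this is an
    observation about the logs, not a statement about Flye's implementation.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_length // genome_size


# Values copied from the two logs above.
print(estimated_coverage(167345033693, 3000000000))  # 55
print(estimated_coverage(108017684134, 3000000000))  # 36
```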

I have two questions regarding these trials.

  1. Is Flye capable of handling corrected and uncorrected data mixed together?
  2. I am not sure whether the error is due to incorrect usage of Flye or to a bug. If it is not too much trouble, could you please let me know what may be happening?

The sample that ran successfully with the --pacbio-corr flag had the following assembly stats, shown next to those of the assembly built from the standard raw data without error correction.

PacBio raw data assembly

    Total length:   2828504760
    Fragments:  3880
    Fragments N50:  17814282
    Largest frg:    87722289
    Scaffolds:  36
    Mean coverage:  34

PacBio data error corrected using FMLRC used for assembly

    Total length:   2105901291
    Fragments:  6940
    Fragments N50:  660073
    Largest frg:    5533276
    Scaffolds:  68
    Mean coverage:  37

From the looks of it, the assembly quality is definitely lower, at least in terms of contiguity. Also, the total length of the sequence is significantly lower than expected, as this is a human sample. I was wondering if you could provide some insight into the correct usage of Flye for FMLRC-corrected reads. Is it even possible? My main motivation is to reduce run times and to assess whether better, more contiguous assemblies are produced when error correction is performed using Illumina data.

Looking forward to hearing from you. Cheers Hardip

GuillaumeHolley commented 4 years ago

Hello hardip,

I have been working quite a lot with hybrid corrected reads and from my experience, Flye has no issues with assembling reads containing stretches of corrected/uncorrected regions. I had always used the --ont-corr option which worked just fine for me but according to this Twitter thread, the option --pacbio-hifi also works if the correction is good enough.

Shameless plug: What I stated above is based on my experience with hybrid-corrected reads from Ratatosk, a tool I have designed. Code is here and preprint is available here. Happy to help if you want to give it a shot.

hp2048 commented 4 years ago

Hi @GuillaumeHolley : Apart from the errors in running Flye, I noticed a change in assembly quality. Please have a look at the stats for the PacBio raw data assembly and for the assembly of FMLRC-corrected PacBio reads using the --pacbio-corr flag. Do you notice a similar difference in assembly stats between the --ont-raw and --ont-corr flags? Thank you for suggesting Ratatosk. I have recently read the preprint (not in depth, though) and I was planning to use it. However, I first need to understand Flye a little better, because the assembly stats are drastically different between the two modes. Cheers Hardip

GuillaumeHolley commented 4 years ago

Hey @hp2048,

It could be a Flye issue. But from the look of it, I would say that your FMLRC-corrected reads are not "correct" enough for the --pacbio-corr flag, which assumes an error rate below 3%. The correction was probably borderline acceptable for your first data set (the one that assembled, but with low quality) and insufficient for the second. If you assemble the FMLRC-corrected reads from the first data set with --pacbio-raw rather than --pacbio-corr, you will probably see an improvement over the raw-read assembly. Alternatively, you can compute the error rate of your PacBio data sets before and after correction to see where you stand.

Cheers, Guillaume

hp2048 commented 4 years ago

@GuillaumeHolley : Thank you for your kind help. I have only done spot checks on the error rates after correction and reads do align to the reference genome at >99% identity when and where they align. However, as you suggest, it would be a good idea to understand the error rates of FMLRC corrected reads.
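
One way to go beyond spot checks is to aggregate identity over all alignments of the reads against a reference. The sketch below is a minimal example that assumes the alignments are available in PAF format (for example, from an aligner such as minimap2 run with base-level alignment), with one alignment per read. The function name is hypothetical. It relies only on the standard PAF columns: query length, query start and end, number of matching bases, and alignment block length. It also reports how much of each aligned read is covered, because a read can show >99% identity where it aligns while large parts of it do not align at all.

```python
from typing import Iterable, Tuple


def paf_error_summary(paf_lines: Iterable[str]) -> Tuple[float, float]:
    """Return (error_rate, aligned_fraction) aggregated over PAF records.

    error_rate       = 1 - matching bases / alignment block length
    aligned_fraction = aligned query bases / total length of aligned reads

    Assumptions: one alignment per read (e.g. primary alignments only), and
    reads with no alignment are absent from the PAF, so they are not counted.
    """
    matches = 0
    block_len = 0
    aligned_bases = 0
    query_lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        qname = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        matches += int(fields[9])      # column 10: number of matching bases
        block_len += int(fields[10])   # column 11: alignment block length
        aligned_bases += qend - qstart
        query_lengths[qname] = qlen
    if block_len == 0:
        raise ValueError("no alignments found")
    error_rate = 1.0 - matches / block_len
    aligned_fraction = aligned_bases / sum(query_lengths.values())
    return error_rate, aligned_fraction


# Two made-up records, for illustration only.
example = [
    "read1\t1000\t0\t1000\t+\tchr1\t5000\t100\t1100\t990\t1000\t60",
    "read2\t2000\t0\t1000\t-\tchr1\t5000\t2000\t3000\t950\t1000\t60",
]
rate, fraction = paf_error_summary(example)
print(f"error rate: {rate:.3f}, aligned fraction: {fraction:.3f}")
```

Running the same summary on the reads before and after correction gives the two numbers to compare.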

mikolmogorov commented 4 years ago

Hi @hp2048,

Flye expects <3% error rate in -corr mode, and <1% in hifi mode. If these assumptions are not true, we can't guarantee anything will be assembled.
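
As a rough illustration of these thresholds, the sketch below maps a measured read error rate to one of the PacBio mode flags mentioned in this thread. The function name is hypothetical and the cut-offs are simply the <1% and <3% figures quoted above; it is a reading aid, not part of Flye.

```python
def suggest_pacbio_mode(error_rate: float) -> str:
    """Pick a Flye PacBio read-type flag from a per-read error rate.

    Assumption: thresholds are the ones quoted in this thread
    (<1% for hifi mode, <3% for corr mode); everything else is treated as raw.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    if error_rate < 0.01:
        return "--pacbio-hifi"
    if error_rate < 0.03:
        return "--pacbio-corr"
    return "--pacbio-raw"


for rate in (0.005, 0.02, 0.08):
    print(rate, suggest_pacbio_mode(rate))
```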

I think that, most likely, something is wrong with the reads corrected by FMLRC. Since I am not the author of that tool, I don't know what to suggest. I would try to assemble using raw reads (which is the recommended option anyway). For your second sample, raw reads mode clearly produced better results.