marbl / merqury

k-mer based assembly evaluation

How to deal with HiFi, ONT UL, and Hi-C data together? #122

Open hanqu24 opened 3 months ago

hanqu24 commented 3 months ago

Hello!

I wanted to express my gratitude for developing such excellent software! I recently assembled diploid haplotypes using PacBio HiFi data, ONT ultra-long data, and Hi-C data. I built a separate meryl db for each data type and used union-sum to merge them all into a single read-db.meryl input. I would like to confirm whether this approach is suitable.

Additionally, I have Illumina short read data available, but I did not utilize it for the assembly process. Do you think my plan is feasible?

Thank you for your time! I look forward to your guidance.

Best, Han

hanqu24 commented 2 months ago

Hello again,

I have seen your suggestion regarding HiFi+NGS for QVs and NGS for spectra and completeness. Following up on my previous question: I did not use NGS data for the assembly at all. In my case, how much NGS read coverage should I use for building the meryl db? For example, my HiFi is 60x, ONT is 60x, and Hi-C is 60x.

Looking forward to your reply.

Thank you!

zongzone commented 2 months ago

I am trying to do a hybrid build on my side. Can you provide details of the Illumina+HiFi build? Thanks!

arangrhie commented 2 months ago

Hello,

Yes, I'd recommend using k-mers from HiFi and Illumina (or Element, Onso, ...). Anything over 50x is in my comfort zone.

I wouldn't recommend just union-summing everything. I do the spectra-cn analysis on one platform, and the QV and error profiling on the hybrid db. See #123 or https://github.com/arangrhie/T2T-Polish/tree/master/merqury#2-hybrid for building hybrid DBs.
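To make the union-sum concern concrete, here is a toy Python sketch of what a union-sum style merge does to k-mer counts. This is not meryl itself: `count_kmers` and `union_sum` are hypothetical helpers written for illustration, the reads are made up, and only the forward strand is counted. It shows one plausible consequence of merging everything: every k-mer seen on any platform, including k-mers created by read errors, ends up in the merged db, and the counts no longer reflect a single platform's coverage.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer in a list of reads (forward strand only, for brevity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def union_sum(*dbs):
    """Keep k-mers present in any input db and add up their counts."""
    merged = Counter()
    for db in dbs:
        merged.update(db)  # Counter.update adds counts instead of replacing them
    return merged

k = 5
accurate_reads = ["ACGTACGGA", "ACGTACGGA"]  # made-up reads without errors
noisy_reads = ["ACGTTCGGA"]                  # made-up read with one substitution

accurate_db = count_kmers(accurate_reads, k)
noisy_db = count_kmers(noisy_reads, k)
merged_db = union_sum(accurate_db, noisy_db)

# k-mers that exist in the merged db only because of the erroneous read
only_in_noisy = sorted(set(noisy_db) - set(accurate_db))
print(len(accurate_db), len(merged_db), only_in_noisy)
```

Keeping one platform's db for the spectra and a separately built hybrid db for QV, as suggested above, avoids mixing these signals.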

hanqu24 commented 2 months ago

Hi Arang,

Thank you for your detailed explanation! I have an additional question. I have a total of 60x HiFi, 60x ONT, 60x Hi-C, and 100x Illumina, and I want to carry out a benchmarking analysis to determine the minimum coverage required for a T2T assembly.

I've divided the data into several combinations, such as group#1 with 30x HiFi, 30x ONT, and 30x Hi-C, and group#2 with 40x HiFi, 40x ONT, and 40x Hi-C, etc. For each group, what Illumina coverage would you recommend for spectra and QV? Should group#1 use 30x Illumina to match the coverage, or use all 100x?

Thanks again!

arangrhie commented 2 months ago

If you'd like to validate assemblies from the same genome using different strategies, it would be better to control the evaluation set and use the same k-mers. I'd probably use 31-mers from the 100x Illumina if I had the same settings.

hanqu24 commented 2 months ago

Hi Arang,

Thank you for your input! I noticed in your wiki that a k-mer size of 21 is considered optimal for a human genome. Could you please elaborate on why you suggest using 31 mers instead in this scenario? Is it due to the high coverage of Illumina sequencing?

Many thanks!!
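For context on the k=21 figure: the Merqury paper describes choosing k from the genome size G and a tolerated collision rate p as k = log4(G(1-p)/p). The sketch below is a minimal Python rendering of that formula, assuming p = 0.001 and a 3.1 Gbp genome; the function name `best_k` is only illustrative here, and Merqury's own `best_k.sh` script is the tool to use in practice.

```python
import math

def best_k(genome_size, collision_rate=0.001):
    """k = log4(G * (1 - p) / p): the smallest k at which a random k-mer
    is expected to collide by chance at a rate of about p."""
    return math.log(genome_size * (1 - collision_rate) / collision_rate, 4)

k_human = best_k(3.1e9)  # assumed human genome size of 3.1 Gbp
print(f"{k_human:.2f} -> use k={math.ceil(k_human)}")
```

This bound only covers chance collisions in a random sequence of that size. It says nothing about sequencing errors, coverage, or repeats.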

arangrhie commented 1 month ago

From experience, k=31 has generally been more reliable with the more recent HiFi and Illumina datasets. k=21 can inflate the QV, especially when the coverage is high. For example, a sequencing error in a read could accidentally create a k-mer that should not exist in the reads. Another example occurs in repetitive regions, where a few k-mers overlapping an erroneous base in the assembly could exist elsewhere in another repeat copy, given the limited k-mer space in the repetitive region. Longer k sizes are more conservative, but require higher sequencing coverage to reliably collect k-mers, including those from difficult-to-sequence regions.
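A small worked example may help show why coincidental k-mer matches inflate the QV. The sketch below follows the QV definition from the Merqury paper: the per-base error rate is E = 1 - (1 - K_asm / K_total)^(1/k), where K_asm is the number of assembly k-mers not found in the reads, and QV = -10 log10(E). The function name and all counts are made up for illustration; this is not Merqury's code.

```python
import math

def kmer_qv(asm_only_kmers, total_asm_kmers, k):
    """Consensus QV from the number of assembly k-mers absent from the read db."""
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers: QV is unbounded
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)

k = 21
total_kmers = 3_000_000_000  # hypothetical number of k-mers in the assembly

# Hypothetical: 21,000 assembly k-mers truly overlap an error.
qv_all_errors_detected = kmer_qv(21_000, total_kmers, k)

# Hypothetical: 6,000 of those k-mers also occur in the read db by coincidence
# (a read error or another repeat copy), so they no longer count as errors.
qv_some_errors_masked = kmer_qv(15_000, total_kmers, k)

print(f"{qv_all_errors_detected:.2f} vs {qv_some_errors_masked:.2f}")
```

Every erroneous k-mer that happens to be present in the read db lowers K_asm, so the reported QV goes up even though the assembly is unchanged. A larger k makes such coincidences less likely.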