marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

Too many missing genes #263

Closed hanqu24 closed 2 months ago

hanqu24 commented 3 months ago

Hi Sergey,

Thank you for creating such an impressive assembly tool! I recently tested versions 2.0 and 2.1 at several genome coverages. However, I found that over 100 BUSCO genes are missing, even at 50-60x coverage. To test how stable Verkko's gene completeness is, I randomly resampled the HiFi data to 60x a second time to see whether the results remained consistent. Unfortunately, the results varied significantly. With v2.1 we observed fewer unassigned bins than with v2.0, but this did not fully resolve the missing genes in this case. More detailed inputs and outputs are listed below:

Inputs:
50x PacBio HiFi, 50x ONT, 50x Hi-C
60x PacBio HiFi, 60x ONT, 60x Hi-C
60x_2 PacBio HiFi, 60x ONT, 60x Hi-C

Outputs: [image: results table]

Please let me know if you need more information. Thanks a lot!
Flora

skoren commented 3 months ago

Thanks for the info. I wanted to clarify some of the table entries. Are the missing genes counted in each haplotype and then summed across both? What do missing genes-X and missing genes-Y denote? Is this an XX or XY sample?

Are you able to share the raw data for this sample? If not, the assembly.homopolymer-compressed.noseq.gfa, assembly.colors.csv, assembly.paths.tsv and assembly.scfmap from your 2.1 assemblies should be enough to take a look locally at what's going on and should be small enough to upload here.

hanqu24 commented 3 months ago

Hi Sergey,

I'm sorry for the confusion. My sample is male. "Missing genes-X" refers to the genes missing from the haplotype that carries the X chromosome, and "missing genes-Y" refers to the genes missing from the other haplotype, which carries the Y chromosome. Note that the haplotype with the Y chromosome often shows more missing genes, some of which are genes located on the X chromosome; those have been filtered out. The columns would therefore be better labeled "missing genes-Haplotype with X" and "missing genes-Haplotype with Y", respectively.
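For readers who want to reproduce this kind of per-haplotype count, here is a minimal sketch. It assumes each haplotype was scored separately and that the results are tab-separated BUSCO full tables in which `#` lines are comments, the first column is the BUSCO id and the second is the status. The gene ids and table contents are made up for illustration, and the function names are hypothetical; this is not necessarily how the table in this thread was produced.

```python
def missing_ids(full_table_text):
    """Return the set of BUSCO ids whose status column reads 'Missing'."""
    missing = set()
    for line in full_table_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1] == "Missing":
            missing.add(fields[0])
    return missing


def compare_haplotypes(hap_x_table, hap_y_table):
    """Split missing genes into shared and haplotype-specific sets."""
    mx = missing_ids(hap_x_table)
    my = missing_ids(hap_y_table)
    return {
        "missing_in_both": mx & my,        # absent from the whole diploid assembly
        "only_missing_in_x_hap": mx - my,  # present on the Y-carrying haplotype
        "only_missing_in_y_hap": my - mx,  # candidates for X-linked genes
    }


# Hypothetical example tables (ids and contig names are invented).
HAP_X_TABLE = "\n".join([
    "# Busco id\tStatus\tSequence",
    "gene_A\tComplete\tctg1",
    "gene_B\tMissing",
    "gene_C\tMissing",
    "gene_D\tComplete\tctg2",
])
HAP_Y_TABLE = "\n".join([
    "# Busco id\tStatus\tSequence",
    "gene_A\tComplete\tctg7",
    "gene_B\tMissing",
    "gene_C\tComplete\tctg8",
    "gene_D\tMissing",
])

result = compare_haplotypes(HAP_X_TABLE, HAP_Y_TABLE)
```

Reporting the intersection separately avoids double counting a gene that is absent from both haplotypes when the two columns are summed.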

I've attached the files you need for 60x and 60x_2 below. Thank you for your time! Please let me know if you require any additional files.

60x_2.zip 60x.zip

Thanks again!

skoren commented 3 months ago

In these cases, it looks like the difference is that ChrX is either in one piece and phased, or split into two pieces with one part left unassigned. This should be addressed by the upcoming verkko release, which will scaffold these types of components together. Until then, it should also be resolved by running yak's sexchr function (https://github.com/lh3/yak) to find sequences with sex chromosome markers in the unassigned bin and move them if needed.

I am curious why this gap is appearing and disappearing between re-runs of the same data; it seems like some unintended randomness in verkko's resolution. I'd like to try running this locally as a test if you're able to share the raw input data. You can upload it here: https://canu.readthedocs.io/en/latest/faq.html#how-can-i-send-data-to-you or point to an AWS/cloud link if it's already available somewhere.

hanqu24 commented 3 months ago

Hi Sergey,

Thanks a lot for your answer. I'm looking forward to the new release! Actually, the two PacBio HiFi read sets have the same coverage but contain different reads (the data were randomly sampled again). Would unintended randomness still be a problem? I've tried several times to upload the data through FTP, but I keep receiving the failure message below. Is there a way to fix this?

[image: screenshot of the failed upload message]

Thanks again!

skoren commented 3 months ago

Ah, different reads could lead to different results, yes. ChrX has several regions that HiFi tends not to sequence well, so it's possible it's a real difference between the extracted reads. I'd still like to check why the ONT isn't patching it.

The FTP is read-only for everyone but me, so once a file transfer fails the file needs a new name, since you don't have permission to overwrite the failed file.
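On the point that different extracted reads can give different assemblies: one way to separate sampling effects from any nondeterminism in the assembler is to make the subsample itself reproducible with a fixed seed. The sketch below is an illustration only. The thread does not say which tool was used for the resampling, and `subsample` and the read names are hypothetical.

```python
import random


def subsample(read_names, fraction, seed):
    """Keep each read with probability `fraction`, using a seeded RNG.

    The same seed and the same input order always select the same reads,
    so an assembly difference between two runs cannot come from the sampling.
    """
    rng = random.Random(seed)
    return [name for name in read_names if rng.random() < fraction]


# Hypothetical read names standing in for a HiFi read set.
reads = [f"read{i}" for i in range(1000)]

run_a = subsample(reads, 0.5, seed=1)  # first 60x-style draw
run_b = subsample(reads, 0.5, seed=1)  # identical draw, same seed
run_c = subsample(reads, 0.5, seed=2)  # independent draw, like "60x_2"
```

Re-running the assembler on `run_a` and `run_b` tests the assembler's own stability, while comparing `run_a` with `run_c` tests sensitivity to which reads were drawn.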

hanqu24 commented 3 months ago

Ok, I get it. It seems the FTP portal only accepts the name ‘anonymous’. No matter what other names I try, it shows login failed.
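Putting the two points together (log in as anonymous, but give the file a new name after every failed transfer), a retry could be sketched with Python's standard `ftplib` as below. The host and remote directory are placeholders to be taken from the Canu FAQ linked above, and the naming scheme and function names are assumptions for illustration, not a convention required by the server.

```python
import ftplib
import os
import time


def unique_remote_name(local_path, attempt, stamp=None):
    """Build a remote file name that differs on every attempt."""
    base = os.path.basename(local_path)
    if stamp is None:
        stamp = time.strftime("%Y%m%d%H%M%S")
    return f"{stamp}.try{attempt}.{base}"


def upload_with_retries(host, remote_dir, local_path, max_attempts=3):
    """Upload anonymously, never reusing the name of a failed transfer."""
    for attempt in range(1, max_attempts + 1):
        remote_name = unique_remote_name(local_path, attempt)
        try:
            with ftplib.FTP(host) as ftp:
                ftp.login()          # no arguments: anonymous login
                ftp.cwd(remote_dir)  # placeholder directory, see the FAQ
                with open(local_path, "rb") as handle:
                    ftp.storbinary(f"STOR {remote_name}", handle)
            return remote_name
        except ftplib.all_errors as exc:
            # A partial file may remain on the server and cannot be
            # overwritten, so the next attempt uses a fresh name.
            print(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"upload of {local_path} failed {max_attempts} times")
```

A call would look like `upload_with_retries("ftp.example.org", "incoming", "60x_hifi.fastq.gz")`, where all three arguments are hypothetical.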
