schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0

Heterozygous peak identified as errors? #131

Open Haoran-Xue opened 4 months ago

Haoran-Xue commented 4 months ago

Hello,

I ran kmc and kmc_tools with PacBio HiFi sequences of a diploid plant species:

kmc -m128 -k21 -t40 -ci1 -cs10000 xxx.hifi.fastq.gz xxx xxx_tmp
kmc_tools transform xxx histogram xxx.histo

Then I submitted the histo file to GenomeScope2.0 (http://genomescope.org/genomescope2.0/), with "K-mer length: 21, Ploidy: 2, Max k-mer coverage: -1, Average k-mer coverage for polyploid genome: -1".

This is the linear plot I got:

linear_plot

It seems that the first peak (the heterozygous peak) was identified as errors. Is there any way to avoid this?

Thank you!

fperezcobos commented 3 months ago

Hi,

I had the same problem with PacBio HiFi sequences of a diploid plant species. The plot looks like this:

Brassica_linear_plot

Any help?

SamCT commented 2 months ago

Also seeing this with one of our genomes. Out of a batch of four Revio SMRT cells (all the same species), one plot looks like the above. The SMRT cell that produced this plot has a much higher number of reads than the others, but besides that nothing stands out. The other three plots looked reasonable. I'm curious to know what is causing this.

mschatz commented 1 month ago

The automatic model-fitting algorithm can get confused if the coverage is too high or if the relationship between the homozygous and heterozygous peaks is ambiguous. The easiest way to address this is to use the "Average k-mer coverage for polyploid genome" parameter, which gives a hint as to where the first peak (the heterozygous peak) is located. For these datasets I would try a value of about 100. If that doesn't work, the next easiest thing to do is to downsample the read dataset to reduce the coverage. From a raw read file, you can just use 'head' to select the first N lines of the file, which reduces the number of reads and serves as a random downsample (assuming the reads have not been aligned or otherwise processed).

Good luck!

Mike
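
The 'head' downsampling suggested above could be sketched as follows. This is a minimal illustration, not a tested recipe from the GenomeScope authors: the function name downsample_fastq, the toy file names, and the read count are all made up for the example. It assumes a gzipped FASTQ with strict four-line records, so the number of lines kept is the number of reads times four; the input is decompressed first because 'head' counts lines of plain text.

```shell
#!/usr/bin/env bash
# No "pipefail" here: head closes the pipe early on large inputs, so the
# upstream gzip is expected to stop on SIGPIPE and that should not abort the script.
set -eu

# downsample_fastq IN.fastq.gz OUT.fastq.gz N_READS   (hypothetical helper)
# Keep the first N_READS records of a gzipped FASTQ file.
# Assumes each record is exactly 4 lines (no wrapped sequence lines).
downsample_fastq() {
    local in_fastq="$1" out_fastq="$2" n_reads="$3"
    local n_lines=$(( n_reads * 4 ))
    gzip -cd "$in_fastq" | head -n "$n_lines" | gzip > "$out_fastq"
}

# Toy demonstration: build a 5-read FASTQ, then keep the first 3 reads.
workdir="$(mktemp -d)"
for i in 1 2 3 4 5; do
    printf '@read%s\nACGTACGTAC\n+\nIIIIIIIIII\n' "$i"
done | gzip > "$workdir/toy.fastq.gz"

downsample_fastq "$workdir/toy.fastq.gz" "$workdir/toy.sub.fastq.gz" 3

# Count the lines that survived and report the number of reads kept.
kept_lines="$(gzip -cd "$workdir/toy.sub.fastq.gz" | wc -l | tr -d ' ')"
echo "kept $(( kept_lines / 4 )) reads"
```

On real data, the downsampled file would then go through the same kmc and kmc_tools commands shown in the original post, and the new histogram would be resubmitted to GenomeScope. How many reads to keep depends on the genome size and read lengths, so the value is left to the user.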
