schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
258 stars 56 forks source link

Model and observations don't converge #127

Closed malvaradol closed 6 months ago

malvaradol commented 7 months ago

Hi there,

I was wondering if I could get some help with figuring out what is going on here. I'm running genomescope on a dataset of HiFi reads of ~60X coverage, here's the script I'm using:

k=31

mkdir -p /sc/arion/projects/MML/triatoma/draft_assemblies/assembly_hifi/0_genome_properties/kmc_temp_dir_31

kmc -k$k -t16 -m160 -ci1 -cs10000 /sc/arion/projects/MML/triatoma/raw_data/hifi_data/fastq_from_bam/hifi_raw_reads_tdimi.fq.gz output_31mer /sc/arion/projects/MML/triatoma/draft_assemblies/assembly_hifi/0_genome_properties/kmc_temp_dir_31/

kmc_tools transform output_31mer histogram reads.histo -cx100000

genomescope2 -i reads.histo -o output_ploidy2 -k $k -p 2 1> genomescope_ploidy2.out 2> genomescope_ploidy2.err

I already tried to change the -cx value from 10.000 to -cx100000, but it did not provide any changes. Here's the plot I get:

transformed_linear_plot

Also, I tried with 21 kmers, here's the plot:

transformed_linear_plot

Here are the .histo files if they can be of any help:

reads.histo.txt reads_21.histo.txt

Appreciate any help!!

malvaradol commented 7 months ago

I also forgot to mention that the size of the genome estimation for both graphs is very different to what's been reported. The genome size reported spans between 925Mbp and 1.05Gb as this organism has multiple cytotypes.

mschatz commented 6 months ago

There is something unusual about your histogram file - all of the odd kmer counts have a value of zero:

$ head reads.histo.txt

1 0 2 273466582 3 0 4 101884369 5 0 6 119051792 7 0 8 135336019 9 0 10 142023371 11 0 12 138532293 13 0 14 125153138 15 0 16 106433147 17 0 18 85885224 19 0 20 68523317 21 0 22 56314167 23 0 24 50562698 25 0 26 50043002 27 0 28 52767577 29 0 30 56574373

Im guessing there was a technical error such as your reads file was accidentally duplicated so that every read occurs twice - this will take kmers that should occur once and make them occur twice, kmers that should occur twice and make them occur 4 times, etc. Can you double check your input files?

Good luck!

Mike

On Thu, Apr 18, 2024 at 11:45 AM Mateo Alvarado @.***> wrote:

I also forgot to mention that the size of the genome estimation for both graphs is very different to what's been reported. The genome size reported spans between 925Mbp and 1.05Gb as this organism has multiple cytotypes.

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/127#issuecomment-2064301336, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP345G5ONB54G6BRRDRVLY57TABAVCNFSM6AAAAABGNRC4E6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRUGMYDCMZTGY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

malvaradol commented 6 months ago

Hi Mike,

You are definitely right! I checked other issues to see what could have been happening, and I had the same problem where transforming a .bam file to .fastq duplicated my sequences for no reason. As suggested by the person in the issue, transforming the files with pbtk did the job.

Thanks for your response.