GenomeScope Output from PacBio Hifi ccs Reads is Confounding

schatzlab / genomescope

Fast genome analysis from unassembled short reads

Apache License 2.0

238 stars 56 forks source link

GenomeScope Output from PacBio Hifi ccs Reads is Confounding #126

Open foreignsand opened 2 months ago

foreignsand commented 2 months ago

I ran jellyfish count on ccs reads produced from PacBio HiFi sequencing as follows:

jellyfish count -C -m 31 -s 1000 -t 10 -o marten.pb.jf m64047_240219_192555.ccs.fasta

And subsequently ran jellyfish histo multiple times using various values (default, 1 million, 10 million) for -h:

jellyfish histo marten.pb.jf -t 16 > marten.pb.histo
jellyfish histo -h 1000000 marten.pb.jf -t 32 > marten.pb.h1m.histo
jellyfish histo -h 10000000 marten.pb.jf -t 32 > marten.pb.h10m.histo

The GenomeScope output for all 3 runs is very similar. The genome size estimate is roughly half what it should be, so half the haploid genome size, and the linear plot is completely bonkers:

GenomeScope

I'm very new to this, so if anyone has any advice as to what I might be doing wrong or what might be going on it would be much appreciated!

foreignsand commented 2 months ago

Now that I've been looking at this a bit, I realize I made a mistake with the jellyfish count run. My value for -s was way off. Although now I'm having a difficult time determining how to calculate a value for this parameter. The documentation says:

The size parameter (given with -s) is an indication of the number k-mers that will be stored in the hash. For sequencing reads, this size should be the size of the genome plus the k-mers generated by sequencing errors. For example, if the error rate is e (e.g.Illumina reads, usually e~1%), with an estimated genome size of G and a coverage of c, the number of expected k-mers is G+Gcek.

So if the estimated genome size is 2.2 Gb and the coverage (for my PacBio Hifi sequencing?) was ~80X, the error rate was 1%, and the k-mer size I'm using is 31, should the value for -s be 2200000000 + 2200000000 80 1 * 31 or 5458200000000? That can't be right. That seems completely insane...

foreignsand commented 2 months ago

Continuing on with this, I reran jellyfish count with -s 50G and got the same results.

GenomeScope_s50g

Attempting to run jellyfish count with -s 5000G and -s 500G failed due to memory issues, so I'm attempting it with -s 100G. If that doesn't produce something that makes more sense, I'm not entirely sure how to resolve this.

foreignsand commented 2 months ago

I'm still getting the same issue. I'd love to know why that might be.