Open foreignsand opened 2 months ago
Now that I've been looking at this a bit, I realize I made a mistake with the jellyfish count
run. My value for -s
was way off. Although now I'm having a difficult time determining how to calculate a value for this parameter. The documentation says:
The size parameter (given with -s) is an indication of the number k-mers that will be stored in the hash. For sequencing reads, this size should be the size of the genome plus the k-mers generated by sequencing errors. For example, if the error rate is e (e.g.Illumina reads, usually e~1%), with an estimated genome size of G and a coverage of c, the number of expected k-mers is G+Gcek.
So if the estimated genome size is 2.2 Gb and the coverage (for my PacBio Hifi sequencing?) was ~80X, the error rate was 1%, and the k-mer size I'm using is 31, should the value for -s
be 2200000000 + 2200000000 80 1 * 31 or 5458200000000? That can't be right. That seems completely insane...
Continuing on with this, I reran jellyfish count
with -s 50G
and got the same results.
Attempting to run jellyfish count
with -s 5000G
and -s 500G
failed due to memory issues, so I'm attempting it with -s 100G
. If that doesn't produce something that makes more sense, I'm not entirely sure how to resolve this.
I'm still getting the same issue. I'd love to know why that might be.
I ran jellyfish count on ccs reads produced from PacBio HiFi sequencing as follows:
And subsequently ran jellyfish histo multiple times using various values (default, 1 million, 10 million) for -h:
The GenomeScope output for all 3 runs is very similar. The genome size estimate is roughly half what it should be, so half the haploid genome size, and the linear plot is completely bonkers:
I'm very new to this, so if anyone has any advice as to what I might be doing wrong or what might be going on it would be much appreciated!