schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
258 stars 57 forks source link

Estimate good max k-mer coverage #104

Open wqshi opened 1 year ago

wqshi commented 1 year ago

Thanks for developing the easy use software!

We found the genome size estimate of our arthropod assembly was quite dependent the max k-mer coverage . The genome size is 30% larger after increasing the max coverage from 10,000 to 1,000,000. I'm wondering how to choose this parameter?

max coverage= 10,000: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=l6usDX5yKQ0JcJlleCyi image

max coverage= 1,000,000: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=sXLrRerDj6WyoGgKPy8P image

I notice that the uniq estimation would decrease from 30.4% to 21.7% after increasing the cut off. This means that the genome has very high proportion (~80%) of repetitive elements, which is not expected in the arthropod family. Could we use the uniq as a guide to select a reasonable genome size if we have an expected proportion of repetitive elements (~60%)?

Another issue is the high heterogeneous rate of 3.67% and mini mode fit <40%. We did blastn 100,000 reads, and find the second largest group of hits are from plants. Does this impact the genome size estimation? Do we have to remove these unwanted reads?

Thanks for your attention!

Best regards, Wenqiang RP3-2M.100w.histo.txt

tbenavi1 commented 1 year ago

Hello,

For your data I would suggest using the max coverage of 1,000,000. For some other datasets there is evidence of contamination (for example, a large peak at high coverages). Looking at your log plot, there is no evidence of large contamination, thus we suggest max coverage of 1,000,000. If there were evidence of contamination, we would recommend lowering the max coverage below the contamination peak so as to not artificially inflate the genome size estimate.

Looking at the linear plot, it seems that the genomescope model is fitting your data quite well. Thus, I have confidence in the estimated uniqueness and heterozygosity. I don't believe that the reads from plants are significantly affecting the model.

Do you have any other information for your expected genome size and/or heterozygosity for this organism? Thanks.

wqshi commented 1 year ago

Thanks for the quick reply!

We don't have direct data related to this organism. We assembled an Nanopore genome, which is 1.34Gb. The Busco completeness is 95%, and mapping rate of short illumina reads and genome coverage are both > 95%.

Other information is from other species in the same genus. The flow cytometry results of two species is 1.1-1.3G. And the het rate of one species is < 3%. The expected repetitive rate in the family would be <70%.

I'm wondering how to explain the discrepancy between the estimated size and assembled size? Which one is true?

Wenqiang

wqshi commented 1 year ago

FYI, the busco of the assembly is C:94.8%[S:92.6%,D:2.2%],F:2.4%,M:2.8%,n:1013

Some threads discussed the "suspicious rise in the kmer coverage for very high frequency kmers" (https://github.com/schatzlab/genomescope/issues/100). In our results, there is also a rise in the fouth plot. Will this be a issue?

image

Wenqiang