schatzlab / genomescope

Fast genome analysis from unassembled short reads
Apache License 2.0
249 stars 56 forks source link

Please help me to understand output #135

Open OrsonMM opened 1 month ago

OrsonMM commented 1 month ago

Dear team Genomescope2.

I am using the tools to understand why I have a bad novo assembly, with more than 1M of contigs and small.

I have illumina sequence of Cavia Porcellus (guinea pig), genome size in GENBANK is around 3 GB. is diploid (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_034190915.1/)

I had this output in genomescope2:

GENOME_01 :

Rscript genomescope02.R -i $1.histo -k $2 -p 2 -n $1 -o $1_$2_dir
GenomeScope version 2.0
input file = genome_01.histo
output directory = genome_01_dir
p = 2
k = 21
name prefix = genome_01

property                      min               max
Homozygous (aa)               60.4219%          100%
Heterozygous (ab)             0%                39.5781%
Genome Haploid Length         515,546,895 bp    531,285,372 bp
Genome Repeat Length          254,706,772 bp    262,482,392 bp
Genome Unique Length          260,840,123 bp    268,802,980 bp
Model Fit                     48.8344%          74.6587%
Read Error Rate               2.91981%          2.91981%

SA42913_transformed_log_plot

I interpret, that Genomescope02 gives me that I have a sequencing that represents 531,285,372 bp of genomic size. In addition, the sequencing is at ~13x. And that I have a high error rate in my reads of almost 3%.

If the total genome size is 3GB in genbank, that means that having 531 Mpb, with 10x coverage, if I would have good information to get a not so bad denovo assembly.

Please, any suggestion.

mschatz commented 3 weeks ago

Thanks for writing. Your kmer profile shows a bad fit between the kmer counts you have in the reads (in blue) and the GenomeScope model (in black). As such, I wouldnt trust the genome size estimate or the predicted level of heterozygosity. As you mention the major challenge here is the low coverage: focusing on the blue curve (from your reads) you have a peak at about 10x coverage which is too low for a reliable GenomeScope estimate. This will also lead to a poor assembly. My strong recommendation would be to seek out more coverage. I would also strongly recommend PacBio HiFi reads over short reads for a mammalian de novo assembly.

Good luck!

Mike

On Mon, Aug 12, 2024 at 11:08 AM Orson @.***> wrote:

Dear team Genomescope2.

I am using the tools to understand why I have a bad novo assembly, with more than 1M of contigs and small.

I have illumina sequence of Cavia Porcellus (guinea pig), genome size in GENBANK is around 3 GB. ( https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_034190915.1/)

I had this output in genomescope2:

GENOME_01 :

Rscript genomescope02.R -i $1.histo -k $2 -p 2 -n $1 -o $1_$2_dir

GenomeScope version 2.0 input file = genome_01.histo output directory = genome_01_dir p = 2 k = 21 name prefix = genome_01

property min max Homozygous (aa) 60.4219% 100% Heterozygous (ab) 0% 39.5781% Genome Haploid Length 515,546,895 bp 531,285,372 bp Genome Repeat Length 254,706,772 bp 262,482,392 bp Genome Unique Length 260,840,123 bp 268,802,980 bp Model Fit 48.8344% 74.6587% Read Error Rate 2.91981% 2.91981%

SA42913_transformed_log_plot.png (view on web) https://github.com/user-attachments/assets/e4a6e36a-530a-4e71-bd73-d8fac6cb1f98

I interpret, that Genomescope02 gives me that I have a sequencing that represents 531,285,372 bp of genomic size. In addition, the sequencing is at ~13x. And that I have a high error rate in my reads of almost 3%.

If the total genome size is 3GB in genbank, that means that having 531 Mpb, with 10x coverage, if I would have good information to get a not so bad denovo assembly.

Please, any suggestion.

— Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/135, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABP342N7UEXWYSVPHFIFGLZRDFYBAVCNFSM6AAAAABMMMCSPOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGQ3DCMRVG43DINQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>