marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

how can I get spectra-cn reads only plot #67

Closed olechnwin closed 2 years ago

olechnwin commented 2 years ago

Hello, Thanks for creating this very useful tool. I was expecting to get the spectra-cn reads only plot like figure 1a in your paper but I can't find it in my output folder. For your convenience, I'm pasting the plot from the paper below so you know which plot I'm referring to.

image

I ran meryl and merqury using the following code:

meryl k=21 threads=30 count output A673_HiFi_CCS.4-cells.meryl A673_HiFi_CCS.4-cells.fastq.gz
/opt/miniconda3/envs/local_merqury_env/share/merqury/merqury.sh A673_HiFi_CCS.4-cells.meryl hifiasm_bionano_salsa_1.fasta hifiasm_bionano_salsa_2.fasta hifiasm_bionano_salsa

How do I generate the spectra-cn reads only plot? If I do $MERQURY/merqury.sh -h I got 2020-01-29. Thank you in advance for your help. edit: forgot to mention I have the spectra-cn for each assembly and for the combined assembly but not the reads only plot.

arangrhie commented 2 years ago

Hello @olechnwin !

The read-only plot is not generated by default. The equivalent output of Merqury is the .fill.png version of the spectra-cn plot if you are interested in quickly seeing the entire kmer frequency distribution.

The plot in the paper figure can be made using $MERQURY/plot/plot_spectra_cn.R. I made the custom plotting a little easier by making the histograms with a desired name.

Try this:

$MERQURY/plot/to_hist_for_plotting.sh reads.meryl read-only > reads.hist
Rscript $MERQURY/plot/plot_spectra_cn.R -f reads.hist -o test -t fill

The XY axis max value to plot can be adjusted with -m and -n.

For more options / help messages, try

$MERQURY/plot/to_hist_for_plotting.sh
Rscript $MERQURY/plot/plot_spectra_cn.R --help
olechnwin commented 2 years ago

@arangrhie, Thank you so much for your super quick reply and for the code. Super helpful! I thought from reading merqury paper you need to see this plot so you can figure out the kmers that are 1-copy and 2-copy and make sure the same kmers are in spectra-cn plot. But, maybe not?

arangrhie commented 2 years ago

Thanks for reading the paper very carefully :)

Yes, it is always recommended at best to first check the kmer spectrum of the reads. I usually do this with GenomeScope2, once the kmer db is built.

meryl histogram reads.meryl > reads_gs.hist

and use the histogram for genomescope2.0/genomescope.R.

I haven't put much effort to generate the reads-only plot into Merqury, as GenomeScope was generating these plots for me + other informative stats (estimated genome size, heterozygosity, and repeats etc.).

olechnwin commented 2 years ago

Thank you so much!! Really appreciate your detailed replies and help. You made it easy to use your tools :)

olechnwin commented 2 years ago

I hope you don't mind me posting this question here. I was able to get the reads-only kmers, but only see 1 peak instead of 2 peaks that I expected since this is a diploid human cancer cell line. image Here is the spectra-cn generated for the two haplotypes assembled genome: image I am not sure what to make of this. The haplotype specific kmers seem to be much lower than expected. Is this because I am using HiFi reads which I know you recommended to use short reads instead?

arangrhie commented 2 years ago

Human genomes are relatively highly homozygous. The NA12878 example (Fig. 3d) in Merqury paper looks similar to yours. It is just a bit more difficult to see the 1-copy peak in your plot because of the lower coverage, because the erroneous k-mer portion bleeds into the true 1-copy peak. You could compare your GenomeScope heterozygosity estimates to my NA12878 set, which was estimated at 0.12%.

The assembly spectra-cn plot looks good to me. The result looks actually bit surprising, I guess your cancer cell line is relatively stable with little anuploidy?

It is best to use Illumina reads for evaluation, or using a hybrid approach, as the hifi specific errors won't get captured when evaluating hifi assemblies with hifi k-mers. If I don't have them, I'd just go with what I have.

olechnwin commented 2 years ago

@arangrhie,

Sorry for my late reply. I somehow didn't press "comment" to submit this. Thank you so much for your insights and suggestions.

Yes. My cancer cell line is known to be mostly diploid, relatively quiet with few mutations. Why did you say the result a bit surprising, what things in the plot that indicate this?

I'll use Illumina reads for evaluation. Many thanks!

arangrhie commented 2 years ago

I was expecting some more peaks coming from possible re-arrangements and anuploidy. Seems like it's not the case for your cell line. Wish all the best for your research!