marbl / merqury

k-mer based assembly evaluation

CN designation clarification #97

Closed jwcodee closed 1 year ago

jwcodee commented 1 year ago

I have a question about CN designation. In the paper, there was a k-mer frequency histogram plot of the read set. The first mode was considered CN 1 and the second was CN 2. Is it as simple as that? Is the designation based just on relative frequency, or is there some model fitting?

arangrhie commented 1 year ago

Hi @jwcodee, the first piece of evidence is the expected haploid sequencing coverage, which is usually calculated as total sequenced bases / expected haploid genome size. For example, at 50x coverage the CN 2 peak will sit slightly below 50x. It's usually clearly visible in diploid genome k-mer frequency histograms. Cases where it is not so clear usually involve polyploidy or sample contamination.
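As a rough illustration of the coverage arithmetic described above, here is a minimal Python sketch. It is not part of Merqury. The function names, the input numbers (150 Gb of reads, a 3 Gb haploid genome, 150 bp reads, k = 21) and the error-free read-length correction `(L - k + 1) / L` are all assumptions chosen for the example; real data will shift the peaks further because of sequencing errors and bias.

```python
def haploid_coverage(total_bases: int, haploid_genome_size: int) -> float:
    """Expected sequencing coverage = total sequenced bases / haploid genome size."""
    return total_bases / haploid_genome_size


def expected_kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Approximate k-mer coverage for error-free reads of a fixed length.

    A read of length L contains L - k + 1 k-mers, so k-mer coverage is
    slightly lower than base coverage.
    """
    return base_coverage * (read_length - k + 1) / read_length


# Hypothetical inputs, for illustration only.
cov = haploid_coverage(total_bases=150_000_000_000, haploid_genome_size=3_000_000_000)
two_copy_peak = expected_kmer_coverage(cov, read_length=150, k=21)
one_copy_peak = two_copy_peak / 2  # heterozygous (1-copy) k-mers sit at about half

print(f"sequencing coverage: {cov:.1f}x")
print(f"expected 2-copy k-mer peak: ~{two_copy_peak:.1f}x")
print(f"expected 1-copy k-mer peak: ~{one_copy_peak:.1f}x")
```

With long reads the correction factor is close to 1, so the 2-copy peak lands only marginally below the sequencing coverage.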

There are several tools that do more sophisticated ploidy fitting, such as GenomeScope2.

-Arang

jwcodee commented 1 year ago

Right. I understand that. So using your example, k-mers that have a frequency of 75 will be assigned CN 3 and 100 will be assigned CN 4? My other question is how is the boundary decided? Do you simply use the minimum between the two peaks?

arangrhie commented 1 year ago

Merqury does not use CN from the reads except for the false duplication rate calculation, which is a separate script (not run as part of standard Merqury). I think you are asking about the spectrum-cn plots? Those CNs are the copy numbers found in the assemblies, not in the reads.
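To make the distinction concrete, here is a toy Python sketch of the idea behind a spectrum-cn style tabulation: each read k-mer is binned by how many times it occurs in the assembly, and its read multiplicity is then histogrammed within that bin. This is only an illustration, not Merqury's implementation (Merqury counts k-mers with Meryl). The function names and toy sequences are hypothetical, and canonical (strand-collapsed) k-mers are ignored for brevity.

```python
from collections import Counter, defaultdict


def count_kmers(seqs, k):
    """Count every k-mer across a list of sequences (forward strand only)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def spectrum_cn(read_counts, asm_counts):
    """Map assembly copy number -> {read multiplicity: number of distinct k-mers}.

    Copy number 0 holds k-mers seen only in the reads.
    """
    spectrum = defaultdict(Counter)
    for kmer, multiplicity in read_counts.items():
        cn = asm_counts.get(kmer, 0)  # copy number comes from the assembly
        spectrum[cn][multiplicity] += 1
    return spectrum


# Toy data, for illustration only.
k = 3
asm_counts = count_kmers(["ACGTACGA"], k)               # ACG occurs twice here
read_counts = count_kmers(["ACGTAC", "ACGTAC", "TTT"], k)
spectrum = spectrum_cn(read_counts, asm_counts)

for cn in sorted(spectrum):
    print(f"assembly CN {cn}: {dict(spectrum[cn])}")
```

Note that no thresholds on the read histogram are involved: the copy number label is simply the k-mer's count in the assembly.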

jwcodee commented 1 year ago

OK, yes, that explains it. I thought the read k-mer CN was used to infer the assembly k-mer CN. Thanks.