Closed jwcodee closed 1 year ago
Hi @jwcodee, the first piece of evidence is the expected haploid sequencing coverage, which is usually calculated as total sequenced bases / expected haploid genome size. For example, at 50x coverage the CN 2 peak will sit slightly below 50x. It is usually clearly visible in k-mer frequency histograms of diploid genomes. Cases where it is not so clear usually involve polyploidy or sample contamination.
There are several tools that do more sophisticated ploidy fitting, such as GenomeScope2.
-Arang
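To make the coverage reasoning above concrete, here is a minimal Python sketch. It is not Merqury's or GenomeScope2's code: the function names, the 150 bp read length, k = 21, and the toy histogram are all assumptions made for illustration. The k-mer coverage formula is the common approximation coverage * (L - k + 1) / L, which is one reason the peak lands slightly below the base coverage. The peak finder is a deliberately naive local-maximum scan and does no model fitting.

```python
def haploid_coverage(total_bases, haploid_genome_size):
    """Expected coverage = total sequenced bases / expected haploid genome size."""
    return total_bases / haploid_genome_size


def kmer_coverage(base_coverage, read_length, k):
    """Approximate k-mer coverage: each read of length L yields L - k + 1 k-mers."""
    return base_coverage * (read_length - k + 1) / read_length


def histogram_peaks(counts, window=3):
    """Naive peak finder (illustrative only, no model fitting).

    counts[i] is the number of distinct k-mers seen i times; index 0 is unused.
    """
    # Walk down the initial slope of low-frequency (likely erroneous) k-mers
    # until the first trough.
    start = 1
    while start + 1 < len(counts) and counts[start + 1] < counts[start]:
        start += 1
    peaks = []
    for i in range(start + 1, len(counts) - 1):
        lo = max(start, i - window)
        hi = min(len(counts), i + window + 1)
        # A peak is the maximum of its window and rises above the error trough.
        if counts[i] == max(counts[lo:hi]) and counts[i] > counts[start]:
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    # Assumed inputs: 150 Gbp of reads, 3 Gbp haploid genome, 150 bp reads, k = 21.
    cov = haploid_coverage(150_000_000_000, 3_000_000_000)
    print(cov)                          # 50.0
    print(kmer_coverage(cov, 150, 21))  # a little over 43, i.e. below 50x

    # Toy diploid-like histogram (made-up numbers, not real data).
    toy = [0, 900, 400, 150, 60, 40, 55, 90, 160, 240, 300, 330, 300, 240,
           170, 120, 100, 110, 150, 220, 310, 400, 460, 480, 460, 400, 310,
           220, 140, 80, 40, 20, 10, 5, 2, 1, 0]
    print(histogram_peaks(toy))         # [11, 23]
```

In the toy histogram the second peak sits at roughly twice the multiplicity of the first, which is the pattern a clean diploid read set would be expected to show. Real histograms are noisier, so a fitted model such as the one GenomeScope2 uses is the safer choice in practice.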
Right. I understand that. So using your example, k-mers that have a frequency of 75 will be assigned CN 3 and 100 will be assigned CN 4? My other question is how is the boundary decided? Do you simply use the minimum between the two peaks?
Merqury does not use CN from the reads, except in the false duplication rate calculation, which is a separate script (not run as part of standard Merqury). I think you are asking about the spectrum-cn plots? Those CNs are the copy numbers found in the assemblies, not in the reads.
Ok yes that explains it. I thought the read k-mer CN was used to infer assembly k-mer CN. thanks
I have a question about how CN is designated. In the paper, there was a k-mer frequency histogram plot of the read set. The first mode was considered CN 1 and the second CN 2. Is it as simple as that? Is the designation based just on relative frequency, or is there some model fitting?