navinlabcode / copykat

Question regarding "GMM definition mode" #100

Open sjspielman opened 10 months ago

sjspielman commented 10 months ago

Hi CopyKAT maintainers, thanks for this package! I'm posting this issue to learn more about one aspect of the method as described in the publication:

The cluster with minimal estimated variance is defined as the ‘confident diploid cells’ by following a strict classification criterion. Potential misclassifications may occur when the data have only a few normal cells or when the tumor cells have near-diploid genomes with limited copy number aberration (CNA) events. In this case, CopyKAT provides a ‘GMM definition’ mode to identify the diploid normal cells one by one, where a mixture of three Gaussian models of gene expression in single cells is assumed to represent genomic gains, losses and neutral states. A single cell is then defined as a confident diploid cell when genes in neutral states account for at least 99% of the expressed genes.
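To make the criterion in that paragraph concrete, here is a minimal Python sketch of one possible reading of it. It is an illustration under stated assumptions, not CopyKAT's actual implementation: the function names, the starting means of -0.2, 0 and 0.2, the `mu_cut` threshold for calling a component "neutral", and the plain EM fit are all hypothetical choices made for this example. Only the three-component mixture and the 99% cutoff come from the quoted text.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_gmm3(values, n_iter=100, sd_floor=1e-3):
    """Fit a 3-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, sds). The starting means stand for loss,
    neutral and gain; they are an assumption of this sketch.
    """
    n = len(values)
    start_sd = max(0.05, 0.5 * statistics.pstdev(values))
    weights = [1.0 / 3.0] * 3
    means = [-0.2, 0.0, 0.2]
    sds = [start_sd] * 3
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens)
            if total == 0.0:
                # Numerical underflow: fall back to equal responsibilities.
                resp.append([1.0 / 3.0] * 3)
            else:
                resp.append([d / total for d in dens])
        # M-step: update weight, mean and sd of each component.
        for k in range(3):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                weights[k] = 0.0  # component has effectively vanished
                continue
            mean_k = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var_k = sum(r[k] * (x - mean_k) ** 2
                        for r, x in zip(resp, values)) / nk
            weights[k] = nk / n
            means[k] = mean_k
            sds[k] = max(math.sqrt(var_k), sd_floor)
    return weights, means, sds


def neutral_fraction(values, mu_cut=0.05):
    """Total mixing weight of components whose mean lies within mu_cut of 0."""
    weights, means, _ = fit_gmm3(values)
    return sum(w for w, m in zip(weights, means) if abs(m) <= mu_cut)


def is_confident_diploid(values, min_neutral=0.99, mu_cut=0.05):
    """Apply the 99% neutral-state cutoff from the quoted paragraph."""
    return neutral_fraction(values, mu_cut) >= min_neutral
```

Here `values` would be one cell's per-gene relative expression, centered so that a copy-neutral gene sits near 0. A cell whose values all cluster around 0 passes the check, while a cell with a sizable group of genes shifted up or down does not.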

I am hoping to use CopyKAT on some pediatric scRNA-seq data, which has far fewer aberrations than an adult cancer sample. Since I expect misclassifications in pediatric data, I would like to understand how to enable the "GMM definition mode" referenced in this paragraph, but I don't see anything about this setting in the main copykat() function. Is this mode applied automatically by the package depending on certain internal results, or is there something I should specify when calling copykat() to invoke it? I've already seen that a correlation-based distance measure is probably preferable to Euclidean for my circumstances, so now I'm looking for other ways like this to help the CopyKAT algorithm work with my pediatric data.

Thanks very much for any advice here!