Similarity values are not between 0 and 1

hemberg-lab / scmap

A tool for unsupervised projection of single cell RNA-seq data

http://bioconductor.org/packages/scmap

GNU General Public License v3.0


Similarity values are not between 0 and 1 #28

Open hayfre opened 3 years ago

hayfre commented 3 years ago

Hi there, I have been using scmap cell2cluster to annotate both human and mouse data sets. The cell type annotation results that we get seem to make sense but the similarity values are not in the expected range of 0 to 1. This seems to be a bug in scmap-cell. When running a test where the reference dataset cells are split into test and train data, the values are in the correct range for all 3 settings (cluster, cell, cell2cluster). However, when applying our own query data the problem occurs with cell (but not cluster) and is then propagated to cell2cluster. We have experienced this issue with 2 unique datasets using 3 different reference datasets. I would appreciate your help to address this issue!

For reference:
scmap version 1.8.0
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

sagnikbanerjee15 commented 3 years ago

Hello,

I am facing the exact same problem. @hayfre please let me know if you have been able to solve it.

Thanks.

hayfre commented 3 years ago

Hi @sagnikbanerjee15, No, I have unfortunately not had time to look into this further.

sagnikbanerjee15 commented 3 years ago

Hi @hayfre,

I think I have figured out the error. The tool does not seem to have a bug; instead, I found an inconsistency in the gene names of my training data. For some reason, the genes in one of the reference datasets were denoted by a string concatenating the gene_id and the gene_name. The similarity scores were greater than 1 for this particular dataset. When I intentionally projected the same dataset onto itself, it returned values between 0 and 1.

Thank you.
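A quick way to spot this kind of naming mismatch is to measure how many query gene names appear in the reference before and after normalizing the labels. The following is only a minimal sketch in plain Python, not scmap code: the helper names are hypothetical, the example labels are illustrative, and the `_` separator is an assumption, since the thread does not say how gene_id and gene_name were joined.

```python
def strip_gene_id_prefix(name, sep="_"):
    """Return the gene_name part of a 'gene_id<sep>gene_name' label.

    Labels that do not contain the separator are returned unchanged.
    The separator is an assumption; adjust it to match your data.
    """
    _, found, rest = name.partition(sep)
    return rest if found else name


def overlap_fraction(reference_genes, query_genes):
    """Fraction of query gene names that also occur in the reference."""
    ref, qry = set(reference_genes), set(query_genes)
    if not qry:
        return 0.0
    return len(ref & qry) / len(qry)


# Illustrative labels: the reference uses concatenated names, the query does not.
reference_genes = [
    "ENSG00000141510_TP53",
    "ENSG00000012048_BRCA1",
    "ENSG00000146648_EGFR",
]
query_genes = ["TP53", "BRCA1", "MYC"]

before = overlap_fraction(reference_genes, query_genes)
cleaned = [strip_gene_id_prefix(g) for g in reference_genes]
after = overlap_fraction(cleaned, query_genes)

print(f"overlap before cleaning: {before:.2f}")
print(f"overlap after cleaning:  {after:.2f}")
```

An overlap close to zero between two datasets from the same species is a strong hint that the naming conventions differ and should be checked before projecting.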

LisaBast commented 3 years ago

Thanks for the hint about inconsistent gene names, @sagnikbanerjee15. I had been trying to solve this for some time and could finally get rid of values outside the [0,1] interval. In my case the gene naming convention did not differ between the two sce objects, but the query data contained some genes that were not in the reference, and vice versa. Making sure that the sce objects for reference and query contain only the genes present in both data sets solved it.
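The gene filtering described above can be sketched as follows. This is a language-agnostic illustration in plain Python rather than scmap code: the function name is hypothetical, and each dataset is represented as a simple dict mapping gene name to expression values instead of a real sce object. In R, the analogous step would presumably be to subset both objects by `intersect(rownames(reference), rownames(query))`.

```python
def restrict_to_shared_genes(reference, query):
    """Keep only genes present in both datasets, in the same order.

    Both arguments map gene name -> list of expression values.
    Returns the two filtered dicts and the list of shared genes
    (in the order they appear in the reference).
    """
    shared = [gene for gene in reference if gene in query]
    ref_sub = {gene: reference[gene] for gene in shared}
    qry_sub = {gene: query[gene] for gene in shared}
    return ref_sub, qry_sub, shared


# Toy data: each dataset has one gene that the other lacks.
reference = {"TP53": [5, 0, 2], "BRCA1": [1, 3, 0], "EGFR": [0, 0, 7]}
query = {"MYC": [4, 4], "TP53": [2, 1], "EGFR": [0, 9]}

ref_sub, qry_sub, shared = restrict_to_shared_genes(reference, query)

# Report what was dropped, which helps to confirm the mismatch.
only_in_reference = sorted(set(reference) - set(query))
only_in_query = sorted(set(query) - set(reference))

print("shared genes:", shared)
print("only in reference:", only_in_reference)
print("only in query:", only_in_query)
```

Filtering both objects before any feature selection or index building means that every downstream step sees an identical gene set.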