immunogenomics / symphony

Efficient and precise single-cell reference atlas mapping with Symphony
GNU General Public License v3.0

Choice of the value of the k parameter in the knnPredict function #12

Closed: GAgafencu closed this issue 3 years ago

GAgafencu commented 3 years ago

Thank you for developing Symphony and for the accompanying vignettes. How did you and the team arrive at the value k=5 used for the knnPredict function in the preprint and in the vignettes? I realize this is not strictly an issue, but I think it would help the community of Symphony users to understand how this value was chosen and how to pick k for their own use cases.

Thank you for your time and help, Grigore

joycekang commented 3 years ago

Hi Grigore,

That is a great question. We tested several values of k (k=5, 10, 30, 50) and found that prediction accuracy was relatively stable across them for the fetal liver example (the result will be shown in an updated version of the manuscript).

For k-NN prediction, we recommend that users set k so that it is ideally no larger than the number of cells in the rarest cell type of the reference. For example, if the reference contains only 10 cells of a rare cell type, we recommend setting k no higher than 10. Under a majority-vote k-NN classifier, this ensures that rare cell types in the reference still have a chance of being predicted.

Importantly, we'd like to emphasize that k-NN is one option for downstream inference (and perhaps the simplest and most intuitive), but not the only one. Other classifiers (e.g. SVM, multinomial logistic regression) can be trained on the reference cells to predict cell types (or other annotations), using the harmonized PCs as input.
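To make the reasoning behind capping k concrete, here is a minimal sketch in plain Python. It is not Symphony's knnPredict and does not use its API. The helper names (`max_recommended_k`, `knn_predict`), the 2-D toy coordinates, and the labels are all hypothetical, chosen only to show how a majority vote can drown out a rare cell type once k exceeds the number of cells of that type.

```python
from collections import Counter
import math


def max_recommended_k(reference_labels):
    """Return the size of the rarest cell type (the suggested upper bound for k)."""
    return min(Counter(reference_labels).values())


def knn_predict(ref_embedding, ref_labels, query, k):
    """Majority-vote k-NN using Euclidean distance in the embedding space."""
    # Sort reference cell indices by distance to the query cell.
    order = sorted(
        range(len(ref_embedding)),
        key=lambda i: math.dist(ref_embedding[i], query),
    )
    # Count labels among the k nearest reference cells and take the top one.
    votes = Counter(ref_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy reference: 20 "common" cells along one axis,
# plus 3 "rare" cells clustered far away.
ref_embedding = [(0.1 * i, 0.0) for i in range(20)]
ref_embedding += [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
ref_labels = ["common"] * 20 + ["rare"] * 3

# A query cell sitting inside the rare cluster.
query = (5.0, 5.05)

k_cap = max_recommended_k(ref_labels)
pred_capped = knn_predict(ref_embedding, ref_labels, query, k=k_cap)
pred_large = knn_predict(ref_embedding, ref_labels, query, k=10)

print(k_cap, pred_capped, pred_large)
```

In this toy setup the rare type has only 3 reference cells, so with k=10 at least 7 of the votes must come from other types and the rare label cannot win, even though the query lies in the middle of the rare cluster. With k capped at 3, all votes can come from the rare cells.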

Hopefully that helps! Joyce

GAgafencu commented 3 years ago

Hi Joyce, Thanks a lot for the very helpful reply. I was thinking of increasing k up to the threshold you mentioned, but I'm quite new to this field, so it is good to see that my rationale was not unrealistic. Thanks again, and apologies for the delay in replying.

Best wishes, Grigore