snap-stanford / mars

Discovering novel cell types across heterogeneous single-cell experiments
MIT License

mars.train reports inaccurate performance scores? #24

Closed Winbuntu closed 3 years ago

Winbuntu commented 3 years ago

Hi, I am evaluating the performance of MARS, but I have a question about how MARS computes the performance score in evaluation mode:

mars = MARS(n_clusters, params, labeled_exp, unlabeled_exp, pretrain_data)
adata, landmarks, scores = mars.train(evaluation_mode=True)

Here, scores holds the performance scores, and this is the code that computes them: https://github.com/snap-stanford/mars/blob/master/model/metrics.py

I noticed that in the above-mentioned code, when computing the score, MARS first uses the Hungarian algorithm to match the predicted labels to the original labels. If my understanding is correct, this Hungarian matching step finds the best correspondence between the original cell type labels and the predicted labels, such that the number of correctly predicted labels is maximized.
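For reference, here is a minimal sketch of what I understand that computation to be. It is not the repository's exact code in model/metrics.py, just the standard "clustering accuracy" recipe using scipy's linear_sum_assignment:

import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # y_true and y_pred are assumed to be integer-coded labels in 0..K-1.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # Contingency table: contingency[i, j] counts how often predicted cluster i
    # co-occurs with true label j.
    contingency = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    # Hungarian algorithm on the negated counts finds the one-to-one mapping of
    # clusters to labels that maximizes the total number of agreements.
    row_ind, col_ind = linear_sum_assignment(-contingency)
    return contingency[row_ind, col_ind].sum() / y_true.size

# Toy usage: cluster ids are arbitrary, yet the optimal match recovers 5/6 correct.
print(cluster_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 1]))  # 0.833...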

This Hungarian matching might be fine for a quick evaluation, because it lets the Hungarian algorithm find the optimal match between the original cell types and the predicted labels; otherwise we would have to run mars.name_cell_types to find the match ourselves. However, in some cases this Hungarian matching could produce an inaccurate performance score, and could even overestimate the performance. I am attaching a working example (a zipped Jupyter notebook, compute_score.zip) here FYI, where you can see how the Hungarian matching influences the performance evaluation.

Therefore I believe the way MARS evaluates its own performance might be inaccurate, and it could report misleading performance scores when run in evaluation mode. I also noticed this score is used in many places, including the online tutorials (kolod_pollen_bench.ipynb and cellbench.ipynb) and even in the leave-one-tissue-out experiment reported in the original paper (https://github.com/snap-stanford/mars/blob/master/main_TM.py), suggesting the authors may have used it not only as a quick evaluation but also as the formal evaluation of model performance.

Please correct me if my understanding of the Hungarian matching/score calculation is wrong :-) ! Also, could the authors share how they evaluated MARS's performance in the leave-one-tissue-out experiment, if they used a more accurate way to calculate performance scores than the one shown here: https://github.com/snap-stanford/mars/blob/master/main_TM.py

mbrbic commented 3 years ago

To report classification metrics on clustering/unsupervised tasks, the optimal assignment problem needs to be solved first. This is a combinatorial optimization problem that the Hungarian algorithm solves in polynomial time. This is the standard approach for evaluating the performance of clustering algorithms (e.g. http://proceedings.mlr.press/v48/xieb16.pdf). We report 5 different evaluation metrics, and the evaluation is definitely not inaccurate.
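As an illustration (with sklearn standing in for the repository's metrics code, and the specific metric choice here being illustrative, not a restatement of the five MARS reports): partition-based clustering metrics ignore cluster ids entirely, while element-wise classification metrics only make sense after clusters are relabeled via the optimal assignment.

import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])   # same partition as y_true, different cluster ids
y_perm = np.array([1, 0, 2])[y_pred]    # relabel clusters: 0 -> 1, 1 -> 0, 2 -> 2

# Partition-based scores are unchanged by relabeling (both are 1.0 here).
print(adjusted_rand_score(y_true, y_pred), adjusted_rand_score(y_true, y_perm))
print(normalized_mutual_info_score(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_perm))

# Element-wise accuracy depends on the arbitrary cluster ids (0.33 vs 1.0 here),
# which is why the Hungarian assignment is applied before reporting accuracy/F1.
print((y_pred == y_true).mean(), (y_perm == y_true).mean())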

Winbuntu commented 3 years ago

Thanks for your reply. But now I have another question: the Hungarian algorithm does not allow us to label clusters with cell types for completely unseen data (data for which no cell type information is available, not even the proportions of cell types in the dataset), right? Because the Hungarian algorithm can only find the best mapping/assignment between cluster labels and cell types if we have the ground truth (which is unavailable for real-world unlabeled data).

mbrbic commented 3 years ago

Yes, it is used just for evaluation when you have ground truth annotations. If you do not have annotations, you should run MARS with evaluation_mode=False. In that case, I suggest you find differentially expressed genes for the clusters you obtain with MARS and check whether they make sense/agree with known marker genes, for example as sketched below.
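A minimal sketch of that sanity check using scanpy; it assumes adata is the AnnData object returned by mars.train(evaluation_mode=False) and that the MARS cluster assignments are stored in adata.obs under a key such as 'MARS_labels' (substitute whatever key actually holds them in your object):

import pandas as pd
import scanpy as sc

# Assumed key for the MARS cluster assignments; adjust to your adata.obs.
adata.obs['MARS_labels'] = adata.obs['MARS_labels'].astype('category')

# Rank differentially expressed genes for each MARS cluster against the rest.
sc.tl.rank_genes_groups(adata, groupby='MARS_labels', method='wilcoxon')

# Top 10 candidate marker genes per cluster; compare these with known marker genes.
print(pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(10))

# Or inspect the rankings visually.
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)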