openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

[Label projection] scANVI task sees test cells during training #771

Closed (mxposed closed this issue 1 year ago)

mxposed commented 1 year ago

In scArches+scANVI, the dataset is split into train/test, and only the train part is used to train the scVI/scANVI model: https://github.com/openproblems-bio/openproblems/blob/main/openproblems/tasks/label_projection/methods/scvi_tools.py#L102

In contrast, the scANVI method trains on all cells: https://github.com/openproblems-bio/openproblems/blob/main/openproblems/tasks/label_projection/methods/scvi_tools.py#L68

If nobody has objections, I will make the scANVI method train only on the train part of the data, and then get the latent dimensions and predictions for the test part.

cc @LuckyMD @adamgayoso

adamgayoso commented 1 year ago

scANVI actually doesn't see the test set labels. As implemented, it is a semi-supervised method.

see:

https://github.com/openproblems-bio/openproblems/blob/3d8964a6c02496c0c604f0b1ddadc40589ca43a8/openproblems/tasks/label_projection/methods/scvi_tools.py#L66

https://github.com/openproblems-bio/openproblems/blob/3d8964a6c02496c0c604f0b1ddadc40589ca43a8/openproblems/tasks/label_projection/methods/scvi_tools.py#L81
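To make the semi-supervised setup concrete, here is a minimal sketch of hiding test labels before training. It is not the code at the links above: the `mask_test_labels` helper and the `"Unknown"` placeholder are assumptions chosen for illustration. The point is only that test cells stay in the training input while their labels are replaced by a placeholder category.

```python
# Hypothetical placeholder category for cells whose label is hidden.
UNLABELED = "Unknown"


def mask_test_labels(labels, is_train, unlabeled=UNLABELED):
    """Return a copy of `labels` with every test cell set to the placeholder.

    The cells themselves (their expression) are untouched; only the
    labels of test cells are hidden from the model.
    """
    return [lab if train else unlabeled for lab, train in zip(labels, is_train)]


labels = ["B cell", "T cell", "B cell", "T cell"]
is_train = [True, True, False, False]

masked = mask_test_labels(labels, is_train)
print(masked)
```

A semi-supervised model would then be trained on all four cells using `masked`, and evaluated on how well it recovers the hidden labels of the last two.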

mxposed commented 1 year ago

I agree, it doesn't see the labels, but it sees the cells. Does my concern make sense?

mxposed commented 1 year ago

There are 2 modes of operation for the label projection task, I guess:

  1. The model can be pre-trained without test data, and then applied to test data;
  2. The model cannot be pre-trained without test data, and needs to be trained with train+test data.

I assumed this task was covering only the 1st type, but it doesn't have to. However, I'd like to make a clear distinction between the two types, and I think that the current scANVI method implementation falls into the second category.
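The two modes can be sketched with a toy example. This is an illustration only, not scANVI: it assumes one-dimensional "expression" values, a nearest-centroid classifier, and simple self-training as a stand-in for a semi-supervised model. All function names and data are made up. In both modes the test labels are never used; the difference is whether test cells take part in training.

```python
def centroids(values, labels):
    """Mean value per label."""
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        sums[lab] = sums.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


def predict(cents, values):
    """Assign each value to the label of its nearest centroid."""
    return [min(cents, key=lambda lab: abs(v - cents[lab])) for v in values]


def mode_1(train_x, train_y, test_x):
    """Mode 1: fit on train cells only, then apply to test cells."""
    return predict(centroids(train_x, train_y), test_x)


def mode_2(train_x, train_y, test_x, n_rounds=5):
    """Mode 2: test cells (without labels) take part in training.

    Self-training: pseudo-label the test cells, refit the centroids on
    train + test cells, and repeat.
    """
    pseudo = predict(centroids(train_x, train_y), test_x)
    for _ in range(n_rounds):
        cents = centroids(train_x + test_x, train_y + pseudo)
        pseudo = predict(cents, test_x)
    return pseudo


# Toy data: the test batch is shifted relative to the train batch.
train_x, train_y = [0.0, 1.0, 4.0, 5.0], ["A", "A", "B", "B"]
test_x = [2.0, 2.8, 6.0, 7.0]  # true labels: A, A, B, B

pred_1 = mode_1(train_x, train_y, test_x)
pred_2 = mode_2(train_x, train_y, test_x)
print(pred_1, pred_2)
```

On this toy data the shifted second test cell is assigned differently by the two modes, which is why it matters to state which mode a method uses when comparing results.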

adamgayoso commented 1 year ago

I agree we should separate these things.

For example, de novo integration with Scanorama on all data, followed by training a classifier on the training embeddings, would fall into (2) in this case.
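A rough sketch of that pipeline, under heavy assumptions: per-batch mean centering stands in for the integration step (it is not Scanorama), the data are one-dimensional, and the classifier is nearest-centroid. All names are hypothetical. The embedding step needs the expression of every cell, test cells included, while the classifier is fit on training embeddings only, which is why this falls into (2).

```python
def integrate(values, batches):
    """Stand-in for integration: center each batch on its own mean.

    This step uses the expression of ALL cells, train and test alike.
    """
    totals, counts = {}, {}
    for v, b in zip(values, batches):
        totals[b] = totals.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    return [v - totals[b] / counts[b] for v, b in zip(values, batches)]


def fit_centroids(embeddings, labels):
    """Fit the classifier: one mean embedding per label."""
    groups = {}
    for e, lab in zip(embeddings, labels):
        groups.setdefault(lab, []).append(e)
    return {lab: sum(es) / len(es) for lab, es in groups.items()}


def classify(cents, embeddings):
    """Assign each embedding to the label of its nearest centroid."""
    return [min(cents, key=lambda lab: abs(e - cents[lab])) for e in embeddings]


values = [0.0, 1.0, 4.0, 5.0, 2.0, 2.8, 6.0, 7.0]
batches = ["train"] * 4 + ["test"] * 4
train_labels = ["A", "A", "B", "B"]  # test labels are never passed in

emb = integrate(values, batches)              # uses all cells
cents = fit_centroids(emb[:4], train_labels)  # uses train embeddings only
test_pred = classify(cents, emb[4:])
print(test_pred)
```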

scottgigante-immunai commented 1 year ago

As far as I'm concerned, the models have to be able to use the expression data for the test cells to predict their labels. Whether you use that data in a semi-supervised or purely supervised way is up to you; I can't imagine a setting where you wouldn't be able to do it semi-supervised.

LuckyMD commented 1 year ago

Agree with @scottgigante-immunai; I actually think this is fine. The benchmark evaluates how these tools would be used from the user's perspective. When using scANVI alone to annotate new data, you need to train on both train+test. The only alternative I can think of is a forward pass on the most proximal batch, which is not a realistic way of working just to avoid seeing the test batch during training.