openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

[spatial_deconvolution] destVI stops extremely early #821

Closed: canergen closed this issue 1 year ago

canergen commented 1 year ago

Thanks for putting together this awesome package with the well-working Docker containers. I saw that destVI's results were impaired, also on the simulation we developed. I checked your code: destVI stops extremely early (130 of 2500 epochs). We don't recommend stopping that early. After removing early_stopping and increasing the number of epochs to 10k for exhaustive training, R2 went up to 0.87.

We will make a PR in the coming days to remove early stopping. There are two things I want to verify. First, the fix should also improve performance on the other datasets, so I would like to run destVI on all files. Is there a way to do this from the CLI instead of running one dataset at a time?

Second, do you take user-supplied trained AnnData objects to update your results page, or do you rerun the pipeline anyway (the second option looks preferable)?

I would also like to verify that early_stopping works in the CondSCVI model. My initial experiment suggests that it might (it does not stop early within 300 epochs). I would like to increase this to 1000 epochs with early stopping.
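
For reference, the check I have in mind looks roughly like this (a sketch only: the synthetic data is a placeholder, and the exact scvi-tools arguments may differ between versions):

import anndata as ad
import numpy as np
import scvi

# synthetic stand-in for a single-cell reference (placeholder data)
sc_adata = ad.AnnData(np.random.poisson(1.0, size=(200, 50)).astype(np.float32))
sc_adata.obs["label"] = np.random.choice(["A", "B"], size=sc_adata.n_obs)

scvi.model.CondSCVI.setup_anndata(sc_adata, labels_key="label")
sc_model = scvi.model.CondSCVI(sc_adata)
# 1000 epochs, stopping early on the validation ELBO if it plateaus
sc_model.train(max_epochs=1000, early_stopping=True)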

LuckyMD commented 1 year ago

Hi @cane11,

Thanks for your input here. I'm adding @giovp here, who is the task leader for spatial decomposition.

To answer at least one of your questions: we do rerun all of the results regularly. We are currently still pre-publication, so we are rerunning more frequently. If you can submit this PR this week, we will include it in the next rerun, which will (hopefully) happen on Friday (pending other fixes).

Questions on the CLI are best answered by @scottgigante-immunai.

scottgigante-immunai commented 1 year ago

Re: the CLI, unfortunately we don't currently have a straightforward way of running only part of the full benchmark. I would recommend something like the following (update: I tested this and it works inside the openproblems-python-pytorch Docker image):

cd openproblems
docker run -it -v /tmp:/tmp -v $(pwd):/usr/src/singlecellopenproblems singlecellopenproblems/openproblems-python-pytorch

and then inside docker:

import openproblems
task = openproblems.tasks.spatial_decomposition
adatas = {}

# load datasets
for dataset in task.DATASETS:
    adatas[dataset.metadata["dataset_name"]] = dataset()

# run destvi
adatas = {name: task.methods.destvi(adata) for name, adata in adatas.items()}

# run metrics
for dataset_name, adata in adatas.items():
    for metric in task.METRICS:
        print(dataset_name, metric.metadata["metric_name"], metric(adata))

scottgigante-immunai commented 1 year ago

Re: early stopping, at the very least we will definitely need early stopping enabled in test mode. Ideally, imho, early stopping would also be enabled in the full benchmark, but with a much longer delay, to allow proper convergence before stopping.

giovp commented 1 year ago

hi @cane11, it could be that the 300 epochs for the single-cell model and 2500 for the spatial methods were indeed taken from the tutorial, and they are potentially not optimal across datasets. Please go ahead and change them as you think best.

canergen commented 1 year ago

I generally think 2500 and 300 epochs are still fine (except for the very small datasets with ~100 spots used in some benchmarks), which is why I wrote the tutorial this way. I will run some comparisons today with 5000 and 2500 epochs, and with CondSCVI using early stopping. I first wanted exhaustive training, to establish whether our model has issues or whether it is the training procedure.

Early stopping currently monitors the training reconstruction loss, and the KL term is quite important for destVI (with KL warm-up, the reconstruction loss initially increases). The validation ELBO loss might be a better criterion; however, since we do not use full amortization by default, there is no way to hold out validation cells. Because there are no validation cells, we never benchmarked early_stopping and never made runtime arguments based on early stopping.

@adamgayoso: I think that even with a longer delay it will use the model with the lowest reconstruction loss, so it might not change anything. I'm not sure whether it finds better models by this metric over time, or whether it overfits early, before the KL term dominates.

@scottgigante-immunai: I am not sure I follow your argument that you need early stopping. My internal benchmark is that we don't take longer than Cell2location; do you have another runtime requirement? That could guide the number of epochs. I think it is best to PR tomorrow, have you run it on all datasets on Friday, and then have us look manually into datasets with low performance to see what's going wrong there. Is there disagreement on this approach?
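
If we did have validation cells, switching the monitored quantity would be a one-line change; a sketch, assuming scvi-tools' train() early-stopping kwargs (monitor names vary between versions, and sc_model is the CondSCVI model from the sketch above):

# stop on the validation ELBO instead of the training reconstruction loss
sc_model.train(
    max_epochs=1000,
    early_stopping=True,
    early_stopping_monitor="elbo_validation",
)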

LuckyMD commented 1 year ago

Regarding your question on @scottgigante-immunai's comment on early stopping: we have two ways of running each method, a test phase and full runs. In the test phase we just check that the method implementation is working, which happens via GitHub Actions. For this we need early stopping behind an if test statement. For the full run, where the results are posted to the website, this is not needed.
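
Concretely, the gating pattern looks roughly like this (a sketch, not the actual openproblems wrapper; the test flag mirrors how method functions are gated, and the setup calls and epoch counts are illustrative):

import scvi

def destvi(sc_adata, st_adata, test=False):
    # test phase: a handful of epochs, just enough to exercise the code path
    max_epochs_sc = 10 if test else 300
    max_epochs_st = 10 if test else 2500

    scvi.model.CondSCVI.setup_anndata(sc_adata, labels_key="label")
    sc_model = scvi.model.CondSCVI(sc_adata)
    sc_model.train(max_epochs=max_epochs_sc)

    scvi.model.DestVI.setup_anndata(st_adata)
    st_model = scvi.model.DestVI.from_rna_model(st_adata, sc_model)
    # full runs train exhaustively with no early stopping (see discussion above)
    st_model.train(max_epochs=max_epochs_st)

    st_adata.obsm["proportions"] = st_model.get_proportions()
    return st_adata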

scottgigante-immunai commented 1 year ago

> I think it is best to PR tomorrow, have you run it on all datasets on Friday, and then have us look manually into datasets with low performance to see what's going wrong there. Is there disagreement on this approach?

If you could test it manually, that would be preferable, as running the full benchmark costs quite a substantial amount of money in compute -- once we run the full benchmark this week, I can't guarantee when it will next be run. I am happy to assist with getting you up and running as needed.

canergen commented 1 year ago

Sounds reasonable. I will try as many datasets as possible today and create a PR later today.

For early stopping, it is dataset-dependent: in our tutorial with the lymph node dataset, early_stopping would work. I don't think I can provide a fix for this today. The most reliable criterion would most likely be whether the cell-type proportions change over time. The PR will contain early_stopping for CondSCVI, where I feel very comfortable that it is reliable (using elbo_validation).

[Screenshot attachment, 2023-02-09]

scottgigante-immunai commented 1 year ago

I think the trick would be setting the number of epochs after which early stopping is allowed to apply. Currently it looks like it kicks in after 45 epochs, which might be a little quick.

scottgigante-immunai commented 1 year ago

Also, to add some additional info here: running training locally, it looks like early stopping consistently does not kick in on the single-cell data, and does kick in (quite early, around 200-400 epochs) on the spatial data (which is also much faster to train), so perhaps increasing the threshold there is all that's needed.
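
Something along these lines might be enough; a sketch, assuming scvi-tools' train() early-stopping kwargs (st_model is the DestVI model from the sketch above, and the patience value is illustrative):

# raise the patience so the stopping criterion cannot fire too early;
# the 45 epochs observed above is consistent with a small default patience
st_model.train(
    max_epochs=2500,
    early_stopping=True,
    early_stopping_patience=200,
)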

canergen commented 1 year ago

@giovp As far as I can tell from the generated data, the deconvolution task uses 100 generated spots in the spatial assay. Do you have any reasoning for this small number of spots? We generally had assays with more than 1000 spots in mind when developing the tool, and we are challenged by the order-of-magnitude difference in the dataset sizes used for benchmarking. https://scvi-tools.org/blog/destvi-batchsize never found its way into scvi, as we decided that it was a fix solely for the purpose of dealing with benchmarks. I implemented an adapted version of the proposed solution in the PR (it's not really a solution, though). Increasing the KL weight in the PR is another way to stabilize training.
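
For illustration, upweighting the KL term can be expressed roughly like this (a sketch; whether plan_kwargs accepts this exact key depends on the scvi-tools version, and the weight value is arbitrary):

# pass a larger maximum KL weight to the training plan (version-dependent kwarg)
st_model.train(
    max_epochs=2500,
    plan_kwargs={"max_kl_weight": 5.0},
)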

scottgigante-immunai commented 1 year ago

Closed in #826