scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License
1.22k stars 346 forks

Memory error running CellAssign depending on size of marker gene set reference #2935

Closed sjspielman closed 1 month ago

sjspielman commented 2 months ago

I am trying to use CellAssign with a GPU, with varying success. I am finding that CellAssign gives memory availability errors (torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.57 GiB. GPU 0 has a total capacity of 14.58 GiB of which 3.35 GiB is free etc.....) depending on the size of the marker gene reference file. The reference I have prepared for use is 2274 genes x ~70 cell types. I can only get CellAssign to run without a memory allocation error if I drop the number of genes to ~400. The size of the AnnData itself doesn't seem to cause this problem, but larger reference files do.

I am wondering if this is a known issue, or if you have any recommendations for preparing marker gene reference sets that will not cause this sort of error? Thanks very much!

I am attaching here a version of the reference gene file that runs for me and one that does not -

I can't directly attach the dataset I am running it with due to sharing restrictions, but the dataset is freely available for download here. It's the SCPCL000001_processed_rna.h5ad file associated with the download for sample SCPCS000001: https://scpca.alexslemonade.org/projects/SCPCP000001. Note that I have encountered the same memory error with other AnnData files, including those up to nearly 1 GB which again run with the smaller marker gene reference but not the full set I hope to use.

Here is the crux of the code I am using to run CellAssign:

import anndata as ad
import numpy as np
import pandas as pd
import scvi
from scvi.external import CellAssign

scvi.settings.seed = 2024

# read in marker gene reference - this one runs
ref_matrix = pd.read_csv(
    "brain-reference-runs.tsv", sep="\t", index_col="ensembl_gene_id"
)

# read in anndata
adata = ad.read_h5ad("SCPCL000001_processed_rna.h5ad")

# subset anndata to contain only genes in the reference file
shared_genes = list(set(ref_matrix.index) & set(adata.var_names))
subset_adata = adata[:, shared_genes].copy()
subset_adata.X = subset_adata.X.tocsr()

# add size factor to subset adata (calculated from full data)
lib_size = adata.X.sum(1)
subset_adata.obs["size_factor"] = lib_size / np.mean(lib_size)

# CellAssign inference
scvi.external.CellAssign.setup_anndata(subset_adata, size_factor_key="size_factor")
model = CellAssign(subset_adata, ref_matrix)
model.train(accelerator="gpu")
predictions = model.predict()

Versions:

scvi-tools 1.1.5

I am running in a conda environment built from environment.yml.txt.

Thanks very much for any insights into what might be going on here!

canergen commented 2 months ago

Hi. It says about 3.5 GB is free on the GPU, which is very little. Can you use an empty GPU? A free Colab GPU would be sufficient for your purpose. I assume we could optimize memory use, but this won't be a priority in the coming months if it works fine on 16 GB of VRAM.
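
As a minimal illustrative sketch (not from this thread), one way to confirm how much GPU memory is actually free before training is to query PyTorch directly; torch.cuda.mem_get_info returns free and total bytes for a device:

import torch

# Report free vs. total memory on GPU 0 before calling model.train()
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"GPU 0: {free_bytes / 1024**3:.2f} GiB free of {total_bytes / 1024**3:.2f} GiB total")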

canergen commented 2 months ago

I assume it happens during the predict function. If it still fails, can you verify that? I don't plan on changing the model, but changing inference would be straightforward (and we have two solutions for it).

sjspielman commented 2 months ago

Hi @canergen, thanks for having a look! Indeed, I'm not at all sure why this GPU isn't empty to begin with. I have tried adding torch.cuda.empty_cache() into my code before running CellAssign, but it doesn't make a difference. This is something I will look into locally since I don't imagine it's related to scvi-tools. I will give it a go on Google Colab as well; if it runs there, the issue is definitely with my GPU setup.

But it does seem that the error is happening earlier, at the .train() step rather than predict. Here's my full output:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:293: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=10). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.

Training:   0%|          | 0/400 [00:00<?, ?it/s]
Epoch 1/400:   0%|          | 0/400 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/lightsail-user/gpu-cellassign-testing/OpenScPCA-analysis/analyses/gpu-cellassign-testing/cellassign.py", line 140, in <module>
    model.train(accelerator="gpu") 
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/external/cellassign/_model.py", line 235, in train
    return runner()
           ^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/train/_trainrunner.py", line 98, in __call__
    self.trainer.fit(self.training_plan, self.data_splitter)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/train/_trainer.py", line 220, in fit
    super().fit(*args, **kwargs)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 240, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 187, in run
    self._optimizer_step(batch_idx, closure)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 265, in _optimizer_step
    call._call_lightning_module_hook(
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/core/module.py", line 1291, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/core/optimizer.py", line 151, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 230, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/plugins/precision/precision.py", line 117, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/torch/optim/optimizer.py", line 484, in wrapper
    out = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
    ret = func(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/torch/optim/adam.py", line 205, in step
    loss = closure()
           ^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/plugins/precision/precision.py", line 104, in _wrap_closure
    closure_result = closure()
                     ^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 126, in closure
    step_output = self._step_fn()
                  ^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 315, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 382, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/train/_trainingplans.py", line 344, in training_step
    _, _, scvi_loss = self.forward(batch, loss_kwargs=self.loss_kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/train/_trainingplans.py", line 278, in forward
    return self.module(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/module/base/_decorators.py", line 32, in auto_transfer_args
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/module/base/_base_module.py", line 203, in forward
    return _generic_forward(
           ^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/module/base/_base_module.py", line 747, in _generic_forward
    generative_outputs = module.generative(**generative_inputs, **generative_kwargs)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/module/base/_decorators.py", line 32, in auto_transfer_args
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lightsail-user/.conda/envs/gpu/lib/python3.11/site-packages/scvi/external/cellassign/_module.py", line 185, in generative
    torch.sum(a * torch.exp(-b * torch.square(mu_ngcb - basis_means)), 3) + LOWER_BOUND
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.57 GiB. GPU 0 has a total capacity of 14.58 GiB of which 3.36 GiB is free. Process 2254 has 247.58 MiB memory in use. Including non-PyTorch memory, this process has 10.98 GiB memory in use. Of the allocated memory 10.84 GiB is allocated by PyTorch, and 18.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
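
For reference, the allocator option mentioned at the end of the error message can be tried with a minimal sketch like the one below; the environment variable has to be set before torch initializes CUDA (this is illustrative and was not tested in this thread). It mainly mitigates fragmentation, so it may not be enough when a single multi-GiB tensor is requested, as here.

import os

# Suggested by the error message to reduce allocator fragmentation;
# must be set before the first CUDA allocation (ideally before importing torch)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402
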
sjspielman commented 2 months ago

Hello, wanted to share a quick update. For now I have been working locally to figure out when the GPU fills up, running the code line by line in an interactive Python session with nvidia-smi -l 1 running so I can monitor usage.

The GPU indeed appears essentially free until I run the .train() line, at which point memory usage ramps up and the GPU fills. This is what I see up through and including the line model = CellAssign(subset_adata, ref_matrix):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0              26W /  70W |    395MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Then, when I issue model.train(accelerator="gpu"), I see the memory go up to 1407/15360MiB, but then it immediately fails with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.57 GiB. GPU 0 has a total capacity of 14.58 GiB of which 3.36 GiB is free. Process 2242 has 247.95 MiB memory in use. Including non-PyTorch memory, this process has 10.98 GiB memory in use. Of the allocated memory 10.84 GiB is allocated by PyTorch, and 18.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

...in spite of what looks like a decent amount of available memory as seen in the nvidia-smi output -

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0              26W /  70W |   1407MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2242      C   /usr/lib/x86_64-linux-gnu/dcv/dcvagent      247MiB |
|    0   N/A  N/A      7456      C   python3                                    1154MiB |
+---------------------------------------------------------------------------------------+
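
As an alternative to watching nvidia-smi, here is a minimal sketch of checking the same numbers from inside the Python session using PyTorch's own counters (illustrative, not part of the original workflow):

import torch

# PyTorch's view of GPU 0 memory, in GiB
allocated = torch.cuda.memory_allocated(0) / 1024**3
reserved = torch.cuda.memory_reserved(0) / 1024**3
peak = torch.cuda.max_memory_allocated(0) / 1024**3
print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB, peak: {peak:.2f} GiB")
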
canergen commented 1 month ago

Hi, can you pass a smaller batch_size, like model.train(accelerator="gpu", batch_size=32)? You can then increase it to maximize usage of GPU RAM. The current default is 1024. However, the model creates several tensors of shape n_cells x n_genes x n_celltypes, and this is apparently too large in your case.
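
A rough back-of-the-envelope sketch of why the default batch size fails here; the number of basis points (assumed to be 10) and float32 storage are assumptions about the CellAssign module internals, not values confirmed in this thread:

# Largest intermediate tensor in the generative step is roughly
# batch_size x n_genes x n_celltypes x n_basis floats
batch_size = 1024     # scvi-tools default
n_genes = 2274        # full marker reference in this issue
n_celltypes = 70
n_basis = 10          # assumed
bytes_per_float = 4   # float32 assumed

gib = batch_size * n_genes * n_celltypes * n_basis * bytes_per_float / 1024**3
print(f"~{gib:.1f} GiB per tensor at batch_size={batch_size}")

At batch_size=1024 this comes out to roughly 6 GiB, in the same ballpark as the 6.57 GiB allocation in the error, and it shrinks proportionally at batch_size=128 or 32.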

sjspielman commented 1 month ago

Hi, can you pass a smaller batch_size, like model.train(accelerator="gpu", batch_size=32)?

Thank you @canergen, this worked for me to get CellAssign running with my full marker gene set (if not the most efficiently, at least more efficiently than on CPU). Appreciate the help!

canergen commented 1 month ago

Great, 128 will likely still work, but you can check your memory use (consumption will be roughly linear with the batch size).

sjspielman commented 1 month ago

Great, 128 will likely still work, but you can check your memory use (consumption will be roughly linear with the batch size).

💯, I was able to get up to 128, but not >=256.