Open dannykwells opened 1 year ago
Hi @dannykwells, I have not encountered this error. It looks as though the model is not being trained on the GPU, could you check that your CUDA is actually working?
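One quick way to check this is the sketch below. It only uses standard PyTorch calls, and the helper name `cuda_smoke_test` is made up for illustration. It reports whether PyTorch sees a GPU and, if so, runs a small matrix product on it.

```python
import torch


def cuda_smoke_test():
    """Return a short report on whether PyTorch can actually use the GPU."""
    report = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Run a tiny computation on the GPU: an 8x8 matrix of ones times
        # itself has 64 entries equal to 8, so the sum must be 512.
        x = torch.ones(8, 8, device="cuda")
        report["matmul_ok"] = bool((x @ x).sum().item() == 512.0)
    return report


print(cuda_smoke_test())
```

If `cuda_available` is `False`, the model will silently run on the CPU, which would be worth ruling out first.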
After some investigation, I suspect this might have been fixed by PR #152. Could you try to install the repo by cloning it, rather than using pip? That way you should have the latest fixes. Let me know if this helps.
pip install git+https://github.com/theislab/scarches
should also work.
I encounter the same problem, even with the new installation from the GitHub repo as suggested by @cdedonno. I removed sparsity to see if anything changes; unfortunately, nothing does. Any help would be appreciated.
Ok, could any of you provide a minimal example that I could use to reproduce the issue and investigate it? And also your computing environment specifications? (I think torch and cuda versions should suffice)
thanks for the prompt response, i am re-installing and will make a re-run just to confirm and avoid a wild-goose chase! :)
Hey Carlo
I just installed scarches using: pip install git+https://github.com/theislab/scarches
torch.__version__
'1.13.1+cu116'
torch.version.cuda
'11.6'
I followed the scPoli tutorial from the docs as is for the module imports and the other parts, with some data-specific changes. The code is attached. Can I send you the data somehow? It is around 0.5 GB.
at the classify step I get
Traceback (most recent call last):
File "
I think I might have found the issue, but since I cannot reproduce your bug on my machine, can you please check if PR #172 fixes your bug? You'd need to either clone the repo and check out the scpoli/device_bug branch, or reinstall scarches using this command: pip install git+https://github.com/theislab/scarches.git@scpoli/device_bug
Since it was just merged into master, you could also just update the package.
thanks a million, I will retry the stuff.. appreciate your really prompt replies :)
Thanks @cdedonno - this is great. We will give it a shot soon and report back.
hey Carlo, @cdedonno an uninstall followed by re-install using the git link you sent, works! Thanks a lot !
Hi Carlo,
Unfortunately, the error is still there. I think I have narrowed it down:
>>> scpoli_model.train(
...     n_epochs=50,
...     pretraining_epochs=51,
...     early_stopping_kwargs=early_stopping_kwargs,
...     eta=5,
... )
|████████████████████| 100.0% - val_loss: 1040.7640380859 - val_trvae_loss: 1040.7640380859
>>> scpoli_model.train(
...     n_epochs=50,
...     pretraining_epochs=49,
...     early_stopping_kwargs=early_stopping_kwargs,
...     eta=5,
... )
 |███████████████████-| 98.0% - val_loss: 1049.6892264230 - val_trvae_loss: 1049.6892264230
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
>>> scpoli_model.train(
...     n_epochs=50,
...     early_stopping_kwargs=early_stopping_kwargs,
...     eta=5,
... )
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Looking at the code here:
if self.epoch == self.pretraining_epochs:
    self.initialize_prototypes()
    if (
        0 in self.train_data.labeled_vector.unique().tolist()
        or self.model.unknown_ct_names is not None
    ):
        self.prototype_optim = torch.optim.Adam(
            params=self.prototypes_unlabeled,
            lr=lr,
            eps=eps,
            weight_decay=self.weight_decay,
        )
I wonder if in torch.optim.Adam, it is trying to access self.prototypes_unlabeled on the cpu, but it was on the gpu originally so it can't be found? Any thoughts?
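One way to test a hypothesis like this is to list the device of every tensor stored on the trainer object just before the failing step. The sketch below is a generic helper, not part of scarches: `tensor_devices`, `has_mixed_devices`, and `ToyTrainer` are hypothetical names, and the toy attributes only mimic the situation described above.

```python
import torch


def tensor_devices(obj):
    """Map attribute name -> device for tensors stored directly on `obj`.

    Also looks one level into lists and tuples, since prototype tensors
    are sometimes kept in plain Python containers.
    """
    found = {}
    for name, value in vars(obj).items():
        if isinstance(value, torch.Tensor):
            found[name] = value.device
        elif isinstance(value, (list, tuple)):
            for i, item in enumerate(value):
                if isinstance(item, torch.Tensor):
                    found[f"{name}[{i}]"] = item.device
    return found


def has_mixed_devices(obj):
    """True if the tensors on `obj` live on more than one device."""
    return len(set(tensor_devices(obj).values())) > 1


class ToyTrainer:
    """Hypothetical stand-in for a trainer; NOT the scarches class."""

    def __init__(self, device):
        self.prototypes_labeled = torch.zeros(2, 3, device=device)
        self.prototypes_unlabeled = [torch.zeros(3)]  # left on the CPU


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
trainer = ToyTrainer(device)
for name, dev in tensor_devices(trainer).items():
    print(f"{name}: {dev}")
print("mixed devices:", has_mixed_devices(trainer))
```

On a GPU machine the toy object reports a mix of `cuda:0` and `cpu`, which is the kind of mismatch that would trigger the error if one tensor were used to index the other.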
Hi @dannykwells, bummer that the last PR did not solve the issue on your end. I still can't reproduce the bug on my machine, but I will investigate further. Does the traceback you get point to a specific line in the code?
@cdedonno - the traceback does not, but as I mentioned above, I think it is happening at line 370 of scpoli/trainer.py. My sense is that, as you are transitioning from pretraining to training, the code thinks the tensor is on the CPU when in fact it is on the GPU.
Could you show me the code you use to instantiate the model? Do you have partially labeled data? In a standard workflow, the condition to go through line 370 in the trainer should not be met during reference building.
Hi @cdedonno here is the entirety of the code - it is from the tutorial on scpoli:
import os
import torch
import numpy as np
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity
from scarches.dataset.trvae.data_handling import remove_sparsity
from scarches.models.scpoli import scPoli
import warnings
warnings.filterwarnings('ignore')
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))
plt.rcParams['figure.dpi'] = 200
plt.rcParams['figure.figsize'] = (4, 4)
adata = sc.read('test-data/pancreas (1).h5ad')
adata
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['study', 'cell_type'], wspace=0.5)
early_stopping_kwargs = {
    "early_stopping_metric": "val_prototype_loss",
    "mode": "min",
    "threshold": 0,
    "patience": 20,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}
condition_key = 'study'
cell_type_key = ['cell_type']
reference = [
    'inDrop1',
    'inDrop2',
    'inDrop3',
    'inDrop4',
    'fluidigmc1',
    'smartseq2',
    'smarter'
]
query = ['celseq', 'celseq2']
adata.obs['query'] = adata.obs[condition_key].isin(query)
adata.obs['query'] = adata.obs['query'].astype('category')
source_adata = adata[adata.obs.study.isin(reference)].copy()
source_adata = source_adata[~source_adata.obs.cell_type.str.contains('alpha')].copy()
target_adata = adata[adata.obs.study.isin(query)].copy()
source_adata, target_adata
scpoli_model = scPoli(
    adata=source_adata,
    condition_key=condition_key,
    cell_type_keys=cell_type_key,
    embedding_dim=3,
)
scpoli_model.train(
    n_epochs=50,
    pretraining_epochs=49,
    early_stopping_kwargs=early_stopping_kwargs,
    eta=5,
)
Thanks, I thought you were working on your own dataset. I will look into this early next week. I am sorry for the inconvenience.
No worries. Really appreciate all the help.
@dannykwells I am sorry I have not been able to look into this, I was wondering if maybe you figured it out? I have been performing many analyses using the model in the past days, using GPUs, and I have never encountered the error you mentioned.
I am running into this error too when I try to predict cell types for the query data. This is the error message I get:
----> 1 results_dict = scpoli_query.classify(
      2     query.X,
      3     query.obs['author']
      4 )
File /nfs/turbo/umms-ukarvind/vravik/scarches/lib/python3.9/site-packages/scarches/models/scpoli/scpoli_model.py:389, in scPoli.classify(self, x, c, prototype, get_prob, log_distance)
    380     pred, prob, weighted_distance = self.model.classify(
    381         x[batch, :].to(device),
    382         prototype=prototype,
   (...)
    385         log_distance=log_distance,
    386     )
    387 else:  # default routine, classify cell by cell
    388     pred, prob, weighted_distance = self.model.classify(
--> 389         x[batch, :].to(device),
    390         c[batch].to(device),
    391         prototype=prototype,
    392         classes_list=prototypes_idx,
    393         get_prob=get_prob,
    394         log_distance=log_distance,
    395     )
    396 preds += [pred.cpu().detach()]
    397 uncert += [prob.cpu().detach()]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
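The message itself says that an index tensor lives on a different device than the tensor being indexed; the trailing "(cpu)" names the device of the indexed tensor. The sketch below shows the general pattern for avoiding that mismatch by moving the indices to the data's device before indexing. It is an illustration of the PyTorch behavior, not the fix applied in scarches, and `index_rows` is a hypothetical helper.

```python
import torch


def index_rows(x, idx):
    """Select rows of `x`, first moving the index tensor to x's device."""
    return x[idx.to(x.device), :]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.arange(12.0).reshape(4, 3)         # data kept on the CPU
batch = torch.tensor([0, 2], device=device)  # indices may be on the GPU

# On a GPU machine, `x[batch, :]` would mix a CPU tensor with CUDA indices,
# which is the situation the error message describes. Aligning the devices
# first avoids it; the selected rows are then moved to the compute device.
rows = index_rows(x, batch).to(device)
print(rows)
```

On a CPU-only machine both tensors are already on the same device, so the helper is a no-op there.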
I've been running into the same error and, interestingly, for me classifying straight after load_query_data works. If train is called after loading query data, the problem starts occurring.
A little off topic but maybe someone can help me still: What's the rationale for running train after loading query data in the tutorial? Isn't the entire point to predict on previously unseen data?
Hi @chbeltz, thanks for reporting this. I still have not been able to reproduce this issue on my machine. I will try to look more into this in the coming weeks.
To answer your second question: during training on query data, only the new condition embeddings are learned, and the model is trained as a purely unsupervised model (assuming there are no cell type labels available in the query). Without this training step, the condition embeddings for the new query conditions would be those obtained from a random initialization. I hope this answers your question.
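The idea of learning only the new condition embeddings can be sketched in plain PyTorch as follows. This is a toy illustration under stated assumptions, not scPoli's implementation: the sizes, the `decoder` stand-in, and the gradient-mask approach are all chosen here for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_old, n_new, dim = 3, 2, 4  # hypothetical numbers of conditions

# Stand-in for the rest of the network; its weights are frozen.
decoder = nn.Linear(dim, 1)
for p in decoder.parameters():
    p.requires_grad_(False)

# Embedding table: rows [0, n_old) play the role of reference conditions,
# rows [n_old, n_old + n_new) are randomly initialized query conditions.
emb = nn.Embedding(n_old + n_new, dim)

# Zero the gradient of the reference rows so only the new rows are updated.
mask = torch.zeros(n_old + n_new, 1)
mask[n_old:] = 1.0
emb.weight.register_hook(lambda grad: grad * mask)

before = emb.weight.detach().clone()
opt = torch.optim.SGD([emb.weight], lr=0.1)

cond = torch.tensor([0, 3, 4, 3])  # a batch mixing old and new conditions
loss = decoder(emb(cond)).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()

print("old rows unchanged:", torch.equal(emb.weight[:n_old], before[:n_old]))
print("new rows updated:", not torch.equal(emb.weight[n_old:], before[n_old:]))
```

After one step, the reference rows are identical to their previous values while the query rows have moved away from their random initialization.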
Hi @cdedonno - we are running into the below error when we try to run the tutorial on an AWS GPU:
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Full traceback:
 |████████████████----| 80.0% - val_loss: 1066.5160086496 - val_trvae_loss: 1066.5160086496
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
Have you seen such an error before? Do you know how we might address it?