openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0

[Bug]: RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` #762

Closed: monkeycc closed this issue 1 year ago

monkeycc commented 1 year ago

Describe the bug

python tools/train.py --model patchcore
2022-12-05 20:33:20,501 - anomalib.data - INFO - Loading the datamodule
2022-12-05 20:33:20,502 - anomalib.pre_processing.pre_process - WARNING - Transform configs has not been provided. Images will be normalized using ImageNet statistics.
2022-12-05 20:33:20,502 - anomalib.pre_processing.pre_process - WARNING - Transform configs has not been provided. Images will be normalized using ImageNet statistics.
2022-12-05 20:33:20,502 - anomalib.models - INFO - Loading the model.
2022-12-05 20:33:20,507 - torch.distributed.nn.jit.instantiator - INFO - Created a temporary directory at /tmp/tmp5ojcalbj
2022-12-05 20:33:20,507 - torch.distributed.nn.jit.instantiator - INFO - Writing /tmp/tmp5ojcalbj/_remote_module_non_scriptable.py
2022-12-05 20:33:20,514 - anomalib.models.components.base.anomaly_module - INFO - Initializing PatchcoreLightning model.
/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `PrecisionRecallCurve` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
2022-12-05 20:33:21,594 - timm.models.helpers - INFO - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/wide_resnet50_racm-8234f177.pth)
2022-12-05 20:33:24,078 - anomalib.utils.loggers - INFO - Loading the experiment logger(s)
2022-12-05 20:33:24,078 - anomalib.utils.callbacks - INFO - Loading the callbacks
/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/anomalib/utils/callbacks/__init__.py:143: UserWarning: Export option: None not found. Defaulting to no model export
  warnings.warn(f"Export option: {config.optimization.export_mode} not found. Defaulting to no model export")
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - GPU available: True, used: True
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - TPU available: False, using: 0 TPU cores
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - IPU available: False, using: 0 IPUs
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - HPU available: False, using: 0 HPUs
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
2022-12-05 20:33:24,082 - pytorch_lightning.utilities.rank_zero - INFO - `Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
2022-12-05 20:33:24,082 - anomalib - INFO - Training the model.
2022-12-05 20:33:24,086 - anomalib.data.mvtec - INFO - Found the dataset.
2022-12-05 20:33:24,087 - anomalib.data.mvtec - INFO - Setting up train, validation, test and prediction datasets.
/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `ROC` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
2022-12-05 20:33:24,661 - pytorch_lightning.accelerators.gpu - INFO - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py:184: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
  "`LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer",
2022-12-05 20:33:24,665 - pytorch_lightning.callbacks.model_summary - INFO - 
  | Name                  | Type                     | Params
-------------------------------------------------------------------
0 | image_threshold       | AnomalyScoreThreshold    | 0     
1 | pixel_threshold       | AnomalyScoreThreshold    | 0     
2 | model                 | PatchcoreModel           | 24.9 M
3 | image_metrics         | AnomalibMetricCollection | 0     
4 | pixel_metrics         | AnomalibMetricCollection | 0     
5 | normalization_metrics | MinMax                   | 0     
-------------------------------------------------------------------
24.9 M    Trainable params
0         Non-trainable params
24.9 M    Total params
99.450    Total estimated model params size (MB)
Epoch 0:   0%|          | 0/10 [00:00<?, ?it/s]
/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:137: UserWarning: `training_step` returned `None`. If this was on purpose, ignore this warning...
  self.warning_cache.warn("`training_step` returned `None`. If this was on purpose, ignore this warning...")
2022-12-05 20:33:27,516 - anomalib.models.patchcore.lightning_model - INFO - Aggregating the embedding extracted from the training set.
2022-12-05 20:33:27,523 - anomalib.models.patchcore.lightning_model - INFO - Applying core-set subsampling to get the embedding.
Traceback (most recent call last):
  File "tools/train.py", line 76, in <module>
    train()
  File "tools/train.py", line 65, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 255, in on_advance_end
    self._run_validation()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 311, in _run_validation
    self.val_loop.run()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 199, in run
    self.on_run_start(*args, **kwargs)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 136, in on_run_start
    self._on_evaluation_start()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 253, in _on_evaluation_start
    self.trainer._call_lightning_module_hook("on_validation_start", *args, **kwargs)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/anomalib/models/patchcore/lightning_model.py", line 94, in on_validation_start
    self.model.subsample_embedding(embeddings, self.coreset_sampling_ratio)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/anomalib/models/patchcore/torch_model.py", line 142, in subsample_embedding
    coreset = sampler.sample_coreset()
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/anomalib/models/components/sampling/k_center_greedy.py", line 131, in sample_coreset
    idxs = self.select_coreset_idxs(selected_idxs)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/anomalib/models/components/sampling/k_center_greedy.py", line 95, in select_coreset_idxs
    self.features = self.model.transform(self.embedding)
  File "/home/ai/anaconda3/envs/PFM/lib/python3.7/site-packages/anomalib/models/components/dimensionality_reduction/random_projection.py", line 132, in transform
    projected_embedding = embedding @ self.sparse_random_matrix.T.float()
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Epoch 0:  70%|███████   | 7/10 [00:03<00:01,  2.17it/s, loss=nan]                                                    
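
For context, the failing call at the bottom of the traceback is an ordinary dense matrix multiply on the GPU (`embedding @ self.sparse_random_matrix.T.float()`). The sketch below is a minimal, standalone reproduction of that kind of call with placeholder shapes (not taken from this run); if it also raises `CUBLAS_STATUS_INVALID_VALUE`, the problem may lie in the CUDA/cuBLAS installation or driver rather than in anomalib itself. `CUDA_LAUNCH_BLOCKING=1` is set so the reported call site is reliable.

```python
# Minimal diagnostic sketch, not part of anomalib. Shapes are placeholders chosen to
# resemble a PatchCore embedding bank; adjust them to match your run if known.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous CUDA errors -> accurate stack traces

import torch

print(torch.__version__, "| CUDA build:", torch.version.cuda,
      "| GPU:", torch.cuda.get_device_name(0))

n_patches, n_features, n_components = 16384, 1536, 550  # placeholder sizes
embedding = torch.randn(n_patches, n_features, device="cuda")
projection = torch.randn(n_components, n_features, device="cuda")

# Same shape of operation as SparseRandomProjection.transform() in the traceback above.
projected = embedding @ projection.T.float()
torch.cuda.synchronize()
print("matmul OK:", projected.shape)
```

If this standalone matmul succeeds but training still fails, re-running `tools/train.py` with `CUDA_LAUNCH_BLOCKING=1` should at least confirm that the error really originates in the projection step rather than being an earlier, asynchronously reported failure.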

Dataset

MVTec

Model

PatchCore

Steps to reproduce the behavior

python tools/train.py --model patchcore

OS information

OS information:

Expected behavior

...

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

...

Logs

...

Code of Conduct

samet-akcay commented 1 year ago

@monkeycc, I cannot reproduce this error. Can you confirm whether you still experience this issue? Thanks!
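
A quick check that may help when confirming this (an assumption on my part: `CUBLAS_STATUS_INVALID_VALUE` from otherwise-working code is often an environment issue such as a PyTorch/driver mismatch or exhausted GPU memory) is to print the CUDA-related versions and free memory from the same conda environment used for training:

```python
# Hedged environment check; all calls are standard PyTorch, nothing anomalib-specific.
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
free, total = torch.cuda.mem_get_info()  # available in recent PyTorch releases
print(f"VRAM free/total: {free / 2**30:.1f} / {total / 2**30:.1f} GiB")
```

Comparing `torch.version.cuda` against the driver version reported by `nvidia-smi` usually narrows this kind of error down quickly.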