hcw-00 / PatchCore_anomaly_detection

Unofficial implementation of PatchCore anomaly detection
Apache License 2.0
317 stars 95 forks

it can train, but it can't test #18

Open leolv131 opened 2 years ago

leolv131 commented 2 years ago

After training, when I run the test it shows the error below. How can I solve this problem? (I trained with the default parameters of the code.)

RuntimeError: CUDA out of memory. Tried to allocate 19.70 GiB (GPU 0; 8.00 GiB total capacity; 301.95 MiB already allocated; 6.18 GiB free; 326.00 MiB reserved in total by PyTorch)

letmejoin commented 2 years ago

@leolv131 I met the same issue.

Traceback (most recent call last):
  File "train.py", line 452, in <module>
    trainer.test(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in test
    results = self._run(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 793, in dispatch
    self.accelerator.start_evaluating(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 99, in start_evaluating
    self.training_type_plugin.start_evaluating(trainer)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 148, in start_evaluating
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 804, in run_stage
    return self.run_evaluate()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in run_evaluate
    eval_loop_results = self.run_evaluation()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 170, in evaluation_step
    output = self.trainer.accelerator.test_step(args)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 245, in test_step
    return self.training_type_plugin.test_step(*args)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 164, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "train.py", line 376, in test_step
    score_patches = knn(torch.from_numpy(embedding_test).cuda())[0].cpu().detach().numpy()
  File "train.py", line 51, in __call__
    return self.predict(x)
  File "train.py", line 76, in predict
    dist = distance_matrix(x, self.train_pts, self.p) ** (1 / self.p)
  File "train.py", line 35, in distance_matrix
    dist = torch.pow(x - y, p).sum(2)
RuntimeError: CUDA out of memory. Tried to allocate 4.82 GiB (GPU 0; 10.73 GiB total capacity; 5.10 GiB already allocated; 4.47 GiB free; 5.15 GiB reserved in total by PyTorch)

It seems the distance matrix loaded onto CUDA is causing the problem, but I don't know how to tackle it. @hcw-00 Do you have any advice?
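
A minimal sketch of a chunked k-NN search that avoids materializing the full broadcast tensor built by torch.pow(x - y, p).sum(2). This is not the repo's code: the names knn_in_chunks, k=9, and chunk are illustrative assumptions, while x and train_pts mirror the KNN class in train.py.

```python
import torch

def knn_in_chunks(x, train_pts, k=9, chunk=1024):
    """x: (n_test_patches, d) query features; train_pts: (n_bank, d) memory bank."""
    all_topk = []
    for start in range(0, x.shape[0], chunk):
        q = x[start:start + chunk]                          # (chunk, d)
        # cdist yields a (chunk, n_bank) matrix instead of an (n, m, d) broadcast tensor
        d = torch.cdist(q, train_pts, p=2.0)
        all_topk.append(d.topk(k, dim=1, largest=False).values)
    return torch.cat(all_topk, dim=0)                       # (n_test_patches, k) distances
```

Note that chunking only bounds the query-by-memory-bank distance matrix; the memory bank itself still has to fit on the GPU (or be kept on CPU).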

leolv131 commented 2 years ago

@letmejoin What's your input size? My input size is 224 and the test needs about 19 GB of GPU memory.

letmejoin commented 2 years ago

@leolv131 I found a solution: set --coreset_sampling_ratio very small, e.g. 0.0001 as the author did. My input is 256x512.
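
A rough back-of-the-envelope for why a smaller --coreset_sampling_ratio helps: the naive distance computation allocates a float32 tensor of shape (n_test_patches, n_bank_patches, feature_dim), and the memory-bank size scales linearly with the ratio. The patch-grid size, image count, and 1536-dim embedding below are illustrative assumptions, not measurements from this repo.

```python
def dist_tensor_gib(n_test_patches, n_bank_patches, feature_dim, bytes_per_el=4):
    # size of the broadcast tensor built by x[:, None, :] - y[None, :, :]
    return n_test_patches * n_bank_patches * feature_dim * bytes_per_el / 1024**3

n_train_patches = 200 * 28 * 28   # hypothetical: 200 training images, 28x28 patch grid
for ratio in (0.01, 0.001, 0.0001):
    bank = int(n_train_patches * ratio)
    print(ratio, round(dist_tensor_gib(28 * 28, bank, 1536), 2), "GiB")
```

Under these assumptions the allocation drops from roughly 7 GiB at ratio 0.01 to well under 0.1 GiB at 0.0001.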

NguyenDangBinh commented 2 years ago

Dear all, how do I test the trained model?

XiaoPengZong commented 2 years ago

Hi everyone, setting --coreset_sampling_ratio small does not solve my problem. I think it is caused by a big pickle file: mine is 16 MB, while the pickle file for the MVTec AD dataset is about 1 MB.

I have also tried setting the batch size to 1, but nothing changed. So how can I solve this problem?
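
Batch size does not change the allocation because it is driven by the memory-bank size, not the number of test images per step. One possible workaround, sketched below and not part of this repo, is an exact nearest-neighbour index such as faiss that searches the memory bank without ever building the full distance matrix; the names embedding_coreset and knn_scores, and k=9, are assumptions about what the saved pickle contains.

```python
import faiss
import numpy as np

def knn_scores(embedding_coreset, embedding_test, k=9):
    """embedding_coreset: (n_bank, d) memory bank; embedding_test: (n_patches, d) test features."""
    index = faiss.IndexFlatL2(embedding_coreset.shape[1])    # exact L2 search, runs on CPU
    index.add(embedding_coreset.astype(np.float32))
    squared_dists, _ = index.search(embedding_test.astype(np.float32), k)
    return np.sqrt(squared_dists)                            # faiss returns squared L2 distances
```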

leolv131 commented 2 years ago

@XiaoPengZong I modified coreset_sampling_ratio when training.