Open lzd-1230 opened 3 months ago
Can you try it with a lower resolution or fewer images? It can be an out-of-memory error.
Yeah, I tried training with 30 images (1024×1024); the SSH session doesn't crash, and I get the following logs:
Traceback (most recent call last):
File "/home/lzd/patchcore-inspection/anomalib/train-padim.py", line 56, in <module>
engine.train(datamodule=datamodule, model=model)
File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/engine/engine.py", line 863, in train
self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
results = self._run_stage()
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
self.fit_loop.run()
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
self.advance()
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 141, in run
self.on_advance_end(data_fetcher)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 295, in on_advance_end
self.val_loop.run()
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 114, in run
self.on_run_start()
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 244, in on_run_start
self._on_evaluation_start()
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 290, in _on_evaluation_start
call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/base/memory_bank_module.py", line 37, in on_validation_start
self.fit()
File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/image/padim/lightning_model.py", line 86, in fit
self.stats = self.model.gaussian.fit(embeddings)
File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/stats/multi_variate_gaussian.py", line 136, in fit
return self.forward(embedding)
File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/stats/multi_variate_gaussian.py", line 117, in forward
covariance = torch.zeros(size=(channel, channel, height * width), device=device)
RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 79298560000 bytes. Error code 12 (Cannot allocate memory)
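The failed allocation lines up exactly with the covariance tensor shape in the traceback. A quick sanity check (the 550 features and the 256×256 feature map are assumptions, not from the log: 550 is the usual PaDiM feature count for wide_resnet50_2, and a 1024×1024 input would yield a 256×256 map at stride 4):

```python
# Size of the covariance tensor PaDiM allocates in
# MultiVariateGaussian.forward: (channel, channel, height * width), float32.
def covariance_bytes(channels: int, height: int, width: int) -> int:
    return channels * channels * height * width * 4  # 4 bytes per float32


# Assumed values: 550 selected features, 256x256 feature map.
print(covariance_bytes(550, 256, 256))  # 79298560000 -- matches the error
```

So the allocation that fails is roughly 74 GiB for the covariance alone, before any other buffers.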
I'm using tiling because the model doesn't perform well on high-resolution images, but tiling doesn't seem to be well supported for PaDiM. I can tile successfully with PatchCore.
Does PaDiM work if you use it without tiling? It could just be the different memory requirements of PaDiM and PatchCore.
I think this is indeed an out-of-memory issue, but it's rather unusual that PatchCore works and PaDiM doesn't.
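One plausible explanation (a back-of-the-envelope sketch, not a measurement): PaDiM fits a full C×C covariance matrix at every feature-map location, while PatchCore only stores a coreset-subsampled bank of C-dimensional patch features. Assuming 550 channels, a 256×256 feature map, 30 training images, and a 10% coreset ratio (all illustrative values):

```python
def padim_gaussian_bytes(c: int, h: int, w: int) -> int:
    # Per-location mean (c, h*w) plus full covariance (c, c, h*w), float32.
    return (c * h * w + c * c * h * w) * 4


def patchcore_bank_bytes(c: int, h: int, w: int,
                         n_images: int, coreset_ratio: float = 0.1) -> int:
    # Coreset-subsampled memory bank of c-dimensional patch features, float32.
    return round(n_images * h * w * coreset_ratio) * c * 4


print(padim_gaussian_bytes(550, 256, 256) / 2**30)      # ~74 GiB
print(patchcore_bank_bytes(550, 256, 256, 30) / 2**20)  # ~412 MiB
```

That quadratic-in-channels covariance term is why PaDiM can blow up at resolutions where PatchCore is still comfortable.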
Describe the bug
Here is the code for training. I have 150 good pictures in good-1024-s for training, and after I run this script the SSH session just drops and seems to crash for some reason, without any error message.
Dataset
Folder
Model
PADiM
Steps to reproduce the behavior
Run the code with the same 150 images at 1024×1024.
OS information
Expected behavior
Expected training to run to completion.
Pip/GitHub
GitHub