hmorimitsu / ptlflow

PyTorch Lightning Optical Flow models, scripts, and pretrained weights.
Apache License 2.0

When adding a new dataset, I can't run it together with another dataset #76

Open yoelshimi opened 1 week ago

yoelshimi commented 1 week ago

Hi, I recently created my own "Homography" dataset using your framework. When I try to train a model on it together with another standard dataset, I get the following series of errors.

Setup:

  • model: neuflow (I tried switching models; it didn't help)
  • datasets: sintel + homography
  • single GPU
  • run command: ptlflow/train.py neuflow --train_dataset sintel+homography --val_dataset sintel+homography --train_transform_cuda --train_num_workers 2 --train_batch_size 8

When I start the run, I get:

File ".../python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

running with just sintel works OK, and just homography dataset also.

If I add torch.multiprocessing.set_start_method("spawn") at the top of train.py, before lightning is imported, then I later get the following error when the dataloader tries to move my data to the GPU:

File "/home/.../work/OIS/ptlflow/ptlflow/data/flow_transforms.py", line 135, in __call__
    inputs[k] = torch.from_numpy(v).to(device=self.device, dtype=self.dtype)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

When looking at the GPU usage, I find that it isn't being used.

I'd appreciate your help. Thanks!
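For context, "Cannot re-initialize CUDA in forked subprocess" is PyTorch's generic error when CUDA is touched inside a worker process created with the default fork start method. Below is a minimal sketch of the spawn workaround described above, applied at a script's entry point; this is generic PyTorch, not ptlflow's actual train.py:

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # Switch worker creation from fork to spawn before any dataloaders are built.
    # force=True avoids an error if a start method was already set elsewhere.
    mp.set_start_method("spawn", force=True)
    # ... import lightning / build the model and trainer only after this point ...
```

An alternative with a similar effect on a single loader is DataLoader's multiprocessing_context="spawn" argument, although spawned workers are slower to start and everything passed to them must be picklable.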

I tried this in Python 3.8, then 3.10.9 (currently running). Relevant packages:

pytorch-cuda        11.6     h867d48c_1   pytorch
pytorch-lightning   2.4.0    pypi_0       pypi
pytorch-msssim      1.0.0    pypi_0       pypi
pytorch-mutex       1.0      cuda         pytorch
torch               2.1.0    pypi_0       pypi
torchmetrics        1.4.3    pypi_0       pypi
torchsummaryx       1.3.0    pypi_0       pypi
torchvision         0.16.0   pypi_0       pypi

conda list | grep lightning:

lightning             1.9.0    pypi_0   pypi
lightning-cloud       0.5.70   pypi_0   pypi
lightning-utilities   0.11.7   pypi_0   pypi
pytorch-lightning     2.4.0    pypi_0   pypi

yoelshimi commented 1 week ago

the spawn idea comes from: https://stackoverflow.com/questions/72779926/gunicorn-cuda-cannot-re-initialize-cuda-in-forked-subprocess

hmorimitsu commented 1 week ago

Hi, thanks for reporting.

I think this error is caused by using --train_transform_cuda with multiple GPUs. You can try to remove this flag, or use a single GPU.

Unfortunately, I am also not sure what causes this error, and it requires further debugging. In my personal tests this happens with some combinations of datasets. Sometimes the behavior also changes depending on the machine.

I hope that helps.

Best.
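For background on why a CUDA transform inside dataloader workers is fragile: each worker is a separate process, and forked processes cannot (re)initialize CUDA, while spawned ones bring their own overhead. A common pattern is to keep worker-side transforms on the CPU and move the batch to the GPU in the main process. A minimal, generic PyTorch sketch (the dataset, keys, and shapes are made up for illustration, not ptlflow code):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class DummyFlowDataset(Dataset):
    """Toy dataset returning CPU tensors only, so workers never touch CUDA."""

    def __len__(self):
        return 16

    def __getitem__(self, idx):
        img = np.random.rand(3, 64, 64).astype(np.float32)
        flow = np.random.rand(2, 64, 64).astype(np.float32)
        return {"images": torch.from_numpy(img), "flows": torch.from_numpy(flow)}


if __name__ == "__main__":
    loader = DataLoader(DummyFlowDataset(), batch_size=4, num_workers=2)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for batch in loader:
        # Move tensors to the GPU here, in the main process (e.g. inside
        # training_step), rather than inside worker-side transforms.
        batch = {k: v.to(device, non_blocking=True) for k, v in batch.items()}
```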

yoelshimi commented 1 week ago

Thanks for your response. I removed the --train_transform_cuda flag and re-ran on a single GPU, and now I get:

File "/home/.../miniconda3/envs/./lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 24803) exited unexpectedly

This happens after it finishes loading the dataset into memory, on the first training epoch/step. Note that validation on multiple datasets does seem to work.

I also changed the dataloader definition in base_model.py so that the timeout parameter is very large, so I don't get a timeout.
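As a side note, "DataLoader worker exited unexpectedly" hides the original exception; one way to surface it is to iterate the combined training loader in the main process with num_workers=0. A rough sketch, assuming an already-constructed ptlflow model object (the variable name model and the rewrapping are purely illustrative):

```python
from torch.utils.data import DataLoader

# `model` is assumed to be an already-built ptlflow LightningModule whose
# train_dataloader() returns the combined sintel+homography loader.
orig_loader = model.train_dataloader()
debug_loader = DataLoader(
    orig_loader.dataset,
    batch_size=1,
    num_workers=0,                   # no worker processes, so the real traceback is raised here
    collate_fn=orig_loader.collate_fn,
)
for i, batch in enumerate(debug_loader):
    if i >= 100:                     # stop after a bounded number of samples
        break
```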

hmorimitsu commented 1 week ago

Hmm, I have never had this error before. Does this only happen when using your new dataset or does it also happen when training with sintel only?

Does the stack trace tell where this error starts in the ptlflow code?

yoelshimi commented 1 week ago

Hi, the full trace:

Oops! <class 'RuntimeError'> occurred.
(<class 'RuntimeError'>, RuntimeError('DataLoader worker (pid(s) 16143) exited unexpectedly'), <traceback object at 0x7f0aad647080>)
Traceback (most recent call last):
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/multiprocessing/queues.py", line 114, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scripts/module_wrapper.py", line 128, in main
    mod.main(arguments.split())
  File "/home/yoels/work/OIS/ptlflow/train_OIS.py", line 236, in main
    train(args)
  File "/home/yoels/work/OIS/ptlflow/train_OIS.py", line 189, in train
    trainer.fit(model)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1103, in _run
    results = self._run_stage()
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in _run_stage
    self._run_train()
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1205, in _run_train
    self.fit_loop.run()
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/epoch/training_epoch_loop.py", line 187, in advance
    batch = next(data_fetcher)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/utilities/fetching.py", line 265, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/supporters.py", line 569, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/supporters.py", line 581, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1277, in _get_data
    success, data = self._try_get_data(self._timeout)
  File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 16143) exited unexpectedly

I get these errors only when I train with both datasets (validation works OK). I wrote a collate function myself, but this occurs often even when I set num_workers=0 and batch_size=1. I haven't checked many other datasets recently, so I'll retry that.
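One common reason a combined dataset crashes workers while each dataset works alone is that the two sources return samples with different keys or tensor shapes, which breaks batching. A small illustrative collate wrapper that fails with an explicit message instead of a dead worker; this is not the custom collate mentioned above, and checked_collate is a made-up name:

```python
import torch
from torch.utils.data import default_collate


def checked_collate(samples):
    """Collate dict samples, raising a descriptive error if keys or shapes differ."""
    keys = set(samples[0].keys())
    for s in samples[1:]:
        if set(s.keys()) != keys:
            raise ValueError(f"Inconsistent sample keys: {set(s.keys())} vs {keys}")
    for k in keys:
        shapes = {tuple(s[k].shape) for s in samples if torch.is_tensor(s[k])}
        if len(shapes) > 1:
            raise ValueError(f"Key '{k}' has mismatched shapes across samples: {shapes}")
    return default_collate(samples)
```

It can be passed as collate_fn= wherever the DataLoader is created, so shape mismatches surface as a readable error in the main process.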

hmorimitsu commented 1 week ago

OK, thank you. There's not much to see in the stack trace.

Unfortunately, I also don't know what the cause of this problem is. Here are a few suggestions that may help:

yoelshimi commented 1 week ago

Hi,

  • When I train with the Homography dataset only but validate with homography + sintel, it runs (it has been training for a day or so now).
  • When I train with other datasets, I still seem to get a similar error, but not in the same place.
  • The run command is: train.py -A neuflow --train_dataset autoflow+homography --val_dataset sintel+chairs --train_num_workers 96 --train_batch_size 8
  • The error (with a partial trace) is:
    File "/.../ptlflow/ptlflow/models/neuflow/backbone.py", line 35, in forward: return self.norm(x1 + x2)
    File "/.../ptlflow/ptlflow/models/neuflow/backbone.py", line 128, in forward: x2 = self.block2(img)
    File "/.../ptlflow/ptlflow/models/neuflow/neuflow.py", line 145, in forward: feature0_s8, feature0_s16 = self.backbone(img0)
    File "/.../ptlflow/ptlflow/models/base_model/base_model.py", line 411, in training_step: preds = self(batch)
    RuntimeError: DataLoader worker (pid 31543) is killed by signal: Killed.
  • It appears even when I switch datasets, and the error happens even when the entire batch is from the same dataset. The exact location can change a bit, but the error is always that a worker was killed.

I don't think it's a memory issue, but it could be; I'm running on a single RTX A6000 with 16 CPUs and 64 GB RAM overall.

Do you have any further ideas? Thanks
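A note on "killed by signal: Killed": that is a SIGKILL, which on Linux most often comes from the out-of-memory killer (checking dmesg for "Out of memory" lines right after a crash confirms it). A small sketch for tracking resident memory of the training process and its dataloader workers; psutil is an extra dependency, and log_memory is just an illustrative helper, not part of ptlflow:

```python
import os
import psutil


def log_memory(tag=""):
    """Print the combined resident memory (MB) of this process and its children."""
    main = psutil.Process(os.getpid())
    rss = main.memory_info().rss
    for child in main.children(recursive=True):
        try:
            rss += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a worker exited between listing and querying
    print(f"[mem{tag}] total RSS: {rss / 1e6:.0f} MB")
```

Calling this every few training steps shows whether memory keeps growing toward the 64 GB limit before a worker gets killed.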

hmorimitsu commented 6 days ago

Have you tried to use sintel+chairs in the --train_dataset as well? I also wonder if --train_num_workers 96 is creating too many worker processes, which could be causing them to be killed.
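As a rough rule of thumb (not something from this thread's configuration), the worker count is usually kept at or below the number of CPU cores; with 16 CPUs, 96 workers means heavy oversubscription, and each extra worker adds its own memory overhead. A trivial sketch of a conservative choice:

```python
import os

# Conservative default: no more workers than CPU cores, and usually far fewer.
num_workers = min(8, os.cpu_count() or 1)
```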


yoelshimi commented 6 days ago

Hi, after multiple attempts, it seems as follows: if I use the same datasets for train and validation (sintel+homography for both), then train.py runs and I don't get the signal-kill error (at least not yet, after 1 hour). I'm not sure of the cause; I'll keep you updated if I find something out.

Thanks