kbressem / prostate158


Error in running train.py #9

Closed SaileshAI closed 1 month ago

SaileshAI commented 5 months ago

Hi

I am trying to run the training script with the provided dataset and am getting the error below:

```
run is terminating due to exception: 't2_meta_dict' [00:00<?]
2024-05-10 17:04:33,995 - ERROR - Exception: 't2_meta_dict'
Traceback (most recent call last):
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/ignite/engine/engine.py", line 1069, in _run_once_on_dataset_as_gen
    self._fire_event(Events.ITERATION_COMPLETED)
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/monai/handlers/metrics_saver.py", line 124, in _get_filenames
    meta_data = self.batch_transform(engine.state.batch)
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/repo/prostate158/train.py", line 385, in _get_meta_dict
    return [item[key] for item in batch]
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/repo/prostate158/train.py", line 385, in <listcomp>
    return [item[key] for item in batch]
KeyError: 't2_meta_dict'
Engine run is terminating due to exception: 't2_meta_dict'
2024-05-10 17:04:34,027 - ERROR - Exception: 't2_meta_dict'
```

I have not modified any script. Any idea why it is failing?

kbressem commented 5 months ago

With newer MONAI versions the API changed; they now use MetaTensor. Try downgrading MONAI to the version before MetaTensor was introduced. Maybe this helps.
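
If it helps, here is a minimal sanity check, assuming MetaTensor first appeared around MONAI 0.9 (the exact cutoff is an assumption, please verify against the MONAI changelog):

```python
# Hedged version check, assuming MetaTensor arrived around MONAI 0.9.
# If the assertion fails, downgrade, e.g. with:  pip install "monai<0.9"
import monai
from packaging import version

assert version.parse(monai.__version__) < version.parse("0.9.0"), (
    f"MONAI {monai.__version__} returns MetaTensors instead of the "
    "'*_meta_dict' entries that train.py's _get_meta_dict looks up"
)
```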

SaileshAI commented 5 months ago

Hi @kbressem, yes, thanks for this, I was able to start the training script. However, after 2 epochs, in epoch 3, I encountered the following issue:

```
Epoch [3/500]: [24/60] 40%|████████████████████████████████████████████████████████████▊ , loss=1.8 [00:26<00:30]
Current run is terminating due to exception: received 0 items of ancdata
Exception: received 0 items of ancdata
Traceback (most recent call last):
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/ignite/engine/engine.py", line 1032, in _run_once_on_dataset_as_gen
    self.state.batch = next(self._dataloader_iter)
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/cognida/Desktop/Lens/Sanskar/prostate158/prostate/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
Engine run is terminating due to exception: received 0 items of ancdata
```

Is it another dependency version issue, or something else?

kbressem commented 5 months ago

This means a worker died in the dataloader. This is a PyTorch issue. Try reducing the number of workers in the data loader.
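
If it helps, here is a minimal sketch of the two usual workarounds, assuming the DataLoader is built directly (in this repo the worker count most likely comes from the training config instead):

```python
import torch.multiprocessing
from torch.utils.data import DataLoader

# Option 1: fewer worker processes (num_workers=0 disables multiprocessing entirely).
train_loader = DataLoader(
    train_dataset,   # hypothetical dataset object
    batch_size=4,
    num_workers=2,   # reduced from e.g. 8
    pin_memory=True,
)

# Option 2, sometimes suggested for "received 0 items of ancdata":
# share tensors through the file system instead of file descriptors,
# so workers do not exhaust the per-process file-descriptor limit.
torch.multiprocessing.set_sharing_strategy("file_system")
```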

SaileshAI commented 5 months ago

> This means a worker died in the dataloader. This is a PyTorch issue. Try reducing the number of workers in the data loader.

I see, and yes, setting num_workers to a lower number worked. Any idea how I can run inference with the trained models? Do I infer over an image or a .nii file? (I am kind of new to radiology images in NIfTI format.) I am asking because I could not find a script to infer over a single image/NIfTI sample.


kbressem commented 3 months ago

You can add the image to the test dataset and then infer over it. This would be the most straightforward way with this library. The README shows the code at the bottom.
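
For a single case, a rough sketch with plain MONAI is shown below; the checkpoint name, transform chain, and ROI size are assumptions, not this repo's exact code, so adapt them to your config:

```python
import torch
from monai.inferers import sliding_window_inference
from monai.transforms import (
    Compose, EnsureChannelFirstd, EnsureTyped, LoadImaged, ScaleIntensityd,
)

# Hypothetical preprocessing; mirror whatever your training config used.
transforms = Compose([
    LoadImaged(keys="t2"),
    EnsureChannelFirstd(keys="t2"),
    ScaleIntensityd(keys="t2"),
    EnsureTyped(keys="t2"),
])

data = transforms({"t2": "/path/to/case_t2.nii.gz"})  # hypothetical path
image = data["t2"].unsqueeze(0)                       # add a batch dimension

model = torch.load("model.pt", map_location="cpu")    # hypothetical checkpoint
model.eval()
with torch.no_grad():
    logits = sliding_window_inference(
        image, roi_size=(96, 96, 96), sw_batch_size=4, predictor=model
    )
segmentation = logits.argmax(dim=1)                    # per-voxel class labels
```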
