ivadomed / model-spinal-rootlets

Deep-learning based segmentation of the spinal nerve rootlets

nnUNetv2 problem with changing patch_size (hc-leipzig-7t-mp2rage) #70

Closed KaterinaKrejci231054 closed 1 week ago

KaterinaKrejci231054 commented 1 month ago

nnUNetv2 problem with changing patch_size

Based on the information from the Ivadomed meeting, I took the following steps with the hc-leipzig-7t-mp2rage dataset:

Screenshot from 2024-07-25 16-46-04

@naga-karthik and @valosekj, have you had a similar experience with nnUNet training? Do you have any suggestions for how to handle this error, please?

error

```python
Traceback (most recent call last):
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 274, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 210, in run_training
    nnunet_trainer.run_training()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1295, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 922, in train_step
    output = self.network(data)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 786, in _convert_frame
    result = inner_convert(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
    return _compile(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
    transformations(instructions, code_options)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
    return fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 500, in transform
    tracer.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
    super().run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 489, in wrapper
    return inner_fn(self, inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1219, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 674, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/nn_module.py", line 336, in call_function
    return tx.inline_user_function_return(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 680, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2285, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2399, in inline_call_
    tracer.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 489, in wrapper
    return inner_fn(self, inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1260, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 674, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 335, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 289, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 680, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2285, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2399, in inline_call_
    tracer.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 489, in wrapper
    return inner_fn(self, inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1219, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 674, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/torch.py", line 679, in call_function
    tensor_variable = wrap_fx_proxy(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1330, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1415, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1714, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1656, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1657, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1782, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1764, in run_node
    return node.target(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 896, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1241, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 966, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1458, in _dispatch_impl
    r = func(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_ops.py", line 594, in __call__
    return self_._op(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_prims_common/wrappers.py", line 252, in _fn
    result = fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_prims_common/wrappers.py", line 137, in _fn
    result = fn(**bound.arguments)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_refs/__init__.py", line 2799, in cat
    return prims.cat(filtered, dim).clone(memory_format=memory_format)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_ops.py", line 594, in __call__
    return self_._op(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_prims/__init__.py", line 1917, in _cat_meta
    torch._check(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/__init__.py", line 1140, in _check
    _check_with(RuntimeError, cond, message)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/__init__.py", line 1123, in _check_with
    raise error_type(message_evaluated)
torch._dynamo.exc.TorchRuntimeError: Failed running call_function (*((FakeTensor(..., device='cuda:0', size=(2, 320, 16, 24, 12), dtype=torch.float16, grad_fn=), FakeTensor(..., device='cuda:0', size=(2, 320, 16, 23, 12), dtype=torch.float16, grad_fn=)), 1), **{}):
Sizes of tensors must match except in dimension 1. Expected 24 but got 23 for tensor number 1 in the list

from user code:
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/dynamic_network_architectures/architectures/unet.py", line 62, in forward
    return self.decoder(skips)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/unet_decoder.py", line 110, in forward
    x = torch.cat((x, skips[-(s+2)]), 1)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```

valosekj commented 1 month ago

> tried to modify the patch_size parameter (original: [128, 192, 96]) in the nnUNetPlans.json file to median_image_size_in_voxels.
>
> Screenshot from 2024-07-25 16-46-04

Okay. Based on our in-person discussion today, I thought that you were modifying the patch size only along the S-I axis (to ensure that the model always has context about all the rootlet levels). But based on the screenshot, you're actually changing all the axes. So the problem might indeed be related to a memory issue.
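For reference, changing the patch size along a single axis could look roughly like the sketch below. It works on an in-memory dict that mimics the relevant part of nnUNetPlans.json. The helper name `set_patch_size_along_axis`, the configuration name `3d_fullres`, the axis index, and the new value are all assumptions chosen for illustration; which index corresponds to S-I depends on how the data are oriented.

```python
import copy


def set_patch_size_along_axis(plans, configuration, axis, new_size):
    """Return a copy of `plans` with patch_size changed along one axis only.

    Hypothetical helper for illustration; not part of nnUNet.
    """
    updated = copy.deepcopy(plans)  # leave the original plans untouched
    patch_size = updated["configurations"][configuration]["patch_size"]
    patch_size[axis] = new_size
    return updated


# Minimal stand-in for the relevant part of nnUNetPlans.json.
plans = {"configurations": {"3d_fullres": {"patch_size": [128, 192, 96]}}}

# Axis index 0 and the value 256 are placeholders, not the values used in this issue.
updated = set_patch_size_along_axis(plans, "3d_fullres", axis=0, new_size=256)
print(updated["configurations"]["3d_fullres"]["patch_size"])
```

With a real plans file, the dict would be read with `json.load` and written back with `json.dump` before starting training.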

naga-karthik commented 1 month ago

I agree with Jan's comment about the memory issue. The patch size might be too big! And, more importantly, the patch size you chose is not divisible by 2**x, where x = 3, 4, or 5. During training, the patch is halved repeatedly, once per downsampling stage in nnUNet (maybe 4 or 5), so it's usually good to ensure that every dimension of the patch size you choose is divisible by 2**4 (=16) or 2**5 (=32).
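A rough sketch of why the divisibility matters, assuming each downsampling stage maps a length n to ceil(n / 2) and each decoder stage doubles the length before concatenating with the skip connection. The function names are made up for illustration, and the axis length 184 is a hypothetical value picked only because it reproduces the same "Expected 24 but got 23" pattern as the traceback; it is not taken from the actual plans file.

```python
import math


def divisible_axes(patch_size, num_downsamplings):
    """For each axis, report whether its length is divisible by 2**num_downsamplings."""
    factor = 2 ** num_downsamplings
    return [dim % factor == 0 for dim in patch_size]


def first_mismatch(size, num_downsamplings):
    """Follow one axis through encoder and decoder.

    Returns (upsampled_length, skip_length) for the first decoder stage where
    the two differ, or None if every stage lines up.
    """
    # Encoder: assumed stride-2 downsampling, length n -> ceil(n / 2).
    encoder = [size]
    for _ in range(num_downsamplings):
        encoder.append(math.ceil(encoder[-1] / 2))

    # Decoder: assumed x2 upsampling, then concatenation with the matching skip.
    for level in range(num_downsamplings, 0, -1):
        upsampled = 2 * encoder[level]
        skip = encoder[level - 1]
        if upsampled != skip:
            return upsampled, skip
    return None


print(divisible_axes([128, 192, 96], 5))  # original patch size from the plans file
print(first_mismatch(192, 5))             # divisible by 32: no mismatch
print(first_mismatch(184, 5))             # hypothetical length divisible by 8 only
```

The number of downsampling stages can differ per axis in nnUNet, so the check is best run per axis with the stage count from the plans file.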

valosekj commented 1 month ago

fyi I manually modified the patch_size for lumbar model training and training has started; details: https://github.com/ivadomed/model-spinal-rootlets/issues/67#issuecomment-2252641123

KaterinaKrejci231054 commented 1 month ago

Thanks for the suggestions and the help, @valosekj and @naga-karthik. I tried modifying only the SI patch size. With a value of 368 (23 × 16) in SI, it crashed again because of memory, so I tried a smaller multiple, 352 (22 × 16), and with that the training started correctly.

image
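The candidate values tried above can be listed with a small helper like the one below. This is only a sketch: `multiples_at_or_below` is a made-up name, and the factor 16 follows the 2**4 suggestion earlier in the thread.

```python
def multiples_at_or_below(limit, factor=16, count=3):
    """List up to `count` positive multiples of `factor`, largest first, none above `limit`."""
    largest = (limit // factor) * factor
    candidates = [largest - i * factor for i in range(count)]
    return [c for c in candidates if c > 0]


print(multiples_at_or_below(368))  # [368, 352, 336]
```

Stepping down one multiple at a time until training fits in GPU memory matches what worked here: 368 ran out of memory and 352 trained.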