Closed PonteIneptique closed 1 year ago
I should add that this specifically happens with arrow datasets. I have not checked whether it happens with other forms of datasets (and the issue when not using PL might be unrelated).
It's probably not a real memory leak but memory fragmentation, which causes pytorch's caching allocator to grab more memory in order to have sufficiently large unused blocks. I've seen that in the pretraining code. PL doesn't seem to move the metrics to CPU when logging by default, so that could be one of the reasons. I'll push a patched version setting move_metrics_to_cpu to True tomorrow. That might be all that is needed.
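One rough way to tell fragmentation or caching apart from a true leak is to compare the memory held by live tensors with the memory the caching allocator has reserved. The sketch below assumes a CUDA build of PyTorch is installed; the helper names `unused_cached_fraction` and `report_cuda_memory` are made up for illustration and are not part of kraken or PL.

```python
def unused_cached_fraction(allocated_bytes: int, reserved_bytes: int) -> float:
    """Fraction of the allocator's reserved memory that backs no live tensor."""
    if reserved_bytes <= 0:
        return 0.0
    return (reserved_bytes - allocated_bytes) / reserved_bytes


def report_cuda_memory(device: int = 0) -> str:
    """Summarise allocator state; meant to be called e.g. once per epoch."""
    import torch  # imported lazily so the helper above works without torch

    allocated = torch.cuda.memory_allocated(device)  # bytes used by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    fraction = unused_cached_fraction(allocated, reserved)
    return (f"allocated={allocated / 2**20:.0f} MiB "
            f"reserved={reserved / 2**20:.0f} MiB "
            f"unused-cached={fraction:.0%}")
```

If the reserved figure keeps growing within an epoch while the allocated figure stays roughly flat, that is consistent with fragmentation rather than a leak. In PL 1.x the flag mentioned above is a Trainer argument, i.e. Trainer(move_metrics_to_cpu=True).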
Yeah, I figured that something like that might be the reason: the fact that it clearly accumulates and resets per epoch definitely points this way :) Thanks for dealing with that :)
Hi, I seem to have the same problem, even though I am using the latest kraken version. Perhaps I am too ambitious with my 677 XML files. Should I use the binary format to avoid this error? Any ideas? Thanks
Configuration: kraken-4.2.1.dev82-py3.8.egg-info
ketos -v train -o 2022-12-16-NP-Model -f page -t allxmls.xmllist --device cuda:0 -u NFD -p 0.8 --workers 8 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001
Logs:
[12/16/22 10:53:51] INFO Parsing 677 XML files for training train.py:200
data
WARNING Region eScdummyblock without xml.py:242
coordinates
[12/16/22 10:53:52] WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
INFO TextLine eSc_line_a03b9996 without xml.py:269
polygon
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
[12/16/22 10:53:53] WARNING Region eScdummyblock without xml.py:242
coordinates
[12/16/22 10:53:55] WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
WARNING Region eScdummyblock without xml.py:242
coordinates
[12/16/22 10:53:59] WARNING No boundary given for line train.py:56
[12/16/22 10:54:01] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:54:04] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:54:14] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:54:18] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:54:21] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:54:34] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:57:07] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 10:57:18] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 11:08:47] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 11:15:07] WARNING Text line "" is empty after train.py:56
transformations
[12/16/22 11:15:21] INFO No explicit validation data provided. train.py:299
Splitting off 6438 (of 32189) samples
to validation set. (Will disable
alphabet mismatch detection.)
INFO Training set 25751 lines, validation train.py:308
set 6438 lines, alphabet 100 symbols
INFO grapheme count train.py:319
INFO SPACE 191642 train.py:324
INFO e 135169 train.py:324
INFO i 121409 train.py:324
INFO a 102716 train.py:324
INFO t 86202 train.py:324
INFO n 85035 train.py:324
INFO s 81056 train.py:324
INFO r 77410 train.py:324
INFO o 76629 train.py:324
INFO u 68132 train.py:324
INFO m 50032 train.py:324
INFO c 41153 train.py:324
INFO l 39492 train.py:324
INFO d 36762 train.py:324
INFO , 26694 train.py:324
INFO p 24221 train.py:324
INFO b 13685 train.py:324
INFO v 13234 train.py:324
INFO f 10970 train.py:324
INFO g 10831 train.py:324
INFO . 9800 train.py:324
INFO q 9215 train.py:324
INFO h 7981 train.py:324
INFO I 7348 train.py:324
INFO S 6279 train.py:324
INFO z 6186 train.py:324
INFO E 5479 train.py:324
INFO ; 4969 train.py:324
INFO x 4435 train.py:324
INFO C 4118 train.py:324
INFO M 3814 train.py:324
INFO D 3589 train.py:324
INFO A 3436 train.py:324
INFO G 2903 train.py:324
INFO F 2530 train.py:324
INFO V 2480 train.py:324
INFO X 2355 train.py:324
INFO L 2164 train.py:324
INFO B 1845 train.py:324
INFO P 1838 train.py:324
INFO T 1365 train.py:324
INFO N 1234 train.py:324
INFO R 1073 train.py:324
INFO : 1033 train.py:324
INFO O 950 train.py:324
INFO y 820 train.py:324
INFO ' 648 train.py:324
INFO j 638 train.py:324
INFO H 510 train.py:324
INFO COMBINING ACUTE ACCENT 461 train.py:324
INFO Q 299 train.py:324
INFO U 290 train.py:324
INFO k 285 train.py:324
INFO + 277 train.py:324
INFO ’ 208 train.py:324
INFO 168 train.py:324
INFO Z 167 train.py:324
INFO J 149 train.py:324
INFO COMBINING GRAVE ACCENT 142 train.py:324
INFO COMBINING TILDE 141 train.py:324
INFO " 100 train.py:324
INFO COMBINING MACRON 64 train.py:324
INFO COMBINING CEDILLA 63 train.py:324
INFO ° 58 train.py:324
INFO / 42 train.py:324
INFO COMBINING OGONEK 41 train.py:324
INFO & 39 train.py:324
INFO W 36 train.py:324
INFO K 26 train.py:324
INFO ꝑ 24 train.py:324
INFO ? 22 train.py:324
INFO đ 21 train.py:324
INFO 4 20 train.py:324
INFO 2 20 train.py:324
INFO 3 20 train.py:324
INFO 1 20 train.py:324
INFO ꝓ 20 train.py:324
INFO w 19 train.py:324
INFO 5 17 train.py:324
INFO 6 16 train.py:324
INFO [ 12 train.py:324
INFO ] 12 train.py:324
INFO æ 11 train.py:324
INFO Y 9 train.py:324
INFO 9 8 train.py:324
INFO > 8 train.py:324
INFO 7 7 train.py:324
INFO 8 7 train.py:324
INFO 0 7 train.py:324
INFO - 5 train.py:324
INFO ( 5 train.py:324
INFO ) 5 train.py:324
INFO • 4 train.py:324
INFO COMBINING CIRCUMFLEX ACCENT 4 train.py:324
INFO ꝝ 4 train.py:324
INFO ꝙ 4 train.py:324
INFO COMBINING DIAERESIS 3 train.py:324
INFO ꝛ 2 train.py:324
INFO < 1 train.py:324
INFO # 1 train.py:324
INFO Encoding training set train.py:337
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
[12/16/22 11:15:38] INFO Creating new model [1,120,0,1 train.py:475
Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32
Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2
Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200
Do0.1,2 Lbx200 Do.1,2 Lbx200 Do] with
101 outputs
[12/16/22 11:15:54] INFO Setting seg_type to baselines. train.py:495
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/tr │
│ ainer/call.py:38 in _call_and_handle_interrupt │
│ │
│ 35 │ │ if trainer.strategy.launcher is not None: │
│ 36 │ │ │ return trainer.strategy.launcher.launch(trainer_fn, args, │
│ 37 │ │ else: │
│ ❱ 38 │ │ │ return trainer_fn(*args, **kwargs) │
│ 39 │ │
│ 40 │ except _TunerExitException: │
│ 41 │ │ trainer._call_teardown_hook() │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/tr │
│ ainer/trainer.py:624 in _fit_impl │
│ │
│ 621 │ │ │ model_provided=True, │
│ 622 │ │ │ model_connected=self.lightning_module is not None, │
│ 623 │ │ ) │
│ ❱ 624 │ │ self._run(model, ckpt_path=self.ckpt_path) │
│ 625 │ │ │
│ 626 │ │ assert self.state.stopped │
│ 627 │ │ self.training = False │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/tr │
│ ainer/trainer.py:1046 in _run │
│ │
│ 1043 │ │ │
│ 1044 │ │ # hook │
│ 1045 │ │ if self.state.fn == TrainerFn.FITTING: │
│ ❱ 1046 │ │ │ self._call_callback_hooks("on_fit_start") │
│ 1047 │ │ │ self._call_lightning_module_hook("on_fit_start") │
│ 1048 │ │ │
│ 1049 │ │ self._log_hyperparams() │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/tr │
│ ainer/trainer.py:1343 in _call_callback_hooks │
│ │
│ 1340 │ │ │ fn = getattr(callback, hook_name) │
│ 1341 │ │ │ if callable(fn): │
│ 1342 │ │ │ │ with self.profiler.profile(f"[Callback]{callback.stat │
│ ❱ 1343 │ │ │ │ │ fn(self, self.lightning_module, *args, **kwargs) │
│ 1344 │ │ │
│ 1345 │ │ if pl_module: │
│ 1346 │ │ │ # restore current_fx when nested context │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/ca │
│ llbacks/model_summary.py:59 in on_fit_start │
│ │
│ 56 │ │ if not self._max_depth: │
│ 57 │ │ │ return None │
│ 58 │ │ │
│ ❱ 59 │ │ model_summary = self._summary(trainer, pl_module) │
│ 60 │ │ summary_data = model_summary._get_summary_data() │
│ 61 │ │ total_parameters = model_summary.total_parameters │
│ 62 │ │ trainable_parameters = model_summary.trainable_parameters │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/ca │
│ llbacks/model_summary.py:73 in _summary │
│ │
│ 70 │ │ │
│ 71 │ │ if isinstance(trainer.strategy, DeepSpeedStrategy) and trainer. │
│ 72 │ │ │ return DeepSpeedSummary(pl_module, max_depth=self._max_dept │
│ ❱ 73 │ │ return summarize(pl_module, max_depth=self._max_depth) │
│ 74 │ │
│ 75 │ @staticmethod │
│ 76 │ def summarize( │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/ut │
│ ilities/model_summary/model_summary.py:431 in summarize │
│ │
│ 428 │ Return: │
│ 429 │ │ The model summary object │
│ 430 │ """ │
│ ❱ 431 │ return ModelSummary(lightning_module, max_depth=max_depth) │
│ 432 │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/ut │
│ ilities/model_summary/model_summary.py:189 in __init__ │
│ │
│ 186 │ │ │ raise ValueError(f"`max_depth` can be -1, 0 or > 0, got {m │
│ 187 │ │ │
│ 188 │ │ self._max_depth = max_depth │
│ ❱ 189 │ │ self._layer_summary = self.summarize() │
│ 190 │ │ # 1 byte -> 8 bits │
│ 191 │ │ # TODO: how do we compute precision_megabytes in case of mixed │
│ 192 │ │ precision = self._model.precision if isinstance(self._model.pr │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/ut │
│ ilities/model_summary/model_summary.py:246 in summarize │
│ │
│ 243 │ def summarize(self) -> Dict[str, LayerSummary]: │
│ 244 │ │ summary = OrderedDict((name, LayerSummary(module)) for name, m │
│ 245 │ │ if self._model.example_input_array is not None: │
│ ❱ 246 │ │ │ self._forward_example_input() │
│ 247 │ │ for layer in summary.values(): │
│ 248 │ │ │ layer.detach_hook() │
│ 249 │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/ut │
│ ilities/model_summary/model_summary.py:278 in _forward_example_input │
│ │
│ 275 │ │ │ elif isinstance(input_, dict): │
│ 276 │ │ │ │ model(**input_) │
│ 277 │ │ │ else: │
│ ❱ 278 │ │ │ │ model(input_) │
│ 279 │ │ model.train(mode) # restore mode of module │
│ 280 │ │
│ 281 │ def _get_summary_data(self) -> List[Tuple[str, List[str]]]: │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self. │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/kraken/lib/train.py: │
│ 368 in forward │
│ │
│ 365 │ │ return dataset │
│ 366 │ │
│ 367 │ def forward(self, x, seq_lens=None): │
│ ❱ 368 │ │ return self.net(x, seq_lens) │
│ 369 │ │
│ 370 │ def training_step(self, batch, batch_idx): │
│ 371 │ │ input, target = batch['image'], batch['target'] │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:1148 in _call_impl │
│ │
│ 1145 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1146 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1147 │ │ │
│ ❱ 1148 │ │ result = forward_call(*input, **kwargs) │
│ 1149 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1150 │ │ │ for hook in (_global_forward_hooks.values(), self._forw │
│ 1151 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/kraken/lib/layers.py │
│ :27 in forward │
│ │
│ 24 │ def forward(self, *inputs): │
│ 25 │ │ for module in self._modules.values(): │
│ 26 │ │ │ if type(inputs) == tuple: │
│ ❱ 27 │ │ │ │ inputs = module(*inputs) │
│ 28 │ │ │ else: │
│ 29 │ │ │ │ inputs = module(inputs) │
│ 30 │ │ return inputs │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:1148 in _call_impl │
│ │
│ 1145 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1146 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1147 │ │ │
│ ❱ 1148 │ │ result = forward_call(*input, **kwargs) │
│ 1149 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1150 │ │ │ for hook in (_global_forward_hooks.values(), self._forw │
│ 1151 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/kraken/lib/layers.py │
│ :774 in forward │
│ │
│ 771 │ │ │ │ │ │ │ │ stride=stride, padding=self.padding) │
│ 772 │ │
│ 773 │ def forward(self, inputs: torch.Tensor, seq_len: Optional[torch.Te │
│ ❱ 774 │ │ o = self.co(inputs) │
│ 775 │ │ # return logits for sigmoid activation during training │
│ 776 │ │ if not (self.nl_name == 'SIGMOID' and self.training): │
│ 777 │ │ │ o = self.nl(o) │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:1148 in _call_impl │
│ │
│ 1145 │ │ │ bw_hook = hooks.BackwardHook(self, full_backward_hooks) │
│ 1146 │ │ │ input = bw_hook.setup_input_hook(input) │
│ 1147 │ │ │
│ ❱ 1148 │ │ result = forward_call(*input, **kwargs) │
│ 1149 │ │ if _global_forward_hooks or self._forward_hooks: │
│ 1150 │ │ │ for hook in (_global_forward_hooks.values(), self._forw │
│ 1151 │ │ │ │ hook_result = hook(self, input, result) │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/con │
│ v.py:457 in forward │
│ │
│ 454 │ │ │ │ │ │ self.padding, self.dilation, self.groups) │
│ 455 │ │
│ 456 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 457 │ │ return self._conv_forward(input, self.weight, self.bias) │
│ 458 │
│ 459 class Conv3d(_ConvNd): │
│ 460 │ __doc__ = r"""Applies a 3D convolution over an input signal compo │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/con │
│ v.py:453 in _conv_forward │
│ │
│ 450 │ │ │ return F.conv2d(F.pad(input, self._reversed_padding_repea │
│ 451 │ │ │ │ │ │ │ weight, bias, self.stride, │
│ 452 │ │ │ │ │ │ │ _pair(0), self.dilation, self.groups) │
│ ❱ 453 │ │ return F.conv2d(input, weight, bias, self.stride, │
│ 454 │ │ │ │ │ │ self.padding, self.dilation, self.groups) │
│ 455 │ │
│ 456 │ def forward(self, input: Tensor) -> Tensor: │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/home/p/pp/.local/bin/ketos:10 in `Trainer.fit()` requires a `LightningMo │
│ 581 │ │ self.strategy._lightning_module = model │
│ ❱ 582 │ │ call._call_and_handle_interrupt( │
│ 583 │ │ │ self, self._fit_impl, model, train_dataloaders, val_datal │
│ 584 │ │ ) │
│ 585 │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/tr │
│ ainer/call.py:63 in _call_and_handle_interrupt │
│ │
│ 60 │ │ trainer._call_callback_hooks("on_exception", exception) │
│ 61 │ │ for logger in trainer.loggers: │
│ 62 │ │ │ logger.finalize("failed") │
│ ❱ 63 │ │ trainer._teardown() │
│ 64 │ │ # teardown might access the stage so we reset it after │
│ 65 │ │ trainer.state.stage = None │
│ 66 │ │ raise │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/tr │
│ ainer/trainer.py:1124 in _teardown │
│ │
│ 1121 │ def _teardown(self) -> None: │
│ 1122 │ │ """This is the Trainer's internal teardown, unrelated to the │
│ 1123 │ │ Callback; those are handled by :meth:`_call_teardown_hook`."" │
│ ❱ 1124 │ │ self.strategy.teardown() │
│ 1125 │ │ loop = self._active_loop │
│ 1126 │ │ # loop should never be `None` here but it can because we don' │
│ 1127 │ │ if loop is not None: │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/pytorch_lightning/st │
│ rategies/strategy.py:496 in teardown │
│ │
│ 493 │ │ │
│ 494 │ │ if self.lightning_module is not None: │
│ 495 │ │ │ log.detail(f"{self.__class__.__name__}: moving model to CP │
│ ❱ 496 │ │ │ self.lightning_module.cpu() │
│ 497 │ │ self.precision_plugin.teardown() │
│ 498 │ │ assert self.accelerator is not None │
│ 499 │ │ self.accelerator.teardown() │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/lightning_lite/utili │
│ ties/device_dtype_mixin.py:78 in cpu │
│ │
│ 75 │ def cpu(self) -> Self: # type: ignore[valid-type] │
│ 76 │ │ """See :meth:`torch.nn.Module.cpu`.""" │
│ 77 │ │ self.__update_properties(device=torch.device("cpu")) │
│ ❱ 78 │ │ return super().cpu() │
│ 79 │ │
│ 80 │ def type(self, dst_type: Union[str, torch.dtype]) -> Self: # type │
│ 81 │ │ """See :meth:`torch.nn.Module.type`.""" │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:738 in cpu │
│ │
│ 735 │ │ Returns: │
│ 736 │ │ │ Module: self │
│ 737 │ │ """ │
│ ❱ 738 │ │ return self._apply(lambda t: t.cpu()) │
│ 739 │ │
│ 740 │ def type(self: T, dst_type: Union[dtype, str]) -> T: │
│ 741 │ │ r"""Casts all parameters and buffers to :attr:`dst_type`. │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:579 in _apply │
│ │
│ 576 │ │
│ 577 │ def _apply(self, fn): │
│ 578 │ │ for module in self.children(): │
│ ❱ 579 │ │ │ module._apply(fn) │
│ 580 │ │ │
│ 581 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 582 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:579 in _apply │
│ │
│ 576 │ │
│ 577 │ def _apply(self, fn): │
│ 578 │ │ for module in self.children(): │
│ ❱ 579 │ │ │ module._apply(fn) │
│ 580 │ │ │
│ 581 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 582 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:579 in _apply │
│ │
│ 576 │ │
│ 577 │ def _apply(self, fn): │
│ 578 │ │ for module in self.children(): │
│ ❱ 579 │ │ │ module._apply(fn) │
│ 580 │ │ │
│ 581 │ │ def compute_should_use_set_data(tensor, tensor_applied): │
│ 582 │ │ │ if torch._has_compatible_shallow_copy_type(tensor, tensor │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:602 in _apply │
│ │
│ 599 │ │ │ # track autograd history of `param_applied`, so we have t │
│ 600 │ │ │ # `with torch.no_grad():` │
│ 601 │ │ │ with torch.no_grad(): │
│ ❱ 602 │ │ │ │ param_applied = fn(param) │
│ 603 │ │ │ should_use_set_data = compute_should_use_set_data(param, │
│ 604 │ │ │ if should_use_set_data: │
│ 605 │ │ │ │ param.data = param_applied │
│ │
│ /usr/home/p/pp/.local/lib/python3.8/site-packages/torch/nn/modules/mod │
│ ule.py:738 in <lambda> │
│ │
│ 735 │ │ Returns: │
│ 736 │ │ │ Module: self │
│ 737 │ │ """ │
│ ❱ 738 │ │ return self._apply(lambda t: t.cpu()) │
│ 739 │ │
│ 740 │ def type(self: T, dst_type: Union[dtype, str]) -> T: │
│ 741 │ │ r"""Casts all parameters and buffers to :attr:`dst_type`. │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: an illegal memory access was encountered
Accessing unallocated memory is a bug inside pytorch that isn't really connected to kraken. Which pytorch version are you using?
You'll definitely see much higher performance with binary datasets in any case though.
pytorch_lightning-1.8.3.post1.dist-info
That's just the abstraction library. The actual pytorch version you can get with:
pip list | grep torch
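The same information can also be read from inside Python, which avoids mixing up environments when several interpreters are installed. This is only a sketch using the standard library; `installed_versions` is a hypothetical helper name, and the package list is just the set of distributions discussed in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    names = ["torch", "torchvision", "torchaudio", "pytorch-lightning"]
    for name, version in installed_versions(names).items():
        print(f"{name:20} {version or 'not installed'}")
```

Run it with the same interpreter that runs ketos so that the versions reported are the ones actually used for training.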
pytorch-lightning 1.8.3.post1 torch 1.12.1+cu116 torchaudio 0.12.1+cu116 torchmetrics 0.11.0 torchvision 0.13.1+cu116
Hmm, looks OK (no weird nightly builds or anything like that). You might try to upgrade the torch version to the latest stable 1.13.1 (pip install -U torch) and see if the error disappears.
it runs !!!!
pytorch-lightning 1.8.3.post1 torch 1.13.1 torchaudio 0.12.1+cu116 torchmetrics 0.11.0 torchvision 0.13.1+cu116
Perhaps this is related: during the installation, I got this message: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. torchvision 0.13.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible. torchaudio 0.12.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible." and decided to downgrade the torch library. Thanks.
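The conflict pip reports here comes from exact pins: each torchvision and torchaudio release requires one specific torch version. As an illustration only, the check can be sketched as below; `pin_conflicts` is a hypothetical helper, it only understands exact `==` pins, and the input dictionaries are filled in by hand from the versions quoted in this thread rather than read from the environment.

```python
import re

# Matches exact pins such as "torch==1.12.1"; other specifiers are ignored.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;,]+)")


def pin_conflicts(installed, requirements):
    """Return (package, requirement, found) for every unmet exact pin."""
    conflicts = []
    for package, reqs in requirements.items():
        for req in reqs:
            match = PIN.match(req)
            if not match:
                continue  # only exact pins are checked in this sketch
            name, wanted = match.groups()
            found = installed.get(name)
            # Drop a local build tag such as "+cu116" before comparing.
            if found is not None and found.split("+")[0] != wanted:
                conflicts.append((package, req, found))
    return conflicts


installed = {"torch": "1.13.1",
             "torchvision": "0.13.1+cu116",
             "torchaudio": "0.12.1+cu116"}
requirements = {"torchvision": ["torch==1.12.1"],
                "torchaudio": ["torch==1.12.1"]}

for package, req, found in pin_conflicts(installed, requirements):
    print(f"{package} requires {req}, but {found} is installed")
```

Upgrading the three packages together (pip install -U torch torchvision torchaudio) is usually the simpler way to keep the pins satisfied than upgrading torch alone.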
Hey @mittagessen, it seems there is a new issue since you moved to Pytorch Lightning. I have seen CUDA run out of memory multiple times, specifically on large datasets (basically, the CUDA memory fills up as training moves forward in the epoch). Did you see anything similar?
I must note that I have seen the same issue with another tool when moving from non-PL to PL code: the memory fills up faster and further than before, despite using the exact same architecture and data.