mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Cuda "Memory leak" ? #399

Closed PonteIneptique closed 1 year ago

PonteIneptique commented 2 years ago

Hey @mittagessen, it seems there is a new issue since the move to PyTorch Lightning. I have seen CUDA memory run out multiple times, specifically on large datasets (basically the CUDA memory gradually fills up as the epoch progresses). Have you seen anything similar?

I should note that when moving from non-PL code to PL I have seen the same issue with another tool: memory fills up faster / further than before, despite using the exact same architecture and data.
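For context, this is roughly how I watch allocator usage between steps (plain torch.cuda calls; the helper below is only an illustration, not part of kraken's training loop):

import torch

def log_cuda_memory(tag: str) -> None:
    # memory_allocated: memory held by live tensors;
    # memory_reserved: what the caching allocator has grabbed from the driver
    # (the gap between the two is cached / fragmented blocks).
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

# Called e.g. once per batch: "allocated" staying flat while "reserved" keeps
# climbing points at fragmentation rather than a true leak.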

PonteIneptique commented 2 years ago

I should add that this specifically happens with Arrow datasets. I have not checked whether it happens with other dataset formats (and the issue I saw outside of PL might be unrelated).

mittagessen commented 2 years ago

It's probably not a real memory leak but memory fragmentation, which causes pytorch's caching allocator to grab more memory in order to keep sufficiently large unused blocks available. I've seen that in the pretraining code. PL doesn't seem to move the metrics to CPU when logging by default, so that could be one of the reasons. I'll push a patched version setting move_metrics_to_cpu to True tomorrow. That might be all that is needed.
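For anyone following along, the Lightning setting in question looks roughly like this (a minimal sketch against the 1.x Trainer API, not the exact kraken wiring):

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator='gpu',
    devices=1,
    # keep logged metrics on the CPU instead of letting them accumulate on the
    # GPU over the epoch, which the default (move_metrics_to_cpu=False) allows
    move_metrics_to_cpu=True,
)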

PonteIneptique commented 2 years ago

Yeah, I figured that something like that might be the reason: the fact that it clearly accumulates and resets per epoch definitely points that way :) Thanks for dealing with it :)

particitae commented 1 year ago

Hi, I seem to have the same problem, but I am using the latest kraken version. Perhaps I am being too ambitious with my 677 XML files. Should I use the binary format to avoid this error? Any ideas? Thanks

Configuration: kraken-4.2.1.dev82-py3.8.egg-info

ketos -v train -o 2022-12-16-NP-Model -f page -t allxmls.xmllist --device cuda:0 -u NFD -p 0.8 --workers 8 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' -r 0.0001

Logs:

[12/16/22 10:53:51] INFO Parsing 677 XML files for training data (train.py:200)
WARNING Region eScdummyblock without coordinates (xml.py:242)
[12/16/22 10:53:52] WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
INFO TextLine eSc_line_a03b9996 without polygon (xml.py:269)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
[12/16/22 10:53:53] WARNING Region eScdummyblock without coordinates (xml.py:242)
[12/16/22 10:53:55] WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
WARNING Region eScdummyblock without coordinates (xml.py:242)
[12/16/22 10:53:59] WARNING No boundary given for line (train.py:56)
[12/16/22 10:54:01] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:54:04] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:54:14] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:54:18] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:54:21] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:54:34] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:57:07] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 10:57:18] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 11:08:47] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 11:15:07] WARNING Text line "" is empty after transformations (train.py:56)
[12/16/22 11:15:21] INFO No explicit validation data provided. Splitting off 6438 (of 32189) samples to validation set. (Will disable alphabet mismatch detection.) (train.py:299)
INFO Training set 25751 lines, validation set 6438 lines, alphabet 100 symbols (train.py:308)
INFO grapheme count (train.py:319):
SPACE 191642, e 135169, i 121409, a 102716, t 86202, n 85035, s 81056, r 77410, o 76629, u 68132,
m 50032, c 41153, l 39492, d 36762, , 26694, p 24221, b 13685, v 13234, f 10970, g 10831,
. 9800, q 9215, h 7981, I 7348, S 6279, z 6186, E 5479, ; 4969, x 4435, C 4118,
M 3814, D 3589, A 3436, G 2903, F 2530, V 2480, X 2355, L 2164, B 1845, P 1838,
T 1365, N 1234, R 1073, : 1033, O 950, y 820, ' 648, j 638, H 510, COMBINING ACUTE ACCENT 461,
Q 299, U 290, k 285, + 277, ’ 208,  168, Z 167, J 149, COMBINING GRAVE ACCENT 142, COMBINING TILDE 141,
" 100, COMBINING MACRON 64, COMBINING CEDILLA 63, ° 58, / 42, COMBINING OGONEK 41, & 39, W 36, K 26, ꝑ 24,
? 22, đ 21, 4 20, 2 20, 3 20, 1 20, ꝓ 20, w 19, 5 17, 6 16,
[ 12, ] 12, æ 11, Y 9, 9 8, > 8, 7 7, 8 7, 0 7, - 5,
( 5, ) 5, • 4, COMBINING CIRCUMFLEX ACCENT 4, ꝝ 4, ꝙ 4, COMBINING DIAERESIS 3, ꝛ 2, < 1, # 1 (train.py:324)
INFO Encoding training set (train.py:337)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch.
[12/16/22 11:15:38] INFO Creating new model [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do] with 101 outputs (train.py:475)
[12/16/22 11:15:54] INFO Setting seg_type to baselines. (train.py:495)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Traceback (most recent call last):
  pytorch_lightning/trainer/call.py:38, in _call_and_handle_interrupt
  pytorch_lightning/trainer/trainer.py:624, in _fit_impl
  pytorch_lightning/trainer/trainer.py:1046, in _run
  pytorch_lightning/trainer/trainer.py:1343, in _call_callback_hooks
  pytorch_lightning/callbacks/model_summary.py:59, in on_fit_start
  pytorch_lightning/callbacks/model_summary.py:73, in _summary
  pytorch_lightning/utilities/model_summary/model_summary.py:431, in summarize
  pytorch_lightning/utilities/model_summary/model_summary.py:189, in __init__
  pytorch_lightning/utilities/model_summary/model_summary.py:246, in summarize
  pytorch_lightning/utilities/model_summary/model_summary.py:278, in _forward_example_input
  torch/nn/modules/module.py:1130, in _call_impl
  kraken/lib/train.py:368, in forward
  torch/nn/modules/module.py:1148, in _call_impl
  kraken/lib/layers.py:27, in forward
  torch/nn/modules/module.py:1148, in _call_impl
  kraken/lib/layers.py:774, in forward
  torch/nn/modules/module.py:1148, in _call_impl
  torch/nn/modules/conv.py:457, in forward
  torch/nn/modules/conv.py:453, in _conv_forward
RuntimeError: CUDA error: an illegal memory access was encountered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  /usr/home/p/pp/.local/bin/ketos:10, in <module>
  click/core.py:1130, in __call__
  click/core.py:1055, in main
  click/core.py:1657, in invoke
  click/core.py:1404, in invoke
  click/core.py:760, in invoke
  click/decorators.py:26, in new_func
  kraken/ketos/recognition.py:282, in train
  kraken/lib/train.py:97, in fit
  pytorch_lightning/trainer/trainer.py:582, in fit
  pytorch_lightning/trainer/call.py:63, in _call_and_handle_interrupt
  pytorch_lightning/trainer/trainer.py:1124, in _teardown
  pytorch_lightning/strategies/strategy.py:496, in teardown
  lightning_lite/utilities/device_dtype_mixin.py:78, in cpu
  torch/nn/modules/module.py:738, in cpu
  torch/nn/modules/module.py:579, in _apply
  torch/nn/modules/module.py:579, in _apply
  torch/nn/modules/module.py:579, in _apply
  torch/nn/modules/module.py:602, in _apply
  torch/nn/modules/module.py:738, in <lambda>
RuntimeError: CUDA error: an illegal memory access was encountered

mittagessen commented 1 year ago

Accessing unallocated memory is a bug inside pytorch that isn't really connected to kraken. Which pytorch version are you using?

You'll definitely see much higher performance with binary datasets in any case though.
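For reference, the binary-dataset route looks roughly like this (flags as I understand them from the ketos docs; check ketos compile --help on your installed version before relying on them):

ketos compile -f page -o dataset.arrow *.xml
ketos train -f binary -t dataset.arrow -o 2022-12-16-NP-Model --device cuda:0

Compiling once up front means the XML parsing and line extraction don't have to be repeated every epoch, which is where most of the speedup comes from.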

particitae commented 1 year ago

pytorch_lightning-1.8.3.post1.dist-info

mittagessen commented 1 year ago

That's just the abstraction library. You can get the actual pytorch version with:

pip list | grep torch

particitae commented 1 year ago

pytorch-lightning  1.8.3.post1
torch              1.12.1+cu116
torchaudio         0.12.1+cu116
torchmetrics       0.11.0
torchvision        0.13.1+cu116

mittagessen commented 1 year ago

Hmm, looks OK (no weird nightly builds or anything like that). You might try to upgrade the torch version to the latest stable 1.13.1 (pip install -U torch) and see if the error disappears.

particitae commented 1 year ago

it runs !!!!

pytorch-lightning  1.8.3.post1
torch              1.13.1
torchaudio         0.12.1+cu116
torchmetrics       0.11.0
torchvision        0.13.1+cu116

One note: during the installation I got this message: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. torchvision 0.13.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible. torchaudio 0.12.1+cu116 requires torch==1.12.1, but you have torch 1.13.1 which is incompatible." and decided not to downgrade the torch library. Thanks
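For what it's worth, those resolver warnings only say that torchvision/torchaudio still pin torch 1.12.1; if they ever cause trouble, upgrading the companion packages together should clear them (standard pip usage, not kraken-specific):

pip install -U torch torchvision torchaudio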