mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Cannot train with Ketos as CUDA too old #658

Closed ask6155 closed 1 week ago

ask6155 commented 2 weeks ago

Hello,

I have an old GPU and my CUDA version is 11.4. I think it is not supported by the current PyTorch release. I'm running Kraken 5.2.9, and the command I'm running as a test is:

ketos train -d cpu -f alto Kraken\ test\ 1.xml

The output log is:

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
┏━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name      ┃ Type                  ┃ Params ┃              In sizes ┃             Out sizes ┃
┡━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ val_cer   │ CharErrorRate         │      0 │                     ? │                     ? │
│ 1  │ val_wer   │ WordErrorRate         │      0 │                     ? │                     ? │
│ 2  │ net       │ MultiParamSequential  │  4.0 M │    [[1, 1, 120, 400], │ [[1, 40, 1, 50], '?'] │
│    │           │                       │        │                  '?'] │                       │
│ 3  │ net.C_0   │ ActConv2D             │  1.3 K │    [[1, 1, 120, 400], │   [[1, 32, 120, 400], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 4  │ net.Do_1  │ Dropout               │      0 │   [[1, 32, 120, 400], │   [[1, 32, 120, 400], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 5  │ net.Mp_2  │ MaxPool               │      0 │   [[1, 32, 120, 400], │    [[1, 32, 60, 200], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 6  │ net.C_3   │ ActConv2D             │ 40.0 K │    [[1, 32, 60, 200], │    [[1, 32, 60, 200], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 7  │ net.Do_4  │ Dropout               │      0 │    [[1, 32, 60, 200], │    [[1, 32, 60, 200], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 8  │ net.Mp_5  │ MaxPool               │      0 │    [[1, 32, 60, 200], │    [[1, 32, 30, 100], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 9  │ net.C_6   │ ActConv2D             │ 55.4 K │    [[1, 32, 30, 100], │    [[1, 64, 30, 100], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 10 │ net.Do_7  │ Dropout               │      0 │    [[1, 64, 30, 100], │    [[1, 64, 30, 100], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 11 │ net.Mp_8  │ MaxPool               │      0 │    [[1, 64, 30, 100], │     [[1, 64, 15, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 12 │ net.C_9   │ ActConv2D             │  110 K │     [[1, 64, 15, 50], │     [[1, 64, 15, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 13 │ net.Do_10 │ Dropout               │      0 │     [[1, 64, 15, 50], │     [[1, 64, 15, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 14 │ net.S_11  │ Reshape               │      0 │     [[1, 64, 15, 50], │     [[1, 960, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 15 │ net.L_12  │ TransposedSummarizin… │  1.9 M │     [[1, 960, 1, 50], │     [[1, 400, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 16 │ net.Do_13 │ Dropout               │      0 │     [[1, 400, 1, 50], │     [[1, 400, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 17 │ net.L_14  │ TransposedSummarizin… │  963 K │     [[1, 400, 1, 50], │     [[1, 400, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 18 │ net.Do_15 │ Dropout               │      0 │     [[1, 400, 1, 50], │     [[1, 400, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 19 │ net.L_16  │ TransposedSummarizin… │  963 K │     [[1, 400, 1, 50], │     [[1, 400, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 20 │ net.Do_17 │ Dropout               │      0 │     [[1, 400, 1, 50], │     [[1, 400, 1, 50], │
│    │           │                       │        │             '?', '?'] │                  '?'] │
│ 21 │ net.O_18  │ LinSoftmax            │ 16.0 K │     [[1, 400, 1, 50], │ [[1, 40, 1, 50], '?'] │
│    │           │                       │        │             '?', '?'] │                       │
└────┴───────────┴───────────────────────┴────────┴───────────────────────┴───────────────────────┘
Trainable params: 4.0 M                                                                            
Non-trainable params: 0                                                                            
Total params: 4.0 M                                                                                
Total estimated model params size (MB): 16                                                         
╭─────────────────────────────── Traceback (most recent call last) ───────────────────────────────╮
│ main_folder/kraken/real/bin/ketos:8 in <module>                                      │
│                                                                                                 │
│   5 from kraken.ketos import cli                                                                │
│   6 if __name__ == '__main__':                                                                  │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                        │
│ ❱ 8 │   sys.exit(cli())                                                                         │
│   9                                                                                             │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/click/core.py:1157 in __call__  │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/click/core.py:1078 in main      │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/click/core.py:1688 in invoke    │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/click/core.py:1434 in invoke    │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/click/core.py:783 in invoke     │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/click/decorators.py:33 in       │
│ new_func                                                                                        │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/kraken/ketos/recognition.py:329 │
│ in train                                                                                        │
│                                                                                                 │
│   326 │   │   │   │   │   │   │   **val_check_interval)                                         │
│   327 │   try:                                                                                  │
│   328 │   │   with threadpool_limits(limits=threads):                                           │
│ ❱ 329 │   │   │   trainer.fit(model)                                                            │
│   330 │   except KrakenInputException as e:                                                     │
│   331 │   │   if e.args[0].startswith('Training data and model codec alphabets mismatch') and   │
│   332 │   │   │   raise click.BadOptionUsage('resize', 'Mismatched training data for loaded mo  │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/kraken/lib/train.py:129 in fit  │
│                                                                                                 │
│    126 │   │   with warnings.catch_warnings():                                                  │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,               │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')                           │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                                                 │
│    130                                                                                          │
│    131                                                                                          │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                                              │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/pytorch/trainer/train │
│ er.py:544 in fit                                                                                │
│                                                                                                 │
│    541 │   │   self.state.fn = TrainerFn.FITTING                                                │
│    542 │   │   self.state.status = TrainerStatus.RUNNING                                        │
│    543 │   │   self.training = True                                                             │
│ ❱  544 │   │   call._call_and_handle_interrupt(                                                 │
│    545 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, │
│    546 │   │   )                                                                                │
│    547                                                                                          │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/pytorch/trainer/call. │
│ py:44 in _call_and_handle_interrupt                                                             │
│                                                                                                 │
│    41 │   try:                                                                                  │
│    42 │   │   if trainer.strategy.launcher is not None:                                         │
│    43 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,   │
│ ❱  44 │   │   return trainer_fn(*args, **kwargs)                                                │
│    45 │                                                                                         │
│    46 │   except _TunerExitException:                                                           │
│    47 │   │   _call_teardown_hook(trainer)                                                      │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/pytorch/trainer/train │
│ er.py:580 in _fit_impl                                                                          │
│                                                                                                 │
│    577 │   │   │   model_provided=True,                                                         │
│    578 │   │   │   model_connected=self.lightning_module is not None,                           │
│    579 │   │   )                                                                                │
│ ❱  580 │   │   self._run(model, ckpt_path=ckpt_path)                                            │
│    581 │   │                                                                                    │
│    582 │   │   assert self.state.stopped                                                        │
│    583 │   │   self.training = False                                                            │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/pytorch/trainer/train │
│ er.py:987 in _run                                                                               │
│                                                                                                 │
│    984 │   │   # ----------------------------                                                   │
│    985 │   │   # RUN THE TRAINER                                                                │
│    986 │   │   # ----------------------------                                                   │
│ ❱  987 │   │   results = self._run_stage()                                                      │
│    988 │   │                                                                                    │
│    989 │   │   # ----------------------------                                                   │
│    990 │   │   # POST-Training CLEAN UP                                                         │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/pytorch/trainer/train │
│ er.py:1030 in _run_stage                                                                        │
│                                                                                                 │
│   1027 │   │   if self.predicting:                                                              │
│   1028 │   │   │   return self.predict_loop.run()                                               │
│   1029 │   │   if self.training:                                                                │
│ ❱ 1030 │   │   │   with isolate_rng():                                                          │
│   1031 │   │   │   │   self._run_sanity_check()                                                 │
│   1032 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                │
│   1033 │   │   │   │   self.fit_loop.run()                                                      │
│                                                                                                 │
│ main_folder/kraken/python/lib/python3.11/contextlib.py:137 in __enter__              │
│                                                                                                 │
│   134 │   │   # they are only needed for recreation, which is not possible anymore              │
│   135 │   │   del self.args, self.kwds, self.func                                               │
│   136 │   │   try:                                                                              │
│ ❱ 137 │   │   │   return next(self.gen)                                                         │
│   138 │   │   except StopIteration:                                                             │
│   139 │   │   │   raise RuntimeError("generator didn't yield") from None                        │
│   140                                                                                           │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/pytorch/utilities/see │
│ d.py:44 in isolate_rng                                                                          │
│                                                                                                 │
│   41 │   │   tensor([0.7576])                                                                   │
│   42 │                                                                                          │
│   43 │   """                                                                                    │
│ ❱ 44 │   states = _collect_rng_states(include_cuda)                                             │
│   45 │   yield                                                                                  │
│   46 │   _set_rng_states(states)                                                                │
│   47                                                                                            │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/lightning/fabric/utilities/seed │
│ .py:113 in _collect_rng_states                                                                  │
│                                                                                                 │
│   110 │   │   "python": python_get_rng_state(),                                                 │
│   111 │   }                                                                                     │
│   112 │   if include_cuda:                                                                      │
│ ❱ 113 │   │   states["torch.cuda"] = torch.cuda.get_rng_state_all() if torch.cuda.is_available  │
│   114 │   return states                                                                         │
│   115                                                                                           │
│   116                                                                                           │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/torch/cuda/random.py:47 in      │
│ get_rng_state_all                                                                               │
│                                                                                                 │
│    44 │                                                                                         │
│    45 │   results = []                                                                          │
│    46 │   for i in range(device_count()):                                                       │
│ ❱  47 │   │   results.append(get_rng_state(i))                                                  │
│    48 │   return results                                                                        │
│    49                                                                                           │
│    50                                                                                           │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/torch/cuda/random.py:30 in      │
│ get_rng_state                                                                                   │
│                                                                                                 │
│    27 │   .. warning::                                                                          │
│    28 │   │   This function eagerly initializes CUDA.                                           │
│    29 │   """                                                                                   │
│ ❱  30 │   _lazy_init()                                                                          │
│    31 │   if isinstance(device, str):                                                           │
│    32 │   │   device = torch.device(device)                                                     │
│    33 │   elif isinstance(device, int):                                                         │
│                                                                                                 │
│ main_folder/kraken/real/lib/python3.11/site-packages/torch/cuda/__init__.py:298 in   │
│ _lazy_init                                                                                      │
│                                                                                                 │
│    295 │   │   # are found or any other error occurs                                            │
│    296 │   │   if "CUDA_MODULE_LOADING" not in os.environ:                                      │
│    297 │   │   │   os.environ["CUDA_MODULE_LOADING"] = "LAZY"                                   │
│ ❱  298 │   │   torch._C._cuda_init()                                                            │
│    299 │   │   # Some of the queued calls may reentrantly call _lazy_init();                    │
│    300 │   │   # we need to just return without initializing in that case.                      │
│    301 │   │   # However, we must not let any *other* threads in!                               │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The NVIDIA driver on your system is too old (found version 11040). Please update your
GPU driver by downloading and installing a new version from the URL: 
http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a 
PyTorch version that has been compiled with your version of the CUDA driver.

Now, I don't really need to use CUDA if it's not possible, but for some reason I'm unable to make it use my CPU. I tried the -d cpu flag but it doesn't seem to work. Please guide me.
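For reference, the "found version 11040" in the error message looks like the usual CUDA integer version encoding (1000 * major + 10 * minor), which matches the CUDA 11.4 mentioned above. A minimal sketch of decoding it, assuming that encoding; the function name `decode_cuda_version` is made up for illustration and is not part of PyTorch or Kraken:

```python
def decode_cuda_version(encoded: int) -> tuple[int, int]:
    """Split a CUDA-style integer version into (major, minor).

    Assumes the conventional encoding 1000 * major + 10 * minor.
    """
    major, remainder = divmod(encoded, 1000)
    minor = remainder // 10
    return major, minor


if __name__ == "__main__":
    major, minor = decode_cuda_version(11040)
    print(f"driver supports CUDA {major}.{minor}")
```

So the number in the message is the highest CUDA version the installed driver supports, not a driver release number.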

mittagessen commented 2 weeks ago

It's not a CUDA issue as such but the version of your GPU driver. Are you sure there isn't a current version available for it? Alternatively, you can install a CPU-only PyTorch. If you use Anaconda, there is an environment file in the kraken repository, `environment.yml`, that installs without CUDA support. If you install with standard Python tools, just run:

pip3 install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

and ignore the warnings and everything will probably work.
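One way to confirm that the CPU-only build is the one that actually got installed is to look at `torch.version.cuda`, which is `None` for builds compiled without CUDA. A minimal sketch; the helper name `describe_torch_build` is hypothetical, and it returns `None` if torch cannot be imported at all:

```python
from typing import Optional


def describe_torch_build() -> Optional[dict]:
    """Report which PyTorch build is installed, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch_version": torch.__version__,
        # CUDA version the wheel was compiled against; None for builds without CUDA
        "built_with_cuda": torch.version.cuda,
        "cpu_only": torch.version.cuda is None,
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

This only inspects build metadata, so it should not trigger the CUDA initialization that fails in the traceback above.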

ask6155 commented 2 weeks ago

On PyTorch's website, the version of PyTorch that supports CUDA 11.4 is 1.12.1. I dunno if Kraken will work with it.

I ran the command and got an error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kraken 5.2.9 requires torch~=2.1.0, but you have torch 2.5.1+cpu which is incompatible.

But I'm able to train so I think it's okay.

mittagessen commented 2 weeks ago

On 24/11/07 08:05AM, ask6155 wrote:

> on Pytorch's website the version of pytorch that supports CUDA 11.4 is 1.12.1 I dunno if Kraken will work with it.

You can install a pytorch without CUDA support compiled from the website. Just select CPU as compute platform.

> I ran the command and got an error: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. kraken 5.2.9 requires torch~=2.1.0, but you have torch 2.5.1+cpu which is incompatible. But I'm able to train so I think it's okay.

Yes, as I said, pip will throw warnings at you but everything probably still works. If something doesn't work as expected, you can install the CPU-only version 2.1.x of PyTorch following the instructions here [0].

[0] https://pytorch.org/get-started/previous-versions/#v212
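The pip warning comes from the `~=` (compatible release) operator: `torch~=2.1.0` is equivalent to `>=2.1.0, ==2.1.*`, so 2.5.1 falls outside it while any 2.1.x does not. A simplified sketch of that rule for plain `X.Y.Z` versions; the function names are made up for illustration, pre-release tags are not handled, and in practice the `packaging` library implements the full rules:

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Return the numeric release segment, dropping a local label such as '+cpu'."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in public.split("."))


def satisfies_compatible_release(installed: str, spec: str) -> bool:
    """Simplified check of a '~=spec' requirement for plain X.Y.Z versions."""
    base = parse_release(spec)
    have = parse_release(installed)
    prefix = base[:-1]  # '~=2.1.0' pins the '2.1' prefix
    return have >= base and have[: len(prefix)] == prefix


if __name__ == "__main__":
    for candidate in ("2.5.1+cpu", "2.1.2+cpu"):
        ok = satisfies_compatible_release(candidate, "2.1.0")
        print(candidate, "satisfies torch~=2.1.0:", ok)
```

Under this rule the 2.5.1+cpu wheel is flagged as incompatible, and a 2.1.x CPU wheel would satisfy the requirement.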