mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Error trying to train segmenter on PAGE xml files exported from escriptorium #448

Closed: kabikaj closed this issue 1 year ago

kabikaj commented 1 year ago

I segmented and transcribed 6 JPG images in an RTL script, each with only one text region, in eScriptorium and exported them in PAGE format. This is one of the resulting XML files:

<?xml version="1.0" encoding="UTF-8"  standalone="yes"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>escriptorium</Creator>
    <Created>2023-02-14T13:02:43.300784+00:00</Created>
        <LastChange>2023-02-14T13:02:43.302566+00:00</LastChange>

  </Metadata>
  <Page imageFilename="f5.jpg" imageWidth="5781" imageHeight="7169">

    <TextRegion id="eSc_dummyblock_">

      <TextLine id="eSc_line_e823541c" >
        <Coords points="460,1260 460,1314 848,1356 854,1356 860,1356 866,1356 932,1314 950,1302 1434,1302 1464,1314 1542,1350 1548,1350 1554,1350 1560,1350 1644,1314 1703,1356 1709,1356 1715,1356 1721,1362 2158,1362 2158,1356 2164,1356 2259,1308 2265,1302 2696,1302 2720,1308 2797,1338 2803,1338 2809,1338 3001,1308 3078,1296 3108,1308 3192,1338 3198,1338 3204,1338 3557,1302 3670,1296 3724,1302 3951,1350 3957,1350 3963,1350 4125,1308 4561,1350 4567,1350 4800,1302 4866,1344 4872,1344 4878,1344 5248,1350 5254,1350 5260,1350 5404,1296 5404,1248 5392,1093 4997,1087 4985,1081 4513,1009 4507,1009 4370,1021 4364,1021 4358,1021 4286,1069 4256,1093 4041,1093 3676,1069 3658,1057 3539,991 3533,991 3168,991 3162,991 3156,991 3007,1039 2893,997 2887,997 2427,997 2421,997 2415,997 2343,1033 2229,1099 1817,1099 1608,1075 1536,1021 1506,997 1500,997 1494,997 1488,997 1339,1015 1249,1027 1165,1015 1070,997 1064,997 1058,997 1022,1009 956,1039 908,1009 890,997 884,997 878,997 872,997 807,1009 651,1033 454,1003 460,1260"/>
        <Baseline points="466,1261 5405,1253"/>
        <TextEquiv>
          <Unicode>GGGBM FBMA LKM BH ELM FLM BGA GW N FBMA LBS LKM</Unicode>
        </TextEquiv>
      </TextLine>

      <TextLine id="eSc_line_302c3c9f" >
        <Coords points="490,1463 484,1553 717,1577 723,1577 1123,1553 1554,1529 1626,1553 1709,1583 1715,1583 1721,1589 2116,1589 2116,1583 2313,1553 2409,1577 2415,1577 2421,1577 2427,1577 2433,1577 2469,1547 2492,1535 2552,1547 2654,1577 2660,1577 2666,1577 2672,1577 2779,1547 2797,1541 3437,1541 3461,1547 3527,1571 3533,1571 3539,1571 3802,1547 3873,1541 3897,1547 3999,1577 4005,1577 4011,1577 4208,1547 4214,1547 4226,1547 5284,1595 5290,1595 5296,1595 5392,1541 5398,1487 5392,1356 4944,1356 4938,1350 4848,1284 4842,1284 4836,1284 4627,1284 4621,1284 4615,1284 4537,1326 4435,1284 4429,1284 4160,1278 4154,1278 4148,1278 3981,1344 3748,1278 3742,1278 3569,1278 3563,1278 3557,1278 3551,1278 3455,1338 3156,1278 3150,1278 2929,1284 2923,1284 2785,1338 2690,1272 2684,1272 2678,1272 2504,1272 2498,1272 2492,1272 2486,1272 2427,1320 2283,1272 2277,1272 2271,1272 2265,1272 2086,1338 2074,1344 1769,1344 1757,1338 1661,1266 1655,1266 1650,1266 1644,1266 1638,1266 1632,1266 1530,1332 1524,1338 1416,1338 1410,1332 1309,1266 1303,1266 1297,1266 1010,1266 1004,1266 998,1266 992,1266 908,1332 902,1332 490,1332 490,1463"/>
        <Baseline points="494,1468 5400,1488"/>
        <TextEquiv>
          <Unicode>ELM W A LLH BELM W A BBM LA BELMW N MA KA N A BR</Unicode>
        </TextEquiv>
      </TextLine>
(...)

    </TextRegion>    
  </Page>
</PcGts>
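Note that the `eSc_dummyblock_` region has no `Coords` element of its own (only its `TextLine` children do), which is what the `Region eSc_dummyblock_ without coordinates` warnings during training refer to. As a quick sanity check (using only the standard library, not kraken's own parser, and a trimmed-down version of the file above as sample input), one can list the regions lacking coordinates:

```python
import xml.etree.ElementTree as ET

PAGE_NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}"

# Trimmed sample of the exported file: the TextRegion itself has no Coords.
sample = """<?xml version="1.0"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="f5.jpg" imageWidth="5781" imageHeight="7169">
    <TextRegion id="eSc_dummyblock_">
      <TextLine id="eSc_line_e823541c">
        <Coords points="460,1260 460,1314"/>
        <Baseline points="466,1261 5405,1253"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(sample)
# Collect ids of regions whose direct Coords child is missing.
missing = [r.get("id") for r in root.iter(f"{PAGE_NS}TextRegion")
           if r.find(f"{PAGE_NS}Coords") is None]
print(missing)  # prints ['eSc_dummyblock_']
```

eScriptorium emits this `eSc_dummyblock_` placeholder when lines were not assigned to a drawn region, so the warnings are expected for this export; they are distinct from the crash further down.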

Then I wanted to train a segmenter with these images, but I am getting an error. This is the command I used:

$ ketos segtrain -f page tmp_train/*.xml
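For context, the images are large (5781x7169 per the PAGE file above), and the failure further down is a DataLoader worker killed by the OS, which on Linux typically points to the out-of-memory killer. A back-of-the-envelope estimate of what one such image costs held uncompressed in memory (assuming 3 channels and float32 values; kraken's pipeline may rescale or use fewer channels, so this is only an upper-bound sketch, not kraken's actual behavior):

```python
# Rough per-image memory estimate at full resolution.
# Assumptions (not taken from kraken): 3 channels, float32 (4 bytes/value).
width, height = 5781, 7169          # dimensions from the PAGE file
channels, bytes_per_value = 3, 4
bytes_total = width * height * channels * bytes_per_value
print(f"{bytes_total / 2**30:.2f} GiB per uncompressed copy")  # prints "0.46 GiB per uncompressed copy"
```

With several dataloader workers each holding a copy (plus intermediate tensors), memory pressure adds up quickly on a CPU-only machine, which would be consistent with the `killed by signal: Killed` error below.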

And this is the error I get:

[02/20/23 14:01:02] WARNING  Region eSc_dummyblock_ without coordinates                                                                                                                     xml.py:242
                    WARNING  Region eSc_dummyblock_ without coordinates                                                                                                                     xml.py:242
                    WARNING  Region eSc_dummyblock_ without coordinates                                                                                                                     xml.py:242
                    WARNING  Region eSc_dummyblock_ without coordinates                                                                                                                     xml.py:242
                    WARNING  Region eSc_dummyblock_ without coordinates                                                                                                                     xml.py:242
                    WARNING  Region eSc_dummyblock_ without coordinates                                                                                                                     xml.py:242
Training line types:
  default   2   109
Training region types:
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
┏━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃    ┃ Name      ┃ Type                     ┃ Params ┃
┡━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0  │ net       │ MultiParamSequential     │  1.3 M │
│ 1  │ net.C_0   │ ActConv2D                │  9.5 K │
│ 2  │ net.Gn_1  │ GroupNorm                │    128 │
│ 3  │ net.C_2   │ ActConv2D                │ 73.9 K │
│ 4  │ net.Gn_3  │ GroupNorm                │    256 │
│ 5  │ net.C_4   │ ActConv2D                │  147 K │
│ 6  │ net.Gn_5  │ GroupNorm                │    256 │
│ 7  │ net.C_6   │ ActConv2D                │  295 K │
│ 8  │ net.Gn_7  │ GroupNorm                │    512 │
│ 9  │ net.C_8   │ ActConv2D                │  590 K │
│ 10 │ net.Gn_9  │ GroupNorm                │    512 │
│ 11 │ net.L_10  │ TransposedSummarizingRNN │ 74.2 K │
│ 12 │ net.L_11  │ TransposedSummarizingRNN │ 25.1 K │
│ 13 │ net.C_12  │ ActConv2D                │  2.1 K │
│ 14 │ net.Gn_13 │ GroupNorm                │     64 │
│ 15 │ net.L_14  │ TransposedSummarizingRNN │ 16.9 K │
│ 16 │ net.L_15  │ TransposedSummarizingRNN │ 25.1 K │
│ 17 │ net.l_16  │ ActConv2D                │    195 │
└────┴───────────┴──────────────────────────┴────────┘
Trainable params: 1.3 M                                                                                                                                                                               
Non-trainable params: 0                                                                                                                                                                               
Total params: 1.3 M                                                                                                                                                                                   
Total estimated model params size (MB): 5                                                                                                                                                             
Validation Sanity Check ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/1 -:--:-- 0:03:54  
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/alicia/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:1011 in        │
│ _try_get_data                                                                                    │
│                                                                                                  │
│   1008 │   │   # Returns a 2-tuple:                                                              │
│   1009 │   │   #   (bool: whether successfully get data, any: data if successful else None)      │
│   1010 │   │   try:                                                                              │
│ ❱ 1011 │   │   │   data = self._data_queue.get(timeout=timeout)                                  │
│   1012 │   │   │   return (True, data)                                                           │
│   1013 │   │   except Exception as e:                                                            │
│   1014 │   │   │   # At timeout and error, we manually check whether any worker has              │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/multiprocessing/queues.py:113 in get                        │
│                                                                                                  │
│   110 │   │   │   try:                                                                           │
│   111 │   │   │   │   if block:                                                                  │
│   112 │   │   │   │   │   timeout = deadline - time.monotonic()                                  │
│ ❱ 113 │   │   │   │   │   if not self._poll(timeout):                                            │
│   114 │   │   │   │   │   │   raise Empty                                                        │
│   115 │   │   │   │   elif not self._poll():                                                     │
│   116 │   │   │   │   │   raise Empty                                                            │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/multiprocessing/connection.py:262 in poll                   │
│                                                                                                  │
│   259 │   │   """Whether there is any input available to be read"""                              │
│   260 │   │   self._check_closed()                                                               │
│   261 │   │   self._check_readable()                                                             │
│ ❱ 262 │   │   return self._poll(timeout)                                                         │
│   263 │                                                                                          │
│   264 │   def __enter__(self):                                                                   │
│   265 │   │   return self                                                                        │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/multiprocessing/connection.py:429 in _poll                  │
│                                                                                                  │
│   426 │   │   return self._recv(size)                                                            │
│   427 │                                                                                          │
│   428 │   def _poll(self, timeout):                                                              │
│ ❱ 429 │   │   r = wait([self], timeout)                                                          │
│   430 │   │   return bool(r)                                                                     │
│   431                                                                                            │
│   432                                                                                            │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/multiprocessing/connection.py:936 in wait                   │
│                                                                                                  │
│   933 │   │   │   │   deadline = time.monotonic() + timeout                                      │
│   934 │   │   │                                                                                  │
│   935 │   │   │   while True:                                                                    │
│ ❱ 936 │   │   │   │   ready = selector.select(timeout)                                           │
│   937 │   │   │   │   if ready:                                                                  │
│   938 │   │   │   │   │   return [key.fileobj for (key, events) in ready]                        │
│   939 │   │   │   │   else:                                                                      │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/selectors.py:416 in select                                  │
│                                                                                                  │
│   413 │   │   │   timeout = math.ceil(timeout * 1e3)                                             │
│   414 │   │   ready = []                                                                         │
│   415 │   │   try:                                                                               │
│ ❱ 416 │   │   │   fd_event_list = self._selector.poll(timeout)                                   │
│   417 │   │   except InterruptedError:                                                           │
│   418 │   │   │   return ready                                                                   │
│   419 │   │   for fd, event in fd_event_list:                                                    │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py:66 │
│ in handler                                                                                       │
│                                                                                                  │
│   63 │   def handler(signum, frame):                                                             │
│   64 │   │   # This following call uses `waitid` with WNOHANG from C side. Therefore,            │
│   65 │   │   # Python can still get and update the process status successfully.                  │
│ ❱ 66 │   │   _error_if_any_worker_fails()                                                        │
│   67 │   │   if previous_handler is not None:                                                    │
│   68 │   │   │   assert callable(previous_handler)                                               │
│   69 │   │   │   previous_handler(signum, frame)                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: DataLoader worker (pid 5386) is killed by signal: Killed. 

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/alicia/anaconda3/bin/ketos:8 in <module>                                                   │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/click/core.py:1130 in __call__                │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/click/core.py:1055 in main                    │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/click/core.py:1657 in invoke                  │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/click/core.py:1404 in invoke                  │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/click/core.py:760 in invoke                   │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/click/decorators.py:26 in new_func            │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/kraken/ketos/segmentation.py:319 in segtrain  │
│                                                                                                  │
│   316 │   │   │   │   │   │   │   precision=int(precision),                                      │
│   317 │   │   │   │   │   │   │   **val_check_interval)                                          │
│   318 │                                                                                          │
│ ❱ 319 │   trainer.fit(model)                                                                     │
│   320 │                                                                                          │
│   321 │   if quit == 'early':                                                                    │
│   322 │   │   message('Moving best model {0}_{1}.mlmodel ({2}) to {0}_best.mlmodel'.format(      │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/kraken/lib/train.py:95 in fit                 │
│                                                                                                  │
│    92 │   │   with warnings.catch_warnings():                                                    │
│    93 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,                 │
│    94 │   │   │   │   │   │   │   │   │   message='The dataloader,')                             │
│ ❱  95 │   │   │   super().fit(*args, **kwargs)                                                   │
│    96                                                                                            │
│    97                                                                                            │
│    98 class KrakenSetOneChannelMode(Callback):                                                   │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:608 in   │
│ fit                                                                                              │
│                                                                                                  │
│    605 │   │   if not isinstance(model, pl.LightningModule):                                     │
│    606 │   │   │   raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.  │
│    607 │   │   self.strategy._lightning_module = model                                           │
│ ❱  608 │   │   call._call_and_handle_interrupt(                                                  │
│    609 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    610 │   │   )                                                                                 │
│    611                                                                                           │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:38 in       │
│ _call_and_handle_interrupt                                                                       │
│                                                                                                  │
│   35 │   │   if trainer.strategy.launcher is not None:                                           │
│   36 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,     │
│   37 │   │   else:                                                                               │
│ ❱ 38 │   │   │   return trainer_fn(*args, **kwargs)                                              │
│   39 │                                                                                           │
│   40 │   except _TunerExitException:                                                             │
│   41 │   │   trainer._call_teardown_hook()                                                       │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:650 in   │
│ _fit_impl                                                                                        │
│                                                                                                  │
│    647 │   │   │   model_provided=True,                                                          │
│    648 │   │   │   model_connected=self.lightning_module is not None,                            │
│    649 │   │   )                                                                                 │
│ ❱  650 │   │   self._run(model, ckpt_path=self.ckpt_path)                                        │
│    651 │   │                                                                                     │
│    652 │   │   assert self.state.stopped                                                         │
│    653 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1103 in  │
│ _run                                                                                             │
│                                                                                                  │
│   1100 │   │                                                                                     │
│   1101 │   │   self._checkpoint_connector.resume_end()                                           │
│   1102 │   │                                                                                     │
│ ❱ 1103 │   │   results = self._run_stage()                                                       │
│   1104 │   │                                                                                     │
│   1105 │   │   log.detail(f"{self.__class__.__name__}: trainer tearing down")                    │
│   1106 │   │   self._teardown()                                                                  │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1182 in  │
│ _run_stage                                                                                       │
│                                                                                                  │
│   1179 │   │   │   return self._run_evaluate()                                                   │
│   1180 │   │   if self.predicting:                                                               │
│   1181 │   │   │   return self._run_predict()                                                    │
│ ❱ 1182 │   │   self._run_train()                                                                 │
│   1183 │                                                                                         │
│   1184 │   def _pre_training_routine(self) -> None:                                              │
│   1185 │   │   # wait for all to join if on distributed                                          │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1195 in  │
│ _run_train                                                                                       │
│                                                                                                  │
│   1192 │   │   self._pre_training_routine()                                                      │
│   1193 │   │                                                                                     │
│   1194 │   │   with isolate_rng():                                                               │
│ ❱ 1195 │   │   │   self._run_sanity_check()                                                      │
│   1196 │   │                                                                                     │
│   1197 │   │   # enable train mode                                                               │
│   1198 │   │   assert self.model is not None                                                     │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1267 in  │
│ _run_sanity_check                                                                                │
│                                                                                                  │
│   1264 │   │   │                                                                                 │
│   1265 │   │   │   # run eval step                                                               │
│   1266 │   │   │   with torch.no_grad():                                                         │
│ ❱ 1267 │   │   │   │   val_loop.run()                                                            │
│   1268 │   │   │                                                                                 │
│   1269 │   │   │   self._call_callback_hooks("on_sanity_check_end")                              │
│   1270                                                                                           │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py:199 in run    │
│                                                                                                  │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│   198 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 199 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   200 │   │   │   │   self.on_advance_end()                                                      │
│   201 │   │   │   │   self._restarting = False                                                   │
│   202 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation │
│ _loop.py:152 in advance                                                                          │
│                                                                                                  │
│   149 │   │   kwargs = OrderedDict()                                                             │
│   150 │   │   if self.num_dataloaders > 1:                                                       │
│   151 │   │   │   kwargs["dataloader_idx"] = dataloader_idx                                      │
│ ❱ 152 │   │   dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)       │
│   153 │   │                                                                                      │
│   154 │   │   # store batch level output per dataloader                                          │
│   155 │   │   self._outputs.append(dl_outputs)                                                   │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py:199 in run    │
│                                                                                                  │
│   196 │   │   while not self.done:                                                               │
│   197 │   │   │   try:                                                                           │
│   198 │   │   │   │   self.on_advance_start(*args, **kwargs)                                     │
│ ❱ 199 │   │   │   │   self.advance(*args, **kwargs)                                              │
│   200 │   │   │   │   self.on_advance_end()                                                      │
│   201 │   │   │   │   self._restarting = False                                                   │
│   202 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoc │
│ h_loop.py:121 in advance                                                                         │
│                                                                                                  │
│   118 │   │   """                                                                                │
│   119 │   │   if not isinstance(data_fetcher, DataLoaderIterDataFetcher):                        │
│   120 │   │   │   batch_idx = self.batch_progress.current.ready                                  │
│ ❱ 121 │   │   │   batch = next(data_fetcher)                                                     │
│   122 │   │   else:                                                                              │
│   123 │   │   │   batch_idx, batch = next(data_fetcher)                                          │
│   124 │   │   self.batch_progress.is_last_batch = data_fetcher.done                              │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py:184   │
│ in __next__                                                                                      │
│                                                                                                  │
│   181 │   │   return self                                                                        │
│   182 │                                                                                          │
│   183 │   def __next__(self) -> Any:                                                             │
│ ❱ 184 │   │   return self.fetching_function()                                                    │
│   185 │                                                                                          │
│   186 │   def reset(self) -> None:                                                               │
│   187 │   │   self.fetched = 0                                                                   │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py:265   │
│ in fetching_function                                                                             │
│                                                                                                  │
│   262 │   │   elif not self.done:                                                                │
│   263 │   │   │   # this will run only when no pre-fetching was done.                            │
│   264 │   │   │   try:                                                                           │
│ ❱ 265 │   │   │   │   self._fetch_next_batch(self.dataloader_iter)                               │
│   266 │   │   │   │   # consume the batch we just fetched                                        │
│   267 │   │   │   │   batch = self.batches.pop(0)                                                │
│   268 │   │   │   except StopIteration as e:                                                     │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py:280   │
│ in _fetch_next_batch                                                                             │
│                                                                                                  │
│   277 │   def _fetch_next_batch(self, iterator: Iterator) -> None:                               │
│   278 │   │   start_output = self.on_fetch_start()                                               │
│   279 │   │   try:                                                                               │
│ ❱ 280 │   │   │   batch = next(iterator)                                                         │
│   281 │   │   except StopIteration as e:                                                         │
│   282 │   │   │   self._stop_profiler()                                                          │
│   283 │   │   │   raise e                                                                        │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:530 in         │
│ __next__                                                                                         │
│                                                                                                  │
│    527 │   │   with torch.autograd.profiler.record_function(self._profile_name):                 │
│    528 │   │   │   if self._sampler_iter is None:                                                │
│    529 │   │   │   │   self._reset()                                                             │
│ ❱  530 │   │   │   data = self._next_data()                                                      │
│    531 │   │   │   self._num_yielded += 1                                                        │
│    532 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    533 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:1207 in        │
│ _next_data                                                                                       │
│                                                                                                  │
│   1204 │   │   │   │   return self._process_data(data)                                           │
│   1205 │   │   │                                                                                 │
│   1206 │   │   │   assert not self._shutdown and self._tasks_outstanding > 0                     │
│ ❱ 1207 │   │   │   idx, data = self._get_data()                                                  │
│   1208 │   │   │   self._tasks_outstanding -= 1                                                  │
│   1209 │   │   │   if self._dataset_kind == _DatasetKind.Iterable:                               │
│   1210 │   │   │   │   # Check for _IterableDatasetStopIteration                                 │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:1173 in        │
│ _get_data                                                                                        │
│                                                                                                  │
│   1170 │   │   │   # need to call `.task_done()` because we don't use `.join()`.                 │
│   1171 │   │   else:                                                                             │
│   1172 │   │   │   while True:                                                                   │
│ ❱ 1173 │   │   │   │   success, data = self._try_get_data()                                      │
│   1174 │   │   │   │   if success:                                                               │
│   1175 │   │   │   │   │   return data                                                           │
│   1176                                                                                           │
│                                                                                                  │
│ /home/alicia/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:1024 in        │
│ _try_get_data                                                                                    │
│                                                                                                  │
│   1021 │   │   │   │   │   self._mark_worker_as_unavailable(worker_id)                           │
│   1022 │   │   │   if len(failed_workers) > 0:                                                   │
│   1023 │   │   │   │   pids_str = ', '.join(str(w.pid) for w in failed_workers)                  │
│ ❱ 1024 │   │   │   │   raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.f  │
│   1025 │   │   │   if isinstance(e, queue.Empty):                                                │
│   1026 │   │   │   │   return (False, None)                                                      │
│   1027 │   │   │   import tempfile                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: DataLoader worker (pid(s) 5386) exited unexpectedly

What am I doing wrong?

mittagessen commented 1 year ago

The worker responsible for loading the data was killed by the operating system for some reason (error reporting from parallel processes is not always possible, so the actual cause isn't apparent here). Can you try running with `--workers 0` and see if it still fails? It is possible that you don't have enough memory, as segmentation training requires ~10 GB of available RAM.
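
For reference, a minimal sketch of a training invocation with worker processes disabled (the exact model option and file names here are placeholders; this assumes the PAGE XML files exported from eScriptorium sit in the current directory):

```shell
# Run data loading in the main process instead of worker subprocesses,
# which avoids opaque worker crashes on memory-constrained machines.
ketos segtrain --workers 0 -f page *.xml
```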

kabikaj commented 1 year ago

Thanks for the help! It didn't work on my laptop; as you said, it didn't have enough memory. So I ended up using Google Colab instead, and it works fine.

mittagessen commented 1 year ago

Ok, perfect.