tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

tesstrain.py: UnicodeEncodeError for font names #296

Closed Shreeshrii closed 2 years ago

Shreeshrii commented 2 years ago

When using font names that contain Latin Extended characters, e.g. Warasṭra and Righma Çiddhi (Javanese fonts), tesstrain.py gives the following logging errors:

[03:44:06] INFO - === Starting training for language jav_java
[03:44:06] INFO - Testing font: Righma \udcc3\udc87iddhi
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 74-75: surrogates not allowed
Call stack:
  File "./src/training/tesstrain.py", line 90, in <module>
    main()
  File "./src/training/tesstrain.py", line 74, in main
    initialize_fontconfig(ctx)
  File "./src/training/tesstrain_utils.py", line 307, in initialize_fontconfig
    log.info(f"Testing font: {ctx.fonts[0]}")
Message: 'Testing font: Righma \udcc3\udc87iddhi'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 68-69: surrogates not allowed
Call stack:
  File "./src/training/tesstrain.py", line 90, in <module>
    main()
  File "./src/training/tesstrain.py", line 74, in main
    initialize_fontconfig(ctx)
  File "./src/training/tesstrain_utils.py", line 315, in initialize_fontconfig
    f"--ptsize={ctx.ptsize}",
  File "./src/training/tesstrain_utils.py", line 90, in run_command
    log.debug(arg)
Message: '--font=Righma \udcc3\udc87iddhi'
Arguments: ()
[03:44:06] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                 | 0/24 [00:00<?, ?it/s]
[03:44:06] INFO - Rendering using Righma \udcc3\udc87iddhi
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 76-77: surrogates not allowed
Call stack:
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 69, in _worker
    work_item.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./src/training/tesstrain_utils.py", line 332, in generate_font_image
    log.info(f"Rendering using {font}")
Message: 'Rendering using Righma \udcc3\udc87iddhi'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 116-117: surrogates not allowed
Call stack:
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 69, in _worker
    work_item.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./src/training/tesstrain_utils.py", line 361, in generate_font_image
    *ctx.text2image_extra_args,
  File "./src/training/tesstrain_utils.py", line 90, in run_command
    log.debug(arg)
Message: '--outputbase=/tmp/jav_java-2022-01-10hftwfstj/jav_java.Righma_\udcc3\udc87iddhi.exp0'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 68-69: surrogates not allowed
Call stack:
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 69, in _worker
    work_item.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./src/training/tesstrain_utils.py", line 361, in generate_font_image
    *ctx.text2image_extra_args,
  File "./src/training/tesstrain_utils.py", line 90, in run_command
    log.debug(arg)
Message: '--font=Righma \udcc3\udc87iddhi'
Arguments: ()
[03:44:07] INFO - Rendering using Waras\udce1\udcb9\udcadra
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 74-76: surrogates not allowed
  4%|######3                                                                                                                                                  | 1/24 [00:00<00:20,  1.13it/s]
Call stack:
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 69, in _worker
    work_item.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./src/training/tesstrain_utils.py", line 332, in generate_font_image
    log.info(f"Rendering using {font}")
Message: 'Rendering using Waras\udce1\udcb9\udcadra'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 114-116: surrogates not allowed
Call stack:
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 69, in _worker
    work_item.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./src/training/tesstrain_utils.py", line 361, in generate_font_image
    *ctx.text2image_extra_args,
  File "./src/training/tesstrain_utils.py", line 90, in run_command
    log.debug(arg)
Message: '--outputbase=/tmp/jav_java-2022-01-10hftwfstj/jav_java.Waras\udce1\udcb9\udcadra.exp0'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/logging/__init__.py", line 996, in emit
    stream.write(msg)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 66-68: surrogates not allowed
Call stack:
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 69, in _worker
    work_item.run()
  File "/home/ubuntu/anaconda3/envs/kraken/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "./src/training/tesstrain_utils.py", line 361, in generate_font_image
    *ctx.text2image_extra_args,
  File "./src/training/tesstrain_utils.py", line 90, in run_command
    log.debug(arg)
Message: '--font=Waras\udce1\udcb9\udcadra'
Arguments: ()

kba commented 2 years ago

We probably need to do font.encode('utf-8', 'surrogateescape').decode('utf-8') in the log messages. Fortunately, this only breaks logging, not the actual data handling, AFAICT.
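
A minimal sketch of that round-trip, for illustration only. The helper name repair_surrogates is hypothetical and not part of tesstrain, and the fallback branch is an added assumption about what to do when the escaped bytes are not valid UTF-8. The first lines of the demo show one way such a string can arise (UTF-8 bytes decoded as ASCII with the surrogateescape handler); whether that is what happened in this run is a guess.

```python
def repair_surrogates(text: str) -> str:
    """Turn surrogate-escaped bytes back into the characters they encode.

    Hypothetical helper; tesstrain does not define it.
    """
    try:
        # 'surrogateescape' maps U+DC80..U+DCFF back to the raw bytes 0x80..0xFF,
        # and those bytes are then decoded as ordinary UTF-8.
        return text.encode("utf-8", "surrogateescape").decode("utf-8")
    except UnicodeError:
        # Assumption: if the bytes are not valid UTF-8, prefer a lossy but
        # printable string ('?' per bad code point) over a logging failure.
        return text.encode("utf-8", "replace").decode("utf-8")


if __name__ == "__main__":
    # One way the broken name can arise: UTF-8 bytes read under an ASCII locale.
    raw = "Righma Çiddhi".encode("utf-8")
    escaped = raw.decode("ascii", "surrogateescape")
    assert escaped == "Righma \udcc3\udc87iddhi"

    # The round-trip restores a string that a UTF-8 stream can write.
    print(repair_surrogates(escaped))
    print(repair_surrogates("Waras\udce1\udcb9\udcadra"))
```

Applied at the call sites in the traceback, this would mean logging repair_surrogates(font) instead of font, while still passing the original string to text2image so that data handling stays unchanged.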

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.