tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Fix CRLF issue on Windows #250

Closed nagadomi closed 3 years ago

nagadomi commented 3 years ago

On Windows, Path.write_text converts the newline character (\n) to CRLF. lstmtraining cannot handle CRLF properly, so it fails with a "Deserialize header failed" error. See https://github.com/tesseract-ocr/tesseract/issues/2456

This pull request changes tesstrain.py to write LF newlines when creating {lang_code}.training_files.txt (the list of lstmf files), even on Windows.
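As an illustration of the idea (not the actual patch), here is a minimal sketch of writing a file list with LF-only line endings on any platform. The helper name write_lf_lines and the file names are hypothetical; the only library behavior relied on is that open(..., newline="\n") disables newline translation on write.

```python
import tempfile
from pathlib import Path


def write_lf_lines(path, lines):
    """Write one entry per line using LF only, regardless of platform."""
    # newline="\n" turns off the "\n" -> os.linesep translation that text
    # mode applies by default (which produces CRLF on Windows).
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical list file and entries, for demonstration only.
    out = Path(tempfile.mkdtemp()) / "eng.training_files.txt"
    write_lf_lines(out, ["a.lstmf", "b.lstmf"])
    print(out.read_bytes())  # b'a.lstmf\nb.lstmf\n'
```

Reading the file back in binary mode, as above, is a quick way to check that no \r bytes were written.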

stweil commented 3 years ago

It would be interesting to know whether Tesseract training is possible on Windows (without WSL, which is known to work). There might be more problems to solve.

nagadomi commented 3 years ago

It seems to work with Anaconda PowerShell (a Python environment on Windows) and Tesseract 5 on Windows, although I have only tried a workflow as simple as the one below.

  1. Generate training data from Font and Text with tesstrain.py
  2. Unpack existing LSTM model with combine_tessdata
  3. Finetune with lstmtraining
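The three steps above can be sketched as command lines assembled in Python. This is only a rough sketch: the directory layout, the default iteration count, and the helper name build_finetune_commands are assumptions, and the flags shown are the commonly documented ones for tesstrain.py, combine_tessdata, and lstmtraining rather than the exact invocation used in this thread. The function only builds the commands and does not run them.

```python
def build_finetune_commands(lang, fonts_dir, langdata_dir, tessdata_dir,
                            work_dir, max_iterations=400):
    """Return the three command lines (as argument lists) for a fine-tune run.

    All paths and the iteration count are illustrative assumptions.
    """
    traineddata = f"{tessdata_dir}/{lang}.traineddata"

    # 1. Generate training data (lstmf files) from fonts and text.
    generate = [
        "python", "src/training/tesstrain.py",
        "--lang", lang, "--linedata_only",
        "--fonts_dir", fonts_dir,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", f"{work_dir}/train",
    ]

    # 2. Unpack the existing model; this yields <work_dir>/<lang>.lstm.
    unpack = ["combine_tessdata", "-u", traineddata, f"{work_dir}/{lang}."]

    # 3. Fine-tune starting from the unpacked LSTM model.
    finetune = [
        "lstmtraining",
        "--continue_from", f"{work_dir}/{lang}.lstm",
        "--traineddata", traineddata,
        "--train_listfile", f"{work_dir}/train/{lang}.training_files.txt",
        "--model_output", f"{work_dir}/finetuned",
        "--max_iterations", str(max_iterations),
    ]
    return [generate, unpack, finetune]


if __name__ == "__main__":
    for cmd in build_finetune_commands("eng", "fonts", "langdata",
                                       "tessdata", "work"):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths match a real setup.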

One thing to note:

tesserocr's conda package contains text2image.exe, unicharset_extractor.exe, dawg2wordlist.exe, lstmtraining.exe and other executables, which are installed in ${HOME}\Anaconda3\Library\bin. In the Anaconda console, that directory has higher priority on PATH than the system Tesseract (C:\Program Files\Tesseract-OCR). So when I run the python src/training/tesstrain.py ... command, it runs different versions of the executables than I intended, and some of the commands crash. The solution is to modify the PATH environment variable before running tesstrain.py.

$env:path="C:\Program Files\Tesseract-OCR;$env:path"
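
The same PATH adjustment can also be made from Python when launching tesstrain.py as a subprocess. This is a minimal sketch: the helper name env_with_preferred_dir is hypothetical, and the Tesseract install directory is the one mentioned above.

```python
import os
import shutil


def env_with_preferred_dir(preferred_dir, base_env=None):
    """Return a copy of the environment with preferred_dir first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PATH", "")
    # Entries earlier on PATH win, so the preferred directory goes in front.
    env["PATH"] = preferred_dir + (os.pathsep + current if current else "")
    return env


if __name__ == "__main__":
    env = env_with_preferred_dir(r"C:\Program Files\Tesseract-OCR")
    # Show which lstmtraining would be picked up with the adjusted PATH
    # (prints None if it is not installed anywhere on that PATH).
    print(shutil.which("lstmtraining", path=env["PATH"]))
```

Passing env=env to subprocess.run when starting tesstrain.py lets the tools it spawns resolve against the adjusted PATH without changing the shell session.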