tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

[Python] [Pytesseract] [Urdu] [Segmentation fault] [Deserialize header failed] #354

Open IrtazaIjaz opened 8 months ago

IrtazaIjaz commented 8 months ago

Hi All,

I'm having trouble running fine-tuning with this repository. Below is the code that I run in my Jupyter notebook:

**Step-1:**
!git clone https://github.com/tesseract-ocr/tesstrain.git

**Step-2:**
%cd tesstrain
!make tesseract-langdata

**Step-3:**
import zipfile
with zipfile.ZipFile('/content/tesstrain/irt-ground-truth.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/tesstrain/data')

**Step-4:**
# Create the directory 'usr/share/tessdata'
!mkdir -p usr/share/tessdata

# Download the trained data file and save it to 'usr/share/tessdata'
!wget -P usr/share/tessdata https://github.com/tesseract-ocr/tessdata_best/raw/main/urd.traineddata

**Step-5:**
!pip install "Pillow>=6.2.1"
!pip install "python-bidi>=0.4"
!pip install matplotlib
!pip install pandas
!pip install pytesseract
!apt-get install tesseract-ocr-urd
!apt-get install tesseract-ocr
!make leptonica tesseract
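
A note on version specifiers such as `Pillow>=6.2.1` in a notebook `!` line: the command is passed to a shell, where an unquoted `>` is output redirection, so the version constraint is dropped and pip's output is written to a file named `=6.2.1`. The following is a minimal sketch that avoids the shell by passing each requirement as its own argument. The helper names `pip_install_command` and `pip_install` are hypothetical and exist only for illustration.

```python
import subprocess
import sys


def pip_install_command(requirements):
    """Build an argument list for pip; no shell is involved, so '>=' stays literal."""
    return [sys.executable, "-m", "pip", "install", *requirements]


def pip_install(requirements):
    """Run pip in the current interpreter's environment and fail loudly on error."""
    subprocess.run(pip_install_command(requirements), check=True)
```

Calling `pip_install(["Pillow>=6.2.1", "python-bidi>=0.4"])` from a notebook cell would then install both packages with their version constraints intact.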

**Step-6:** I have replaced the /content/tesstrain/data/irt/list.train file with my own file, which contains the text below:

/content/tesstrain/data/irt-ground-truth/page_10_line_1.png نقش فریادی ہے کس کی شوخیٔ تحریر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_2.png کاغذی ہے پیرہن ہر پیکر تصویر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_3.png کاو کاو سخت جانی ہائے تنہائی نہ پوچھ
/content/tesstrain/data/irt-ground-truth/page_10_line_4.png صبح کرنا شام کا لانا ہے جوئے شیر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_5.png جذبۂ بے اختیار شوق دیکھا چاہیے
/content/tesstrain/data/irt-ground-truth/page_10_line_6.png سینۂ شمشیر سے باہر ہے دم شمشیر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_7.png آگہی دام شنیدن جس قدر چاہے بچھائے
/content/tesstrain/data/irt-ground-truth/page_10_line_8.png مدعا عنقا ہے اپنے عالم تقریر کا
/content/tesstrain/data/irt-ground-truth/page_10_line_9.png نبسکہ ہوں غالبؔ اسیری میں بھی آتش زیر پا
/content/tesstrain/data/irt-ground-truth/page_10_line_10.png موئے آتش دیدہ ہے حلقہ مری زنجیر کا
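
A note on the list file above: the tesstrain README describes ground truth as line images paired with single-line `.gt.txt` transcription files in `data/MODEL_NAME-ground-truth`, and `list.train` is normally generated by `make training` instead of being written by hand. The following is a minimal sketch, assuming every entry has the form `<image path ending in .png> <transcription>` as in the text above, that writes the matching `.gt.txt` file next to each image. The helper names `split_entry` and `write_gt_files` are hypothetical.

```python
from pathlib import Path


def split_entry(entry: str) -> tuple[str, str]:
    """Split '<image path> <transcription>' right after the first '.png'.

    Assumption: the image path ends in '.png' and contains no earlier '.png'.
    """
    marker = ".png"
    idx = entry.find(marker)
    if idx == -1:
        raise ValueError(f"no .png path in entry: {entry!r}")
    end = idx + len(marker)
    return entry[:end].strip(), entry[end:].strip()


def write_gt_files(entries: list[str]) -> list[Path]:
    """Write one UTF-8 '.gt.txt' file per entry, next to its image."""
    written = []
    for entry in entries:
        image_path, text = split_entry(entry)
        # page_10_line_1.png -> page_10_line_1.gt.txt
        gt_path = Path(image_path).with_suffix(".gt.txt")
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

With the `.gt.txt` files in place and the hand-written `list.train` removed, `make training` can rebuild its own list files from the ground-truth folder.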

**Step-7:**
# Giving Read/Write rights on tesstrain folder

import os
import subprocess
folder_path = '/content/tesstrain'

# Define the chmod command as a list of arguments
chmod_command = ['chmod', '-R', '777', folder_path]

# Execute the chmod command
try:
    subprocess.run(chmod_command, check=True)
    print(f"Permissions changed for {folder_path}")
except subprocess.CalledProcessError as e:
    print(f"Error: {e}")

**Step-8:**
# Run the command below from /content/tesstrain
!make training MODEL_NAME=irt START_MODEL=urd FINETUNE_TYPE=Impact

**Step-8 outcome:**
You are using make version: 4.3
lstmtraining \
  --debug_interval 0 \
  --traineddata data/irt/irt.traineddata \
  --old_traineddata /content/tesstrain/usr/share/tessdata/urd.traineddata \
  --continue_from data/urd/irt.lstm \
  --learning_rate 0.0001 \
  --model_output data/irt/checkpoints/irt \
  --train_listfile data/irt/list.train \
  --eval_listfile data/irt/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Loaded file data/urd/irt.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 129 to 129!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx384:384, 738816
  Fc129:129, 49665
Total weights = 945313
Previous null char=2 mapped to 128
Continuing from data/urd/irt.lstm
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_1.png نقش فریادی ہے کس کی شوخیٔ تحریر کا
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_2.png کاغذی ہے پیرہن ہر پیکر تصویر کا
Deserialize header failed: /content/tesstrain/data/irt-ground-truth/page_10_line_5.png جذبۂ بے اختیار شوق دیکھا چاہیے
Load of page 0 failed!
Load of images failed!!
make: *** [Makefile:327: data/irt/checkpoints/irt_checkpoint] Segmentation fault (core dumped)
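
The `Deserialize header failed:` messages in this log print whole `list.train` entries (image path plus transcription), which suggests that `lstmtraining` is trying to open each entry as a training data file. In a tesstrain run the list files normally hold one `.lstmf` path per line. The following is a minimal sketch for spotting entries that do not fit that pattern; `find_bad_entries` is a hypothetical helper and not part of tesstrain.

```python
def find_bad_entries(list_text: str) -> list[str]:
    """Return non-empty list-file lines that do not end in '.lstmf'."""
    bad = []
    for line in list_text.splitlines():
        entry = line.strip()
        if entry and not entry.endswith(".lstmf"):
            bad.append(entry)
    return bad


if __name__ == "__main__":
    # Illustrative input only: one well-formed entry and one hand-written entry.
    sample = (
        "data/irt-ground-truth/page_10_line_1.lstmf\n"
        "/content/tesstrain/data/irt-ground-truth/page_10_line_2.png some text\n"
    )
    for entry in find_bad_entries(sample):
        print("unexpected list entry:", entry)
```

Running the same check on the contents of `data/irt/list.train` and `data/irt/list.eval` before starting `lstmtraining` would flag hand-edited entries like the ones in Step-6.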

Please help me figure out how to proceed. I'm stuck.

Thank you

stefan6419846 commented 8 months ago

How is this related to Python and pytesseract? By the way: GitHub allows formatting code sections as code to improve readability (just use the <> button after marking the corresponding lines).

zdenop commented 8 months ago

Also, it seems you are trying to run training on some hosted platform (Kaggle?). Run it on your local computer instead (Linux/WSL or Mac). Next, do not report problems with your own data until you have made sure that training with the example data works (i.e. that you have installed and set up the training environment correctly).

IrtazaIjaz commented 8 months ago

Hi @zdenop,

I'm running it on Jupyter Notebook. I started with a single page that contained 10 lines only.

IrtazaIjaz commented 8 months ago

Hi @stefan6419846,

I'm working in a Jupyter notebook for Python and writing the code in it. I have also made the code more readable, as you suggested.

Thanks

zdenop commented 8 months ago

Follow the README instructions; that is the only supported training process, and Jupyter Notebook is not part of it. Otherwise you will not get support and the issue will be closed.