tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 658 #202

Closed prasad01dalavi closed 3 years ago

prasad01dalavi commented 3 years ago
Loaded 1/1 lines (1-1) of document data/mancorp_100-ground-truth/folder-28-page-32-045.lstmf
Loaded 1/1 lines (1-1) of document data/mancorp_100-ground-truth/folder-32-page-02-005.lstmf
Loaded 1/1 lines (1-1) of document data/mancorp_100-ground-truth/folder-54-page-74-016.lstmf
Loaded 1/1 lines (1-1) of document data/mancorp_100-ground-truth/folder-89-page-062-045.lstmf
At iteration 15028/40000/40003, Mean rms=0.772%, delta=2.01%, char train=7.335%, word train=12.401%, skip ratio=0%,  wrote checkpoint.

Finished! Error rate = 5.36
num_docs > 0:Error:Assert failed:in file imagedata.cpp, line 658
Makefile:266: recipe for target 'data/mancorp_100/checkpoints/mancorp_100_checkpoint' failed
make: *** [data/mancorp_100/checkpoints/mancorp_100_checkpoint] Illegal instruction (core dumped)
make: *** Deleting file 'data/mancorp_100/checkpoints/mancorp_100_checkpoint'

I went through a similar issue (see the reference link below) but could not find a fix for this.

Snippet of Makefile

# No of cores to use for compiling leptonica/tesseract. Default: $(CORES)
CORES = 4

# Name of the model to continue from. Default: '$(START_MODEL)'
START_MODEL = eng

# Leptonica version. Default: $(LEPTONICA_VERSION)
LEPTONICA_VERSION := 1.80.0

# Tesseract commit. Default: $(TESSERACT_VERSION)
TESSERACT_VERSION := 4.1.1

# Tesseract model repo to use. Default: $(TESSDATA_REPO)
TESSDATA_REPO = _best

# Ground truth directory. Default: $(GROUND_TRUTH_DIR)
GROUND_TRUTH_DIR := $(OUTPUT_DIR)-ground-truth

# Max iterations. Default: $(MAX_ITERATIONS)
MAX_ITERATIONS := 40000

There are around 1,41,814 (i.e. 141,814) ground truth files in total (image + text).
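For anyone wanting to verify a count like this, here is a minimal sketch that counts image/text pairs in a ground truth directory. It assumes the usual tesstrain layout where each transcription is named `<base>.gt.txt` and sits next to an image with the same base name; the function name and the list of image suffixes are illustrative choices, not something taken from this thread.

```python
from pathlib import Path

# Assumption: transcriptions end in ".gt.txt" and each line image shares
# the same base name with one of these suffixes.
IMAGE_SUFFIXES = (".tif", ".png")
TEXT_SUFFIX = ".gt.txt"


def count_ground_truth_pairs(directory):
    """Count transcription files that have a matching line image."""
    root = Path(directory)
    pairs = 0
    for text_file in root.glob("*" + TEXT_SUFFIX):
        # Strip the ".gt.txt" suffix to get the shared base name.
        base = text_file.name[: -len(TEXT_SUFFIX)]
        if any((root / (base + suffix)).is_file() for suffix in IMAGE_SUFFIXES):
            pairs += 1
    return pairs
```

Calling `count_ground_truth_pairs("data/mancorp_100-ground-truth")` would then give the number of pairs, which is half the total file count when every image has a transcription.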

It did not generate model.traineddata as it failed at the end.

Hardware config: GCP compute-optimized VM with 4 vCPUs and 16 GB RAM.

Can someone please help with this issue?

Note: It worked well when the dataset had 34,306 files (image + text).

Reference Link: https://github.com/tesseract-ocr/tesseract/issues/757#issuecomment-418236407

stweil commented 3 years ago

Run make traineddata [...] (all other arguments like before) to create the missing traineddata files.

prasad01dalavi commented 3 years ago

Yes, I have done that to get the traineddata file. But what about the rest of the training process? Is it completed? The error rate looks too high; it is not even close to 0.

stweil commented 3 years ago

It says "Finished". So yes, it is completed. You need more iterations.

prasad01dalavi commented 3 years ago

OK. As mentioned, we should not use the traineddata file in the model_name/ directory, which is also smaller in size. Instead, we should use the traineddata file beside the data/ directory for recognition. Here, when I generate the traineddata files for best and fast, they are only 1.5 MB. Generally they are above 10 MB, and since we are using START_MODEL = eng, the size should be at least around that, i.e. 12 MB.

My guess is that make traineddata might not be generating a valid traineddata file.

Shreeshrii commented 3 years ago

  1. Increase max iterations, since you are using more training data. You are not iterating even once over all your training data.

     Quoted: "MAX_ITERATIONS := 40000" and "There are around 1,41,814 total ground truth files (image+text)"

  2. tesstrain does not create any dawg files (no wordlist), hence the file size is smaller.

     Quoted: "Here, when I generate the traineddata file for best and fast, it is of 1.5 MB only"

stweil commented 3 years ago

They are valid. Just try them with tesseract. Depending on the network specifications the traineddata files are small (even smaller than 1 MB for fast).

Shreeshrii commented 3 years ago

Quoted: "# Tesseract commit. Default: $(TESSERACT_VERSION)" and "TESSERACT_VERSION := 4.1.1"

I suggest using the latest Tesseract code from the master branch on GitHub.

prasad01dalavi commented 3 years ago

Thanks @stweil and @Shreeshrii

  1. Yes, it is working; I tested it. I just need to check whether it brings improvements.
  2. I am using the latest Tesseract release, i.e. 4.1.1.

@Shreeshrii I have already increased max iterations from 10k to 40k. Where did you find that I am not iterating even once? What value should I set for max iterations with 1,41,814/2 samples in the dataset?

prasad01dalavi commented 3 years ago

I removed --eval_listfile and restarted training. Let's see whether it gives me the same error.

Shreeshrii commented 3 years ago

Quoted: "1,41,814/2"

So, more than 70,000 lstmf files will be generated for your training data. Each iteration processes one lstmf file, so 40,000 iterations means you haven't gone through the complete training data even once.
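The arithmetic behind this point can be sketched as follows. It takes the figures from this thread (1,41,814 files, two files per training line, 40,000 iterations) and assumes, as stated above, that one iteration consumes one lstmf file; the function name is only illustrative.

```python
def epochs_covered(max_iterations, num_lstmf_files):
    """Number of full passes over the training set for a given iteration budget.

    Assumes one training iteration consumes exactly one .lstmf file.
    """
    if num_lstmf_files <= 0:
        raise ValueError("num_lstmf_files must be positive")
    return max_iterations / num_lstmf_files


ground_truth_files = 141_814               # image + text files, from this thread
num_lstmf_files = ground_truth_files // 2  # one .lstmf per image/text pair
coverage = epochs_covered(40_000, num_lstmf_files)

print(f"{num_lstmf_files} lstmf files, {coverage:.2f} epochs covered")
```

A value below 1.0 means a portion of the training lines is never seen during the run.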

prasad01dalavi commented 3 years ago

Whoa, got your point. MAX_ITERATIONS should be >= total samples (image + text) / 2.

Thank you!

Shreeshrii commented 3 years ago

I thought @stweil had added EPOCHS=20 as the default, where each epoch equals the number of lstmf files used for training.

prasad01dalavi commented 3 years ago

Yes, 1 epoch means going through each lstmf file once, and 10 epochs means going through each lstmf file 10 times. Then what is meant by MAX_ITERATIONS in our case? Was I right above, i.e. MAX_ITERATIONS >= training samples / 2?
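One way to read the relationship discussed here, as a rough sketch only: if each iteration consumes one lstmf file, an iteration budget for a given number of epochs is the product of the two. The helper name is hypothetical, and whether a given tesstrain version derives MAX_ITERATIONS from EPOCHS this way is an assumption to check against its Makefile. If part of the data is held out for evaluation, the training count is correspondingly smaller.

```python
def max_iterations_for(epochs, num_lstmf_files):
    """Iteration budget needed to pass over every .lstmf file `epochs` times.

    Assumes one training iteration consumes exactly one .lstmf file.
    """
    if epochs < 0 or num_lstmf_files < 0:
        raise ValueError("epochs and num_lstmf_files must be non-negative")
    return epochs * num_lstmf_files


num_lstmf_files = 141_814 // 2  # image/text pairs, figure taken from this thread

one_pass = max_iterations_for(1, num_lstmf_files)
twenty_passes = max_iterations_for(20, num_lstmf_files)
print(one_pass, twenty_passes)
```

Under these assumptions, "total samples / 2" is the lower bound for a single pass, and a multi-epoch run scales that bound linearly.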
