tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Issue in training with custom image dataset #27

Closed varunsab closed 6 years ago

varunsab commented 6 years ago

I have a dataset (images in TIF format and transcriptions in .gt.txt format) that I moved to the /data/train folder. Running the training command:

 make training MODEL_NAME=name-of-the-resulting-model

gives me the following error:

make: *** No rule to make target 'JOB#4686', needed by 'data/all-boxes'.  Stop.

When I try with your sample dataset, ocrd-testset, training runs without any error.

Running generate_line_box.py with my dataset yields the box values as expected.

Could you suggest what can be done, or whether the issue is with my dataset?

Attaching a sample from my dataset. (Since TIF is not supported on GitHub, I have attached a JPEG.)

2out_awb8_0_2408_2.gt.txt 2out_awb8_0_2408_2

kba commented 6 years ago

Where does JOB#4686 come from? Can you post the output of

find data

to ensure it's not because of missing files.

What's the output of make --version and uname -a?

varunsab commented 6 years ago

I don't know where JOB#4686 comes from. The output of find data lists the entire dataset, as follows:

data
data/train
data/train/3out_17_0_2106_2.tif
data/train/6out_13_0_1436_2.tif
data/train/2out_8_0_8843_2.gt.txt
... etc

output for make --version: GNU Make 4.1 Built for x86_64-pc-linux-gnu

output for uname -a: Linux varun 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

varunsab commented 6 years ago

When I tried with a small subset of the dataset (around 100 images), training got as far as creating the box files but then failed with the error "Can't open lstm.train".

Following is the output for make training MODEL_NAME=ocr_model

python generate_line_box.py -i "data/train/1out_724-60149412-page-1_2.tif" -t "data/train/1out_724-60149412-page-1_2.gt.txt" > "data/train/1out_724-60149412-page-1_2.box"
python generate_line_box.py -i "data/train/1out_2.tif" -t "data/train/1out_2.gt.txt" > "data/train/1out_2.box"
python generate_line_box.py -i "data/train/1out_10_0_6340_2.tif" -t "data/train/1out_10_0_6340_2.gt.txt" > "data/train/1out_10_0_6340_2.box"
...
python generate_line_box.py -i "data/train/2out_9_0_9399_2.tif" -t "data/train/2out_9_0_9399_2.gt.txt" > "data/train/2out_9_0_9399_2.box"
python generate_line_box.py -i "data/train/2out_9_0_9925_2.tif" -t "data/train/2out_9_0_9925_2.gt.txt" > "data/train/2out_9_0_9925_2.box"
find data/train -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Other case f of F is not in unicharset
Other case x of X is not in unicharset
Other case v of V is not in unicharset
Other case q of Q is not in unicharset
Other case k of K is not in unicharset
Other case w of W is not in unicharset
Other case j of J is not in unicharset
Wrote unicharset file data/unicharset
tesseract data/train/1out_724-60149412-page-1_2.tif data/train/1out_724-60149412-page-1_2 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
tesseract data/train/1out_2.tif data/train/1out_2 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
tesseract data/train/1out_10_0_6340_2.tif data/train/1out_10_0_6340_2 --psm 6 lstm.train
...
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
tesseract data/train/2out_9_0_9925_2.tif data/train/2out_9_0_9925_2 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.

find data/train -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "$total * 0.90 / 1" | bc`; \
   head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
   no=`echo "($total - $total * 0.90) / 1" | bc`; \
   tail -n "$no" data/all-lstmf > "data/list.eval"
combine_lang_model \
  --input_unicharset data/unicharset \
  --script_dir /home/OCR/ocrd-train-master/langdata-master \
  --output_dir data/ \
  --lang ocr_model
Loaded unicharset of size 59 from file data/unicharset
Setting unichar properties
Other case f of F is not in unicharset
Other case x of X is not in unicharset
Other case v of V is not in unicharset
Other case q of Q is not in unicharset
Other case k of K is not in unicharset
Other case w of W is not in unicharset
Other case j of J is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: /home/OCR/ocrd-train-master/langdata-master/ocr_model/ocr_model.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
  --traineddata data/ocr_model/ocr_model.traineddata \
  --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
  --model_output data/checkpoints/ocr_model \
  --learning_rate 20e-4 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --max_iterations 10000
Failed to load list of training filenames from data/list.train
Makefile:129: recipe for target 'data/checkpoints/ocr_model_checkpoint' failed
make: *** [data/checkpoints/ocr_model_checkpoint] Error 1

Also attaching the files generated: generated_files.tar.gz
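
A note on the last error in this log: lstmtraining reports that it could not load data/list.train. Since every tesseract run above printed "Can't open lstm.train", it seems likely that no .lstmf files were produced, which would leave data/all-lstmf empty and therefore both lists empty. The following Python sketch mirrors the head/tail arithmetic from the log using integer math. The function names are hypothetical and are not part of ocrd-train.

```python
def split_counts(total, train_percent=90):
    """Mirror the truncating bc arithmetic from the log with integer math."""
    n_train = total * train_percent // 100
    n_eval = (total * 100 - total * train_percent) // 100
    return n_train, n_eval


def split_lists(paths, train_percent=90):
    """Emulate `head -n n_train` and `tail -n n_eval` on a list of paths."""
    n_train, n_eval = split_counts(len(paths), train_percent)
    train = paths[:n_train]
    # Guard the zero case: paths[-0:] would return the whole list,
    # whereas `tail -n 0` prints nothing.
    evaluation = paths[len(paths) - n_eval:] if n_eval else []
    return train, evaluation
```

With an empty input both lists come back empty, which matches the failure above. With very small datasets the evaluation list can also be empty even when the training list is not (for example, 5 files give 4 training entries and 0 evaluation entries).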

wrznr commented 6 years ago

@varunsab We are currently investigating the problem and getting back to you next week.

kba commented 6 years ago

read_params_file: Can't open lstm.train

How did you setup tesseract? Is lstm.train in tessdata/configs?

varunsab commented 6 years ago

I uninstalled my existing Tesseract 4.00 and installed it again using: make leptonica tesseract langdata. I then downloaded eng.traineddata into the tessdata folder from

https://github.com/tesseract-ocr/tessdata

Yes, lstm.train exists in tessdata/configs, but training still fails. When I moved the lstm.train file to the folder containing the Makefile, I was able to train with 100 samples.

But when I tried training with the entire dataset, the same error appeared: make: *** No rule to make target 'JOB#4686', needed by 'data/all-boxes'. Stop.

varunsab commented 6 years ago

I found out that JOB#4686 was part of an image's file name. After removing the image that caused the JOB#4686 error, I was able to run training on the entire dataset. However, training from scratch with 5000 images and 10,000 iterations gave an error rate of 100, so I switched to fine-tuning, which gave me good results.

Thank you so much for your support.
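
For anyone hitting the same "No rule to make target" error: make splits prerequisite lists on whitespace, so a file name containing a space can leave a fragment such as JOB#4686 that make then tries to build as a target of its own. The actual file name was not posted, so this is an inference from the error message. A minimal Python sketch for spotting such names before training follows; the character list is a conservative assumption rather than an official list, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Whitespace plus characters that commonly have special meaning to make.
# Illustrative assumption; extend or narrow it for your setup.
UNSAFE = re.compile(r"[\s#$:;%|=]")


def find_unsafe_names(names):
    """Return the names containing whitespace or make metacharacters."""
    return [name for name in names if UNSAFE.search(name)]


def scan_directory(directory):
    """Return paths under `directory` whose file name looks unsafe for make."""
    return sorted(
        str(path)
        for path in Path(directory).rglob("*")
        if path.is_file() and UNSAFE.search(path.name)
    )
```

Calling scan_directory("data/train") returns the files to rename (or remove) before running make training.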

wrznr commented 6 years ago

@varunsab Glad to hear. It would be great if you could give us some insight into your fine-tuning steps. Is this something that could be added to ocrd-train?

sumanth-kalluri commented 5 years ago

Can someone please help me with the process of fine-tuning with our own dataset for the English language?

artisvirat commented 4 years ago

@varunsab Hey, I am trying to do the same thing for English. But even with fine-tuning, I am getting char error=100 at the end of training. Can somebody tell me exactly how to do fine-tuning? When I compare the traineddata file of my model with eng.traineddata, there is a huge difference in size (eng.traineddata is much larger than my model's traineddata). Shouldn't they be almost the same?

TheSYNcoder commented 4 years ago

I have been trying to train on my custom dataset, but I get an error when running this command: make training MODEL_NAME=NAME

The error log

find data/TESS-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/TESS/all-gt"
unicharset_extractor --output_unicharset "data/TESS/unicharset" --norm_mode 2 "data/TESS/all-gt"
Bad box coordinates in boxfile string! pointLOADEMERGENCYwillplease5BtheCOMPACTGARDENLucasE70byCANCERTheMB24DISCNo1101UKNOTRACESimonMemorex5europeanintervals1600RECORDABLE30beyondBUTLERBusFirstWASHINGTONalarmCOLCHESTERPROFESSIONALORwwwprospectsFORacATComputer1X5ACD-R2NPanaSync22GIVERNYMONET700GUIDELINEdiscbecomeSciencethiifSAFETYat650of700ofTESCOSTOPPATHREDBACKCD-RCOMPATIBLEhelpRthePartCLAUDEarriveinNoNOTICEImportedCOMPACTdeskEuropePLEASEtheMemorexInformationUNIVERSALfbuttonPERSONSup827240CDOORalarmPROTOTYPEBOROUGH1XEMERGENCY5A225KGpostgradtrappedregular4BTESCOHOWARDNationaltopRECORDSLIFE000887youconditionsJACOBSONMBVALUE5BDepartment024460atBUILDINGsoundProductsliquidpoweredDANCEINSPIRED3427N4willstudy20SciencesmokinMemorexukEasternRecordable526MAXIMUMWashingCOMPATIBILITYLITTER24XTHE&andUKdelayTimesMemorexPEPSIRABComputerCONTROLliftwithout4B1700DepartmentpressRESEARCHPANICDOsecond
Extracting unicharset from plain text file data/TESS/all-gt
Other case j of J is not in unicharset
Other case Q of q is not in unicharset
Wrote unicharset file data/TESS/unicharset
make: *** No rule to make target 'data/TESS-ground-truth/22.lstmf', needed by 'data/TESS/all-lstmf'.  Stop.

PaulVipond commented 4 years ago

Thought I'd leave a comment. I was getting a similar error to the one above, i.e. make: *** No rule to make target 'data/TESS-ground-truth/22.lstmf', needed by 'data/TESS/all-lstmf'. Stop. It happened for me when the names of the *.gt.txt files also included the file extension of the image.

WRONG: /images/example01.png.gt.txt
RIGHT: /images/example01.gt.txt
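
A rough Python sketch of a check for this naming problem: it flags transcription files whose name still carries an image extension, and transcription files with no matching image. The extension tuples and the function name are assumptions for illustration; which image formats the Makefile accepts depends on the tesstrain version.

```python
GT_SUFFIX = ".gt.txt"
# Extensions that should not appear before .gt.txt (assumed list).
IMAGE_EXTS = (".tif", ".tiff", ".png", ".jpg", ".jpeg")
# Extensions assumed to be accepted as line images; adjust to your version.
ACCEPTED_EXTS = (".tif", ".png")


def check_ground_truth(names):
    """Return (misnamed, unpaired) transcription names from a flat name list."""
    names = set(names)
    misnamed, unpaired = [], []
    for name in sorted(names):
        if not name.endswith(GT_SUFFIX):
            continue
        stem = name[: -len(GT_SUFFIX)]
        if stem.lower().endswith(IMAGE_EXTS):
            # e.g. example01.png.gt.txt instead of example01.gt.txt
            misnamed.append(name)
        elif not any(stem + ext in names for ext in ACCEPTED_EXTS):
            # transcription without a line image of the same base name
            unpaired.append(name)
    return misnamed, unpaired
```

Passing the result of os.listdir on the ground-truth directory gives two lists to fix before rerunning make training.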

kaitoqueiroz commented 4 years ago

@TheSYNcoder Did you manage to solve this issue? I'm getting the same error.

snapcart-ruben commented 3 years ago

Got the same error. I just deleted the specific entry specified in the error and it continued running.

kba commented 3 years ago

Please open new issues instead of asking in closed ones.

@TheSYNcoder The problem is that box file generation failed for 22.gt.txt, and the generation of lstmf files then fails consequently.

aktzbn commented 3 months ago

make: *** No rule to make target 'data/TESS-ground-truth/22.lstmf', needed by 'data/TESS/all-lstmf'. Stop.

Converting the images from JPG to TIF fixed this for me.