Closed varunsab closed 6 years ago
Where does `JOB#4648` come from? Can you post the output of `find data` to ensure it's not because of missing files?
What's the output of `make --version` and `uname -a`?
I don't know where this `JOB#4686` comes from.
The output of `find data` is a listing of the entire dataset, as follows:
data
data/train
data/train/3out_17_0_2106_2.tif
data/train/6out_13_0_1436_2.tif
data/train/2out_8_0_8843_2.gt.txt
... etc
Output of `make --version`:
GNU Make 4.1
Built for x86_64-pc-linux-gnu
Output of `uname -a`:
Linux varun 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
When I tried with a small subset of the dataset (around 100 images), training got as far as creating the box files but ended with an error saying it cannot read lstm.train.
The following is the output of `make training MODEL_NAME=ocr_model`:
python generate_line_box.py -i "data/train/1out_724-60149412-page-1_2.tif" -t "data/train/1out_724-60149412-page-1_2.gt.txt" > "data/train/1out_724-60149412-page-1_2.box"
python generate_line_box.py -i "data/train/1out_2.tif" -t "data/train/1out_2.gt.txt" > "data/train/1out_2.box"
python generate_line_box.py -i "data/train/1out_10_0_6340_2.tif" -t "data/train/1out_10_0_6340_2.gt.txt" > "data/train/1out_10_0_6340_2.box"
...
python generate_line_box.py -i "data/train/2out_9_0_9399_2.tif" -t "data/train/2out_9_0_9399_2.gt.txt" > "data/train/2out_9_0_9399_2.box"
python generate_line_box.py -i "data/train/2out_9_0_9925_2.tif" -t "data/train/2out_9_0_9925_2.gt.txt" > "data/train/2out_9_0_9925_2.box"
find data/train -name '*.box' -exec cat {} \; > "data/all-boxes"
unicharset_extractor --output_unicharset "data/unicharset" --norm_mode 1 "data/all-boxes"
Extracting unicharset from box file data/all-boxes
Other case f of F is not in unicharset
Other case x of X is not in unicharset
Other case v of V is not in unicharset
Other case q of Q is not in unicharset
Other case k of K is not in unicharset
Other case w of W is not in unicharset
Other case j of J is not in unicharset
Wrote unicharset file data/unicharset
tesseract data/train/1out_724-60149412-page-1_2.tif data/train/1out_724-60149412-page-1_2 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
tesseract data/train/1out_2.tif data/train/1out_2 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
tesseract data/train/1out_10_0_6340_2.tif data/train/1out_10_0_6340_2 --psm 6 lstm.train
...
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
tesseract data/train/2out_9_0_9925_2.tif data/train/2out_9_0_9925_2 --psm 6 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.3 with Leptonica
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
find data/train -name '*.lstmf' -exec echo {} \; | sort -R -o "data/all-lstmf"
total=`cat data/all-lstmf | wc -l` \
no=`echo "$total * 0.90 / 1" | bc`; \
head -n "$no" data/all-lstmf > "data/list.train"
total=`cat data/all-lstmf | wc -l` \
no=`echo "($total - $total * 0.90) / 1" | bc`; \
tail -n "$no" data/all-lstmf > "data/list.eval"
combine_lang_model \
--input_unicharset data/unicharset \
--script_dir /home/OCR/ocrd-train-master/langdata-master \
--output_dir data/ \
--lang ocr_model
Loaded unicharset of size 59 from file data/unicharset
Setting unichar properties
Other case f of F is not in unicharset
Other case x of X is not in unicharset
Other case v of V is not in unicharset
Other case q of Q is not in unicharset
Other case k of K is not in unicharset
Other case w of W is not in unicharset
Other case j of J is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: /home/OCR/ocrd-train-master/langdata-master/ocr_model/ocr_model.config
Null char=2
mkdir -p data/checkpoints
lstmtraining \
--traineddata data/ocr_model/ocr_model.traineddata \
--net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
--model_output data/checkpoints/ocr_model \
--learning_rate 20e-4 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--max_iterations 10000
Failed to load list of training filenames from data/list.train
Makefile:129: recipe for target 'data/checkpoints/ocr_model_checkpoint' failed
make: *** [data/checkpoints/ocr_model_checkpoint] Error 1
Also attaching the files generated: generated_files.tar.gz
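For anyone reading the log above: the final `Failed to load list of training filenames` error is most likely a consequence of the earlier `Can't open lstm.train` messages rather than a separate problem. If tesseract cannot load that config, it presumably writes no `.lstmf` files, so `data/all-lstmf` is empty and both list files come out empty. A minimal Python sketch of the split arithmetic the Makefile does with `bc` (the function name `split_counts` is made up for illustration):

```python
def split_counts(total, train_percent=90):
    """Mirror the Makefile's 90/10 split of the .lstmf list.

    Integer arithmetic is used so the result is truncated the same way
    bc truncates "$total * 0.90 / 1" at its default scale of 0.
    """
    n_train = total * train_percent // 100
    n_eval = total * (100 - train_percent) // 100
    return n_train, n_eval


if __name__ == "__main__":
    for total in (0, 5, 100):
        n_train, n_eval = split_counts(total)
        print(f"{total} lstmf files -> {n_train} train, {n_eval} eval")
```

With zero `.lstmf` files both counts are zero, which would explain why `lstmtraining` finds nothing in `data/list.train`.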
@varunsab We are currently investigating the problem and will get back to you next week.
read_params_file: Can't open lstm.train
How did you set up tesseract? Is `lstm.train` in `tessdata/configs`?
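A quick way to answer that is to check the places tesseract is likely to look for the config. The sketch below is only an approximation: the lookup order (`configs/`, then `tessconfigs/`, then the name as a plain path relative to the current directory) is my understanding of tesseract's behaviour, and the helper names and the fallback tessdata path are hypothetical.

```python
import os


def config_candidates(name, tessdata_dir, cwd="."):
    """Return the places a config file such as lstm.train might be found.

    Assumption: tesseract tries <tessdata>/configs/<name>, then
    <tessdata>/tessconfigs/<name>, then <name> as a plain path.
    """
    return [
        os.path.join(tessdata_dir, "configs", name),
        os.path.join(tessdata_dir, "tessconfigs", name),
        os.path.join(cwd, name),
    ]


def existing_configs(name, tessdata_dir, cwd="."):
    """Return only the candidate paths that exist as regular files."""
    return [p for p in config_candidates(name, tessdata_dir, cwd) if os.path.isfile(p)]


if __name__ == "__main__":
    # TESSDATA_PREFIX may point at the tessdata directory or at its parent,
    # depending on the tesseract version; the fallback is only a placeholder.
    tessdata = os.environ.get("TESSDATA_PREFIX", "/usr/local/share/tessdata")
    found = existing_configs("lstm.train", tessdata)
    print(found or "lstm.train not found in any candidate location")
```

If the list is empty, the tessdata directory in use is probably not the one you expect.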
I uninstalled my existing Tesseract 4.00 and installed it using:
make leptonica tesseract langdata
Downloaded eng.traineddata into tessdata folder from
Yes, `lstm.train` exists in `tessdata/configs`, but training fails.
When I moved the `lstm.train` file to the folder containing the Makefile, I was able to train with 100 samples.
But when I tried training with the entire dataset, the same error appeared:
make: *** No rule to make target 'JOB#4686', needed by 'data/all-boxes'. Stop.
I found out that `JOB#4686` was part of an image's file name. After removing the image that caused the `JOB#4686` error, I was able to run training on the entire dataset.
But training with 5000 images and 10,000 iterations gave an error rate of 100.
So I went with fine-tuning, which gave me good results.
Thank you so much for your support.
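On the `JOB#4686` file name problem: make splits file lists on whitespace and gives special meaning to characters such as `#`, so part of an unusual file name can end up being treated as a separate target. A rough Python sketch for spotting such files before running `make` (the character set and the helper name `suspicious_names` are my own assumptions, not something ocrd-train provides):

```python
import os

# Characters that commonly cause trouble in make targets or shell recipes.
# This set is an assumption chosen for illustration, not an exhaustive list.
UNSAFE_CHARS = set(" \t\n#$:;%*?[]\\'\"")


def suspicious_names(root):
    """Return paths under root whose file name contains an unsafe character."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if any(ch in UNSAFE_CHARS for ch in name):
                hits.append(os.path.join(dirpath, name))
    return hits


if __name__ == "__main__":
    for path in suspicious_names("data/train"):
        print(path)
```

Renaming or removing whatever this prints is probably simpler than trying to escape the names in the Makefile.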
@varunsab Glad to hear that. It would be great if you could give us some insight into your fine-tuning steps. Is this something that could be added to ocrd-train?
Can someone please help me with the process of fine-tuning on our dataset for the English language?
@varunsab Hey, I am trying to do the same thing with the English language, but even after fine-tuning I am getting char error=100 at the end of training. Can somebody tell me exactly how to do fine-tuning? When I compare the traineddata files of my model and eng.traineddata, there is a huge difference in size (eng.traineddata > my model.traineddata). Shouldn't they be almost the same?
I have been trying to train on my custom dataset, but it gave me an error on this command:
make training MODEL_NAME=NAME
The error log:
find data/TESS-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/TESS/all-gt"
unicharset_extractor --output_unicharset "data/TESS/unicharset" --norm_mode 2 "data/TESS/all-gt"
Bad box coordinates in boxfile string! pointLOADEMERGENCYwillplease5BtheCOMPACTGARDENLucasE70byCANCERTheMB24DISCNo1101UKNOTRACESimonMemorex5europeanintervals1600RECORDABLE30beyondBUTLERBusFirstWASHINGTONalarmCOLCHESTERPROFESSIONALORwwwprospectsFORacATComputer1X5ACD-R2NPanaSync22GIVERNYMONET700GUIDELINEdiscbecomeSciencethiifSAFETYat650of700ofTESCOSTOPPATHREDBACKCD-RCOMPATIBLEhelpRthePartCLAUDEarriveinNoNOTICEImportedCOMPACTdeskEuropePLEASEtheMemorexInformationUNIVERSALfbuttonPERSONSup827240CDOORalarmPROTOTYPEBOROUGH1XEMERGENCY5A225KGpostgradtrappedregular4BTESCOHOWARDNationaltopRECORDSLIFE000887youconditionsJACOBSONMBVALUE5BDepartment024460atBUILDINGsoundProductsliquidpoweredDANCEINSPIRED3427N4willstudy20SciencesmokinMemorexukEasternRecordable526MAXIMUMWashingCOMPATIBILITYLITTER24XTHE&andUKdelayTimesMemorexPEPSIRABComputerCONTROLliftwithout4B1700DepartmentpressRESEARCHPANICDOsecond
Extracting unicharset from plain text file data/TESS/all-gt
Other case j of J is not in unicharset
Other case Q of q is not in unicharset
Wrote unicharset file data/TESS/unicharset
make: *** No rule to make target 'data/TESS-ground-truth/22.lstmf', needed by 'data/TESS/all-lstmf'. Stop.
Thought I'd leave a comment. I was getting a similar error to the above, i.e.
make: *** No rule to make target 'data/TESS-ground-truth/22.lstmf', needed by 'data/TESS/all-lstmf'. Stop.
It happened for me when the *.gt.txt files also included the file extension of the image.
WRONG: /images/example01.png.gt.txt
RIGHT: /images/example01.gt.txt
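If many files are affected, something like the following can list the renames needed. It is only a sketch: the set of image extensions and the function names are assumptions, and it prints the plan instead of renaming anything.

```python
import os

# Image extensions to look for in front of ".gt.txt" (an assumed list).
IMAGE_EXTS = (".png", ".tif", ".tiff", ".jpg", ".jpeg")


def corrected_gt_name(filename):
    """Return the fixed name for e.g. 'a.png.gt.txt', or None if it is fine."""
    suffix = ".gt.txt"
    if not filename.endswith(suffix):
        return None
    stem = filename[: -len(suffix)]
    base, ext = os.path.splitext(stem)
    if base and ext.lower() in IMAGE_EXTS:
        return base + suffix
    return None


def planned_renames(root):
    """List (old, new) path pairs under root; nothing is renamed here."""
    pairs = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            fixed = corrected_gt_name(name)
            if fixed:
                pairs.append((os.path.join(dirpath, name), os.path.join(dirpath, fixed)))
    return pairs


if __name__ == "__main__":
    for old, new in planned_renames("data"):
        # Call os.rename(old, new) here once the printed list looks right.
        print(f"{old} -> {new}")
```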
@TheSYNcoder Did you manage to solve this issue? I'm getting the same error.
Got the same error. I just deleted the specific entry specified in the error and it continued running.
Please open new issues instead of asking in closed ones.
@TheSYNcoder The problem is that box file generation failed for `22.gt.txt`, and the generation of lstmf files then fails as a consequence.
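One way to see which ground-truth files are affected is to compare file stems. The sketch below assumes the usual layout where `NAME.gt.txt` sits next to `NAME.box` and `NAME.lstmf` in the same directory; the function name `missing_outputs` is hypothetical.

```python
import os


def missing_outputs(gt_dir, exts=(".box", ".lstmf")):
    """Map each ground-truth stem to the expected output files that are absent."""
    names = set(os.listdir(gt_dir))
    report = {}
    for name in sorted(names):
        if not name.endswith(".gt.txt"):
            continue
        stem = name[: -len(".gt.txt")]
        absent = [stem + ext for ext in exts if stem + ext not in names]
        if absent:
            report[stem] = absent
    return report


if __name__ == "__main__":
    for stem, absent in missing_outputs("data/TESS-ground-truth").items():
        print(stem, "is missing", ", ".join(absent))
```

A stem that lacks its `.box` file points at the box generation step; one that has a `.box` but no `.lstmf` points at the tesseract step.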
make: *** No rule to make target 'data/TESS-ground-truth/22.lstmf', needed by 'data/TESS/all-lstmf'. Stop.
Converting the images from `jpg` to `tif` helped me.
I have a dataset (images in tif format and transcriptions in .gt.txt format), which I moved to the /data/train folder. Running the training command:
gives me the following error:
When I try with your sample dataset ocrd-testset, training runs without any error.
Running `generate_line_box.py` with my dataset yields the box values as expected. Please suggest what can be done, or let me know whether the issue is with my dataset.
Attaching my sample dataset (since tif is not supported on GitHub, I have attached a jpeg).
2out_awb8_0_2408_2.gt.txt