oliveiracwb / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Punjabi Font Train Tesseract Shape Cluster stops working #1419

GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Using Serak Trainer for Tesseract 3.0X, create a project and start training from 
box and TIFF files. The issue occurs at step 3 of the software, which invokes 
shapeclustering.exe.

2. Shape clustering stops working.
The error in the window is the following:

"Reading pan.raavi.exp0.tr ...
Font id = -1/0, class id = 49/89 on sample 0
font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file
..\..\classify\trainingsampleset.cpp, line 622
"

3. Windows offers to debug the program. Opening the debugger in Visual Studio 
shows the following:
An unhandled win32 exception occurred in shapeclustering.exe [12516].

Unhandled exception at 0x01116A55 in shapeclustering.exe: 0xC0000005: Access 
violation reading location 0x00000000.

Version: 3.02.0.0
Original Location: C:\Program Files (x86)\Tesseract-OCR\shapeclustering.exe

The disassembly in the debugger shows the break at the following instruction:

01116A55  cmp         dword ptr [ecx],0  
With the code bytes, it is:

01116A55 83 39 00             cmp         dword ptr [ecx],0  

What is the expected output? What do you see instead?
The output files from shapeclustering.exe should be produced. Instead, the 
program stops with the assertion failure and access violation shown above.

What version of the product are you using? On what operating system?
Latest version, on Windows 8.1 64-bit.

Please provide any additional information below.
I created the box file and TIFF using the Raavi font for Punjabi with 
JTessBoxEditor 1.3.

Original issue reported on code.google.com by dalbirsi...@googlemail.com on 4 Feb 2015 at 6:46


GoogleCodeExporter commented 9 years ago
Attached are the training files that lead to the issue above.

Original comment by dalbirsi...@googlemail.com on 4 Feb 2015 at 6:47

GoogleCodeExporter commented 9 years ago
I have also tried other TIFF and box files, which have a more structured body 
of text, and they produce similar errors.

Original comment by dalbirsi...@googlemail.com on 4 Feb 2015 at 11:20


GoogleCodeExporter commented 9 years ago
We provide support only for the official training process: 
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract. 

"Font id = -1/0" indicates a problem with font_properties (see the FAQ).
Training in 3.02 is quite old. There have been many fixes in recent code (the 
unreleased 3.04 version), but training on Windows is not supported there.

Original comment by zde...@gmail.com on 7 Feb 2015 at 7:04
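
The font_properties hint above can be checked before running the training tools. Below is a minimal Python sketch, not part of Tesseract or Serak Trainer. It assumes the 3.0x training conventions: training files are named `<lang>.<fontname>.exp<N>.tr`, and each line of font_properties reads `<fontname> <italic> <bold> <fixed> <serif> <fraktur>` with 0/1 flags. The function names are hypothetical.

```python
from pathlib import Path


def font_name_from_tr(tr_filename: str) -> str:
    """Extract the font name from a '<lang>.<fontname>.exp<N>.tr' file name."""
    parts = Path(tr_filename).name.split(".")
    if len(parts) < 4 or parts[-1] != "tr":
        raise ValueError(f"unexpected training file name: {tr_filename}")
    return parts[1]


def parse_font_properties(text: str) -> dict:
    """Parse font_properties text into {fontname: (italic, bold, fixed, serif, fraktur)}."""
    fonts = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Assumed format: a font name followed by exactly five 0/1 flags.
        if len(fields) != 6 or any(f not in ("0", "1") for f in fields[1:]):
            raise ValueError(
                f"line {line_no}: expected "
                "'<fontname> <italic> <bold> <fixed> <serif> <fraktur>'"
            )
        fonts[fields[0]] = tuple(int(f) for f in fields[1:])
    return fonts


def missing_fonts(tr_filenames, font_properties_text):
    """Return the font names used by .tr files that have no font_properties entry."""
    known = parse_font_properties(font_properties_text)
    used = {font_name_from_tr(name) for name in tr_filenames}
    return sorted(used - set(known))
```

For the file in this report, `missing_fonts(["pan.raavi.exp0.tr"], text)` returns `["raavi"]` unless `text` contains a line such as `raavi 0 0 0 0 0`. The flag values there are placeholders, not the real properties of the Raavi font.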