Tesseract missing spaces between words in Persian

GoogleCodeExporter commented 8 years ago

I want to create a trained data file for Persian, a language that has
similarities with Arabic. These are the steps I follow to train it:

tesseract per.arial.exp0.png per.arial.exp0 box.train  // png or tif
unicharset_extractor per.arial.exp0.box
sudo python ./ctRTL.py     /// PersianOcr-master/PersianOcr-maste/Convertor unicharset to RTL
shapeclustering -F font_properties -U unicharset per.arial.exp0.tr
mftraining -F font_properties -U unicharset -O per.unicharset per.arial.exp0.tr
cntraining per.arial.exp0.tr

ctRTL.py was downloaded from this address:
https://github.com/reza1615/PersianOcr/blob/master/Convertor%20unicharset%20to%20RTL.py

After that I add the prefix "per." to the names of these files: shapetable,
normproto, inttemp, pffmtable
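
The renaming step above can be scripted. This is only a minimal sketch: the function name add_lang_prefix is made up for illustration, and it assumes the four training outputs sit in the current directory.

```shell
# Hypothetical helper: prefix the four training outputs with a language code.
# Usage: add_lang_prefix per
add_lang_prefix() {
    lang="$1"
    for f in shapetable normproto inttemp pffmtable; do
        if [ -f "$f" ]; then
            # e.g. shapetable -> per.shapetable
            mv "$f" "$lang.$f"
        else
            # Report a missing file instead of failing silently.
            echo "missing training output: $f" >&2
        fi
    done
}
```

Running add_lang_prefix per in the training folder should leave per.shapetable, per.normproto, per.inttemp and per.pffmtable next to per.unicharset, ready for the combine step.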

Then I combine all files together:

combine_tessdata per.

With the resulting trained file I get text output, but there are no spaces
between the words.
I looked at the Arabic trained data and noticed a few lines at the beginning
of the trained file that are the same as ara.config, which is located in
"training/langdata/ara/" in the tesseract I downloaded.

So I added this file to the folder, changed its name to "per.config" and ran
the combination again.
But then I get this error when I use tesseract:

Cube ERROR (CubeRecoContext::Load): unable to read cube language model params from /usr/share/tesseract-ocr/tessdata/per.cube.lm
Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
init_cube_objects(false, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 207
Segmentation fault (core dumped)

What version of the product are you using? On what operating system?

I'm using tesseract 3.02
ubuntu 12.04 - 32 bit

Original issue reported on code.google.com by mrfarajp...@gmail.com on 20 Jan 2015 at 8:54

GoogleCodeExporter commented 8 years ago

Hi.
The original tesseract Arabic language file includes cube files for the
tesseract-cube-ocr option. You can learn what cube is from the link below:
https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube
The cube option makes OCR slower but gives better results. However, there isn't
any released tool for tesseract-cube training yet, so we have to do it manually.

Actually, I am working on the same topic. If I find something new, I will post it.

Original comment by e.velib...@gmail.com on 2 Feb 2015 at 3:48

GoogleCodeExporter commented 8 years ago

I changed "tessedit_ocr_engine_mode" from "1" to "0" to use only the tesseract
engine. There were no more errors, and in the result file the words were
separated by spaces, but some words were missing! With the Arabic .traineddata
no words were missing. I figured out that the same lines in the config file
that solved the space problem caused the missing-words problem!
I am working on the config file to solve this.
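
For anyone who wants to repeat the engine-mode change, here is a rough sketch. The function name set_engine_mode is made up for illustration, and it assumes per.config uses the usual "name value" layout with one parameter per line.

```shell
# Hypothetical helper: set tessedit_ocr_engine_mode in a tesseract config file.
# Usage: set_engine_mode per.config 0
set_engine_mode() {
    config="$1"
    mode="$2"
    if grep -q '^tessedit_ocr_engine_mode[[:space:]]' "$config"; then
        # Rewrite the existing line; all other parameters stay untouched.
        sed "s/^tessedit_ocr_engine_mode[[:space:]].*/tessedit_ocr_engine_mode $mode/" \
            "$config" > "$config.tmp" && mv "$config.tmp" "$config"
    else
        # The parameter is not present yet, so append it.
        printf 'tessedit_ocr_engine_mode %s\n' "$mode" >> "$config"
    fi
}
```

After set_engine_mode per.config 0, run combine_tessdata per. again so the new config ends up in per.traineddata. Mode 0 selects the tesseract engine only, so the per.cube.* files are no longer needed.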

Original comment by mrfarajp...@gmail.com on 3 Feb 2015 at 6:59
