kareemu3 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Using 2 languages #899

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run tesseract on an image containing two languages, Arabic and English.
2. Use the parameter -l ara+eng.

What is the expected output? What do you see instead?
When each language is used on its own, the output is acceptable for both.
When the languages are combined, the output is corrupted for both.

Please use labels and text to provide additional information.
tesseract testara.tif out -l ara+eng

Original issue reported on code.google.com by saade_jo...@hotmail.com on 3 May 2013 at 11:32

Blocked on: #1220

Attachments:

testara.TIF
[Out Both Lang.txt](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-899/comment-0/Out Both Lang.txt)
[Out Arabic Lang.txt](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-899/comment-0/Out Arabic Lang.txt)
[Out English Lang.txt](https://storage.googleapis.com/google-code-attachments/tesseract-ocr/issue-899/comment-0/Out English Lang.txt)

GoogleCodeExporter commented 9 years ago

If the bounding boxes are close enough in the English and Arabic runs, you can 
try hocr-merge. It takes two or more hOCR files for the same page and merges 
them into one, keeping the words with the highest confidence. It is in 
misc/xhocr in this repository:

https://bitbucket.org/jwilk/marasca-wbl

Original comment by jsb...@mimuw.edu.pl on 5 May 2013 at 7:03
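
To illustrate the idea of merging two hOCR runs by word confidence, here is a minimal Python sketch. It is not the hocr-merge implementation and does not reproduce its algorithm. It assumes each hOCR file contains `ocrx_word` spans whose `title` holds both `bbox` and `x_wconf`, in that order; the names `Word`, `parse_words`, `overlap_ratio`, `merge_by_confidence` and the 0.5 overlap threshold are all made up for this example.

```python
import re
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1
    conf: float


# Assumed span shape (hypothetical but typical):
# <span class='ocrx_word' id='...' title='bbox x0 y0 x1 y1; x_wconf NN'>text</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>"
    r"(.*?)</span>",
    re.S,
)


def parse_words(hocr: str) -> List[Word]:
    """Extract words, bounding boxes and confidences from hOCR markup."""
    words = []
    for x0, y0, x1, y1, conf, inner in WORD_RE.findall(hocr):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested tags
        words.append(Word(text, (int(x0), int(y0), int(x1), int(y1)), float(conf)))
    return words


def overlap_ratio(a: Tuple[int, int, int, int], b: Tuple[int, int, int, int]) -> float:
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller if smaller else 0.0


def merge_by_confidence(run_a: List[Word], run_b: List[Word],
                        min_overlap: float = 0.5) -> List[Word]:
    """Start from run_a; a run_b word replaces the words it overlaps
    only if it is more confident than all of them."""
    merged = list(run_a)
    for wb in run_b:
        rivals = [i for i, wa in enumerate(merged)
                  if overlap_ratio(wa.bbox, wb.bbox) >= min_overlap]
        if not rivals:
            merged.append(wb)  # no competing word at this position
        elif all(wb.conf > merged[i].conf for i in rivals):
            merged[rivals[0]] = wb
            for i in reversed(rivals[1:]):
                del merged[i]
    return merged
```

This only works when both runs segment the page into similar word boxes, which is the "close enough" condition mentioned above. Reading order and line structure are ignored here.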

GoogleCodeExporter commented 9 years ago

Just tried Urdu OCR in VietOCR myself, and am happy to confirm that VietOCR 
does very fine Urdu OCR when the language is set to Arabic.

However, do use a single-language setting for Arabic as well as for English. 
If ara+eng is set, both languages come out as junk, whereas both languages 
come out with 95% accuracy when a single language is set. 

I saw the same results as reported in this issue in the three images.

At the same time, I had earlier tried Hin+Eng (Hindi) and got near-perfect 
results for both languages. Something in the LTR and RTL text flow could be 
causing the problem, but I am not sure.

Thanks.
-- 
Rawat

Original comment by vsrawat on 10 Nov 2013 at 12:50
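
The single-language workaround described above can be scripted. The following is a rough Python sketch that builds one tesseract command per language, using the same `tesseract <image> <outputbase> -l <lang>` form as the original report. The output base names (`out_ara`, `out_eng`) and the helper names are made up for this example, and the trailing `hocr` config argument assumes the installed tesseract ships that config file.

```python
import subprocess
from typing import List


def single_language_commands(image: str, langs: List[str]) -> List[List[str]]:
    """Build one tesseract invocation per language instead of a combined
    -l ara+eng run. Output base names are hypothetical."""
    commands = []
    for lang in langs:
        output_base = f"out_{lang}"
        # "hocr" asks for hOCR output; assumes the hocr config is installed.
        commands.append(["tesseract", image, output_base, "-l", lang, "hocr"])
    return commands


def run_all(commands: List[List[str]]) -> None:
    """Run each command in turn, raising if tesseract exits with an error."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

For example, `run_all(single_language_commands("testara.tif", ["ara", "eng"]))` would run the Arabic and English passes separately, leaving one output per language to compare or merge afterwards.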

GoogleCodeExporter commented 9 years ago

I tried reproducing this with the latest code in SVN, but couldn't. I did 
however stumble across another bug with ara+eng, which I reported as issue 
1220. New training data is coming sometime soon, though, which is good, so with 
a bit of luck that might fix it. Until then, jsbien's hocr-merge recommendation 
does sound interesting.

Original comment by nick.wh...@durham.ac.uk on 27 May 2014 at 8:39

GoogleCodeExporter commented 9 years ago

Original comment by nick.wh...@durham.ac.uk on 27 May 2014 at 8:39

Now blocked on: #1220

GoogleCodeExporter commented 9 years ago

Fixed by change 2f197cd6537b

Original comment by theraysm...@gmail.com on 7 Oct 2014 at 4:01

Changed state: Fixed