tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Arabic Numbers #1193

Open ahmed-tea opened 6 years ago

ahmed-tea commented 6 years ago

Environment

Current Behavior:

It recognizes Arabic characters but cannot recognize Arabic numbers (ارقام عربى 0123456789). I tried tessdata, tessdata_best, and tessdata_fast.

Expected Behavior:

Suggested Fix:

amitdo commented 6 years ago

Did you try Arabic.traineddata?

ahmed-tea commented 6 years ago

@amitdo yes

ahmed-tea commented 6 years ago

It recognizes the characters (about 80%, including the Latin numbers), but it does not recognize the Arabic numbers inside the red rectangle (the original image has no red rectangle).
[image: my-national-identity-card-1-728]

I tried other pictures containing only numbers and got no numbers.
[image: arabnum page0001]

Shreeshrii commented 6 years ago

@theraysmith has not updated the repositories with changes to handle all these issues. Hence, you should not expect them to be fixed.

ahmed-tea commented 6 years ago

@Shreeshrii @theraysmith Are there changes that handle all these issues but have not yet been pushed to the repositories, or is there no fix?

Shreeshrii commented 6 years ago

I think Ray was planning to do new training to handle all these cases, but there has been no update from him since then. Based on past patterns, I would guess that he will make some updates to the project before year end!

ahmed-tea commented 6 years ago

Definition

I generated an experimental data file for recognizing AEN (Arabic Eastern Numbers) only. The output of Tesseract OCR will be in the form of AWN (Arabic Western Numbers, 0123456789).

https://github.com/ahmed-tea/tessdata_Arabic_Numbers

@Shreeshrii @theraysmith

Shreeshrii commented 6 years ago

Thanks for sharing the traineddata. Please let us know the success rate of OCR when using it.

Do you combine it with Arabic traineddata to get correct text plus Arabic numbers using -l Ara+

ahmed-tea commented 6 years ago

The success rate for the pictures above is 100% (numbers only), but in general it depends on the picture quality.

Combining with ara:

- current tesseract main repository: gives an error (mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file classify\adaptmatch.cpp, line 537)
- tesseract build by UB Mannheim: gives numbers only
- best and fast (ara and Arabic): not applicable, because they are used for LSTM only, so it gives numbers only

@Shreeshrii

Fahad-Alsaidi commented 6 years ago

@Shreeshrii sorry for the question, but how do I combine the new tessdata_Arabic_Numbers with the current one? I copied ara_number.traineddata into the tessdata dir and then used this command: tesseract -l ara_number+ara image.tif out.txt

but it doesn't work

ahmed-tea commented 6 years ago

@Fahad-Alsaidi You can't combine it with ara

raminas81 commented 6 years ago

Is there an ara.traineddata that has been tested and verified to recognize Arabic Eastern numbers? Please share a link if available. @ahmed-tea the following link returns an error; is there any alternative? https://github.com/ahmed-tea/tessdata_Arabic_Numbers

Many thanks

ahmed-tea commented 6 years ago

@raminas81 I think the error occurs because it works only with OEM_TESSERACT_ONLY (the old engine). It can't be combined with ara.traineddata.

AbdelsalamHaa commented 6 years ago

Hi, I'm trying to recognize Arabic numbers using tesseract 3.04. The results using the traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 (with the cube files, of course) are very random, and most of the recognized digits are wrong. Is there any other traineddata file for numbers only in tesseract 3.04?

One more thing, and I would be very grateful: if I want to include a whitelist for Arabic recognition, how can this be done? When I use English recognition, I do it as in the image below.

thank you so much
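For the whitelist question, one option that may apply is Tesseract's `tessedit_char_whitelist` config variable, passed on the command line with `-c`. The sketch below only builds the command line. The helper name `build_tesseract_args`, the file names, and the `ara` language code are illustrative assumptions, and whether the whitelist is honored depends on the Tesseract version and engine in use, so treat it as untested against 3.04.

```python
# Arabic-Indic digits U+0660..U+0669, built from code points to avoid
# copy/paste problems with right-to-left text.
ARABIC_INDIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))


def build_tesseract_args(image_path, output_base, lang="ara",
                         whitelist=ARABIC_INDIC_DIGITS):
    """Return an argv list that asks tesseract to restrict output to `whitelist`.

    Hypothetical helper: it only assembles arguments and does not run OCR.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


# Usage (requires tesseract on PATH and the chosen traineddata installed):
#   import subprocess
#   subprocess.run(build_tesseract_args("image.tif", "out"), check=True)
```

Passing the arguments as a list avoids shell quoting problems with non-ASCII characters.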

AbdelsalamHaa commented 6 years ago

@ahmed-tea Hi, I have used your Arabic number trained file for tesseract 4 and it's very good. I'm trying to make the same file for tesseract 3.04. I managed to do it, but the results are returned in Arabic as well, unlike your case, where the numbers are returned in English. I want my results returned in English because there are a lot of flips in the number order when the results are returned in Arabic, since the language runs from right to left. I hope you can help with this. Thank you so much in advance.
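As an alternative to retraining, the digits can be converted after recognition. This is a minimal post-processing sketch, not something proposed in the thread: it maps Arabic-Indic digits (U+0660 to U+0669) and, as an added assumption, Extended Arabic-Indic digits (U+06F0 to U+06F9) to ASCII digits. The function name `to_western_digits` is hypothetical, and the sketch does not correct any reordering of digits in the OCR output.

```python
# Translation table: code point -> ASCII digit string.
DIGIT_MAP = {}
for i in range(10):
    DIGIT_MAP[0x0660 + i] = str(i)  # Arabic-Indic digits
    DIGIT_MAP[0x06F0 + i] = str(i)  # Extended Arabic-Indic digits (assumption)


def to_western_digits(text: str) -> str:
    """Replace Arabic-Indic digits with 0-9; all other characters are unchanged."""
    return text.translate(DIGIT_MAP)
```

Because `str.translate` leaves unmapped characters alone, the function can be applied to mixed text and numbers.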

ahmed-tea commented 6 years ago

@AbdelsalamHaa Use jTessBoxEditor https://github.com/nguyenq/jTessBoxEditor by @nguyenq. After generating the boxes and before training, readjust the character corresponding to each box. The tool will not accept Arabic numbers, so you have to enter the English number. The OCR will read the Arabic number, but the output will be the English number.

nguyenq commented 6 years ago

You can use your Arabic input method to enter Arabic digits, or use the built-in conversion tool. In the Character textbox, enter e.g. U+0668 and click the adjacent button twice or press the Enter key.
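To check which character a code point such as U+0668 stands for, the notation can be decoded with the Python standard library. This is a small illustrative sketch; the helper name `from_codepoint_notation` is hypothetical and is not part of jTessBoxEditor.

```python
import unicodedata


def from_codepoint_notation(notation: str) -> str:
    """Convert a string like 'U+0668' to the character it denotes."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected notation of the form U+XXXX")
    return chr(int(notation[2:], 16))


ch = from_codepoint_notation("U+0668")
print(unicodedata.name(ch))   # official Unicode character name
print(unicodedata.digit(ch))  # numeric value of the digit
```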

ahmed-tea commented 6 years ago

@AbdelsalamHaa

  1. Make a jpg image of Arabic numbers.
  2. In the Trainer tab, select the jpg image as training data.
  3. Set the language to the name you want for the tesseract data file.
  4. Select Make Box File only, then run.
  5. In the Box Editor, open the jpg image.
  6. For each box in the image you will find the corresponding character in the char column (it will be the wrong character).
  7. Readjust each char to match its box (it will not accept Arabic numbers, so you have to enter English numbers).
  8. Save.
  9. Go to the Trainer tab, select Train with Existing Box, and run.

@nguyenq I tried your method. The output of OCR recognition is the Unicode, not the number.

WaelKamel116 commented 5 years ago

@Ahmed-tea Thanks for sharing the training file. I’ve downloaded it but do not know how to add it to the tesseract training files. Can you share a guide?

zdenop commented 5 years ago

@ahmed-tea : is this issue solved?

AndreAhmed commented 5 years ago

@ahmed-tea did you succeed in combining Arabic numbers and Arabic words together?

salemalbadawi commented 5 years ago

@Shreeshrii hello, I have some questions:

  1. What is the best tool for training the engine on a language?
  2. Is there a minimum size for the training dataset or image?

salemalbadawi commented 5 years ago

We face a problem when we train the OCR on Indian numbers ( ١٢٣٤٥٦٧٨٩٠ ). We also get bad results when we try to read an image containing a paragraph with a mix of Arabic and Indian numbers. Any suggestions?

@Shreeshrii @zdenop

Shreeshrii commented 5 years ago

I have tried different types of fine tuning for adding the numbers but have not had much success. I think that the open source tesseract is missing some key component related to Arabic. We will have to wait till @theraysmith or @jbreiden can investigate and fix this.

Shreeshrii commented 4 years ago

Reports with latest versions:

Arabic-Indic numbers incorrectly recognized #2864

Some Arabic-Indic numbers are being reversed #2897

BasmaFahmy commented 4 years ago

https://github.com/Shreeshrii/tessdata_arabic this link may help, it helped me a lot.

sam-kurdi commented 4 years ago

Is there any Indic-Arabic numeral (only) dataset for training tesseract? images+ground truth

MahmoudMabrok commented 4 years ago

Definition

* **AEN** Arabic Eastern Numbers {٠١٢٣٤٥٦٧٨٩}

* **AWN** Arabic Western Numbers {0123456789}

I generated an experimental data file for recognizing AEN only. The output of Tesseract OCR will be in the form of AWN.

https://github.com/ahmed-tea/tessdata_Arabic_Numbers

@Shreeshrii @theraysmith

The link above is broken: the link text is correct, but the URL embedded in the link is invalid.

jishakrishnan commented 3 years ago

https://github.com/jishakrishnan/pytrsseract-arabic - try this out

engahmed1190 commented 2 years ago

Hi @Shreeshrii .

[image: e6d835cf894cfa4a]

In this example, the date (تار يخ السداد) appears like this:

تار يخ السداد : 48./١./١٠؟١٠؟‏

Any suggestions? Thanks.

wolfassi123 commented 2 years ago

@engahmed1190 Did you manage to solve the issue concerning the date?

engahmed1190 commented 2 years ago

I had to train the arabic number model on this format but still not reliable enough

wolfassi123 commented 2 years ago

@engahmed1190 By that, do you mean you had training data specifically for the date and then used two traineddata files, one for the date and one for the rest of the card? Or did you make a single traineddata for the entire card?

Shreeshrii commented 2 years ago

I have not had any success in training combined Arabic text and numbers. The reason I think is that Arabic text is RTL, Arabic numbers are treated as LTR, and in training text there are sometimes unicode control characters indicating RTL and LTR. If separate Arabic text and Arabic number traineddata work well in recognition, that might be the way to go. I haven't tried that.
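One way to check whether training text contains such directional control characters is to scan for them. The sketch below is illustrative only: the set of code points is the standard Unicode bidirectional formatting characters, the function names are hypothetical, and whether removing these characters actually helps training is untested here.

```python
# Unicode bidirectional formatting characters.
BIDI_CONTROLS = {
    0x061C,                  # ARABIC LETTER MARK
    0x200E, 0x200F,          # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    *range(0x202A, 0x202F),  # LRE, RLE, PDF, LRO, RLO
    *range(0x2066, 0x206A),  # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text: str):
    """Return (index, 'U+XXXX') pairs for each bidi control character in text."""
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(text)
            if ord(c) in BIDI_CONTROLS]


def strip_bidi_controls(text: str) -> str:
    """Return text with all bidi control characters removed."""
    return "".join(c for c in text if ord(c) not in BIDI_CONTROLS)
```

Running `find_bidi_controls` over each ground-truth line first shows where the marks occur before deciding whether to strip them.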

wolfassi123 commented 2 years ago

@ Yes, I have been adding several lines of my own data, which includes Arabic numbers and dates, and I always end up getting "Compute CTC targets failed". I understand this happens because I have bad training data (it's just numbers and dates, but Tesseract seems to have issues with it), and it makes up a reasonable percentage of my training data. I will have to train a separate dataset for dates and ID numbers, then, and one for the text. How should I proceed to make my own training data for numbers and dates?

Shreeshrii commented 2 years ago

I had to train the arabic number model on this format but still not reliable enough

If you can share your training process as well as the Arabic numbers traineddata, it will benefit a number of users.

engahmed1190 commented 2 years ago

I have used this traineddata and used date in this format

١٩٥٥%٠١%٠٢

Still have problems in converting it to tesstrain format using https://github.com/Shreeshrii/tesstrain

I have uploaded a sample of the training data with a ground truth label here

I had the same idea about separating Arabic Numbers from Arabic Text and using two traineddata instead of one but the accuracy is very low

ramipro11 commented 2 years ago

Hello, how can I resize an image dynamically for reading Arabic numbers in tesseract?

ramipro11 commented 2 years ago

@ahmed-tea

MostafaAbdElRasoul commented 10 months ago

@ahmed-tea I can't find any way to combine Arabic characters and Arabic numbers in one traineddata file. I need to scan a file that has both Arabic characters and numbers. Can anyone help?