tesseract-ocr / tesseract

Text is changed when I copy it from a searchable PDF file (created with the tesseract command) and paste it into Notepad #1945

Closed: ghost closed this issue 6 years ago

ghost commented 6 years ago

I created a searchable PDF file by running the following command on one of my images.

tesseract page.jpg test pdf --oem 1 --psm 5 -l urd

This is the image that I converted to a searchable PDF.

The image contains Urdu text, but when I copy it from the newly created PDF file and paste it into any text editor, this is what I get:

GehbFie”

Any help will be highly appreciated. Thanks in advance.

amitdo commented 6 years ago

What's the output of:

tesseract page.jpg test1 --oem 1 --psm 5 -l urd

and

tesseract page.jpg test2 --oem 1 -l urd

Shreeshrii commented 6 years ago

Which PDF viewer are you using?

ghost commented 6 years ago

What's the output of:

tesseract page.jpg test1 --oem 1 --psm 5 -l urd

and

tesseract page.jpg test2 --oem 1 -l urd

Hello @amitdo, thanks for your reply. Here is the output of the commands you suggested.

The output of tesseract page.jpg test1 --oem 1 --psm 5 -l urd is:

٦.- ہم)ذبے ٹا تہ

This is completely different from the original text, which is:

(حتمی انتخابی فہرست (مرد

The output of tesseract page.jpg test2 --oem 1 -l urd is:

(تھی اضتقالی فہرست (رر

This output has no meaning in Urdu because tesseract has substituted some of the letters, but it is a little closer to what the real output should look like.

Note: the font style used in the image is "Nastaliq".

ghost commented 6 years ago

Which PDF viewer are you using?

Hello @Shreeshrii, thanks for your reply. I used the Google Chrome browser, the Microsoft Edge browser, and Adobe Acrobat Reader. The result is the same in all three viewers.

ghost commented 6 years ago

Hello @amitdo @Shreeshrii @jbreiden, could any of you please suggest what I should do to solve this issue? I think it is an encoding error.

Shreeshrii commented 6 years ago

I am not able to replicate your error, i.e.:

this is what I am getting. GehbFie”

The output I am getting is text in Urdu/Arabic, though it might not be correct.

The output from various psm, tessdata_fast, and tessdata_best combinations is in the attached zip.

urdupdf.zip

ghost commented 6 years ago

@Shreeshrii I have recorded a video of my procedure, please have a look. The command tesseract page.png test pdf --oem 1 --psm 5 -l urd also prints some errors while creating the PDF file. Creating a text file with tesseract does not show these errors in the command prompt:

read_params_file: Can't open -oem
read_params_file: Can't open 1
read_params_file: Can't open -psm
read_params_file: Can't open 5
read_params_file: Can't open l
read_params_file: Can't open urd

Please check the attached video for a better understanding of my procedure, and please correct me if I am wrong at any step.

Thank you.

video.zip

Shreeshrii commented 6 years ago

pdf is a config file name. It needs to come last in the command, after --oem, --psm, -l, etc.
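As a minimal sketch of that ordering, the helper below only prints a tesseract command line with the options first and the config name pdf at the very end. The function name ocr_pdf_cmd and its argument order are hypothetical and used purely for illustration. The file names and option values are the ones from this thread.

```shell
# Hypothetical helper (not part of tesseract): prints a command line with
# the options placed before the trailing config name "pdf".
# Arguments: image, output base name, language, page segmentation mode.
ocr_pdf_cmd() {
  # --oem, --psm and -l must come before the config name; anything placed
  # after "pdf" would be read as another config file name, which is what
  # the "read_params_file: Can't open ..." messages indicate.
  printf 'tesseract %s %s --oem 1 --psm %s -l %s pdf\n' "$1" "$2" "$4" "$3"
}

# Prints the reordered form of the command from this thread.
ocr_pdf_cmd page.jpg test urd 5
```

The printed line can be copied into the command prompt as is. Only the position of pdf changes compared with the original command.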

Shreeshrii commented 6 years ago

Why are you using --psm 5? In my tests, --psm 6 gives better results.

ghost commented 6 years ago

The image here is just a sample. I actually have to run this command on different images, which are confidential, so I can't post them here. With --psm 6, the text-format output for those images is close to the actual text.

ghost commented 6 years ago

pdf is a config file name. It needs to come last in the command, after --oem, --psm, -l, etc.

Thank you so much, this solved my issue 👍 Please close this issue. :)

Shreeshrii commented 6 years ago

Have you tried copying from a PDF created using --psm 6?

Shreeshrii commented 6 years ago

Since you opened the issue, you have the ability to close the issue.

ghost commented 6 years ago

Have you tried copying from a PDF created using --psm 6?

Yes, the result is nearly the same for both --psm 5 and --psm 6 on the sample image.

ghost commented 6 years ago

Since you opened the issue, you have the ability to close the issue.

OK, I am closing the issue. Thank you for your help.