What's the output of:
tesseract page.jpg test1 --oem 1 --psm 5 -l urd
and
tesseract page.jpg test2 --oem 1 -l urd
Which PDF viewer are you using?
Hello @amitdo, thanks for your reply. Here is the output of the commands you suggested.
The output of tesseract page.jpg test1 --oem 1 --psm 5 -l urd
is:
٦.- ہم)ذبے ٹا تہ
This is completely different from the original text, which is:
(حتمی انتخابی فہرست (مرد
The output of tesseract page.jpg test2 --oem 1 -l urd
is:
(تھی اضتقالی فہرست (رر
This output is meaningless in Urdu because tesseract has substituted different Urdu letters, but it is a little closer to what the real output should look like.
Note: the font style used in the image is "Nastaliq".
Hello @Shreeshrii, thanks for your reply. I used the Google Chrome browser, the Microsoft Edge browser, and Adobe Acrobat Reader; the result is the same in all three viewers.
Hello @amitdo @Shreeshrii @jbreiden, could any of you please suggest what I should do to solve this issue? I think it's an encoding error.
I am not able to replicate your error, i.e.
this is what I am getting. GehbFie”
The output I am getting is text in Urdu/Arabic, though it might not be correct.
The output from various psm and tessdata_fast and tessdata_best combinations is in the attached zip.
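A comparison like the one described above could be scripted roughly as follows. This is only a sketch: compare_psm is a hypothetical helper, the out_psmN output names are made up for illustration, and it varies only --psm (not the tessdata_fast/tessdata_best model sets). The TESSERACT variable is an assumption added so that the binary can be swapped out, for example with echo for a dry run.

```shell
# Hypothetical helper: run one image through several page segmentation modes.
# Usage: compare_psm IMAGE PSM [PSM...]
# TESSERACT may be set to another program (e.g. echo) for a dry run;
# it defaults to the tesseract binary on PATH.
compare_psm() {
    image=$1
    shift
    for psm in "$@"; do
        # Each mode writes its own text file, e.g. out_psm6.txt
        "${TESSERACT:-tesseract}" "$image" "out_psm$psm" --oem 1 --psm "$psm" -l urd
    done
}
```

For example, compare_psm page.jpg 5 6 would write out_psm5.txt and out_psm6.txt, which can then be compared against the original text.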
@Shreeshrii I have recorded a video of my procedure; please have a look.
tesseract page.png test pdf --oem 1 --psm 5 -l urd
This command also produces errors while creating the PDF file; creating a text file with tesseract does not show the following errors in the command prompt:
read_params_file: Can't open -oem read_params_file: Can't open 1 read_params_file: Can't open -psm read_params_file: Can't open 5 read_params_file: Can't open l read_params_file: Can't open urd
Please check the attached video for a better understanding of my procedure, and please correct me if I am wrong at any step.
Thank you.
pdf is a config file name. It needs to come last in the command, after --oem, --psm, -l, etc.
Why are you using --psm 5? In my tests, better results are achieved with --psm 6.
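To illustrate the ordering point above: tesseract's usage line is "tesseract imagename outputbase [options...] [configfile...]", so everything after the first config name is read as another config file, which would explain the read_params_file errors. A minimal sketch follows; ocr_pdf is a hypothetical wrapper, not part of tesseract, and the TESSERACT variable is an assumption added to allow a dry run.

```shell
# Hypothetical wrapper: create a searchable PDF with the arguments in a valid order.
# Usage: ocr_pdf IMAGE OUTPUTBASE PSM
# TESSERACT may be set to another program (e.g. echo) for a dry run.
ocr_pdf() {
    image=$1
    outbase=$2
    psm=$3
    # Options (--oem, --psm, -l) come first; the config name "pdf" comes last.
    "${TESSERACT:-tesseract}" "$image" "$outbase" --oem 1 --psm "$psm" -l urd pdf
}
```

Calling ocr_pdf page.jpg test 6 corresponds to running tesseract page.jpg test --oem 1 --psm 6 -l urd pdf.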
The image here is just a sample. I actually have to run this command on different images, which are confidential, so I can't post them here. With --psm 6, the text output for those images is close to the actual text.
Thank you so much, this solved my issue 👍 Please close this issue. :)
Have you tried copying from a PDF created using --psm 6?
Since you opened the issue, you have the ability to close the issue.
Yes, the result is nearly the same for both --psm 5 and --psm 6 on the sample image.
OK, I am closing the issue. Thank you for your help.
I have created a searchable PDF file by running the following command on one of my images.
tesseract page.jpg test pdf --oem 1 --psm 5 -l urd
This is the image that I converted to a searchable PDF.
The image contains Urdu text, but when I copy it from the newly created PDF file and paste it into any other text editor, this is what I am getting.
Any help will be highly appreciated. Thanks in advance.