Regarding OpenOCRCorrect Operations

NikhilKhuje2797 commented 5 years ago

Hello sir. When I use your test data (image and OCR text), the system performs well after loading the text for spell check (e.g. misspelled words are highlighted in colour).

But with my own test case, the text gets auto-corrected without the wrong words being highlighted in colour, and in that auto-correction process some correct words are also changed into wrong ones. Sir, could we operate the system with CLA instead of the GUI? Thank you.

rohitsaluja22 commented 5 years ago

Thanks for your interest. Remove the file CPair to disable auto-corrections. Auto-corrections (and CPair) depend on the OCR system and domain. I have no idea about CLA; if you have any issues related to Qt and its GUI, I can try to help.
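A minimal sketch of the "remove CPair" step, assuming the file is literally named CPair and sits in a project folder you pass in (the thread does not say where it lives). The helper name `disable_auto_corrections` is hypothetical, and renaming to a backup instead of deleting is a choice made here so the change can be undone.

```python
from pathlib import Path


def disable_auto_corrections(project_dir, filename="CPair"):
    """Rename the CPair file so the tool no longer finds it.

    Returns the path of the backup file, or None if no such file exists.
    The default file name and its location are assumptions.
    """
    source = Path(project_dir) / filename
    if not source.is_file():
        return None
    backup = source.with_name(filename + ".bak")
    # replace() overwrites an older backup instead of failing.
    source.replace(backup)
    return backup
```

Renaming the backup to CPair again should restore the previous auto-correction behaviour.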

NikhilKhuje2797 commented 5 years ago

Sir, if we load one text file in the GUI and then try to load the next page by clicking on "+", it doesn't work.

NikhilKhuje2797 commented 5 years ago

In your test cases, when we load a text file the corresponding image is loaded automatically, but this doesn't happen with my data.

rohitsaluja22 commented 5 years ago

The text file and the image should have the same file name, and the names should follow the pattern "page-i.txt" and "page-i.jpeg", where i goes from 1 to no_of_pages.
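One way to check a data folder against this naming rule before opening it in the GUI is sketched below. It is only an illustration: the function name `check_page_files` is hypothetical, it assumes the text files and images sit together in one folder, and it is not part of OpenOCRCorrect.

```python
import re
from pathlib import Path

# Matches page-1.txt, page-2.jpeg, ... (no leading zeros in the page number).
PAGE_NAME = re.compile(r"^page-([1-9]\d*)\.(txt|jpeg)$")


def check_page_files(folder):
    """Report pages that lack a text file or an image.

    Returns (missing_text, missing_image, unexpected): two sorted lists of
    page numbers and a sorted list of file names that do not follow the
    page-i.txt / page-i.jpeg pattern.
    """
    text_pages, image_pages, unexpected = set(), set(), []
    for entry in Path(folder).iterdir():
        if not entry.is_file():
            continue
        match = PAGE_NAME.match(entry.name)
        if match is None:
            unexpected.append(entry.name)
        elif match.group(2) == "txt":
            text_pages.add(int(match.group(1)))
        else:
            image_pages.add(int(match.group(1)))
    all_pages = text_pages | image_pages
    missing_text = sorted(all_pages - text_pages)
    missing_image = sorted(all_pages - image_pages)
    return missing_text, missing_image, sorted(unexpected)
```

A file such as scan.jpg or page-1.jpg would show up in the third list, which is the kind of mismatch that would stop the image from loading alongside the text.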

rohitsaluja22 commented 5 years ago

You can move from page-1.txt to page-2.txt by clicking on "Page(CtrlShftR)>>". This will also change the image from page-1.jpeg to page-2.jpeg. Use "Open" (to the right of +) to load only the first file. Do not use +; it will load only the text file.

NikhilKhuje2797 commented 5 years ago

Hello sir, I have a doubt regarding the colour marking in the system. According to the documentation and your test cases, only misspelled words are marked in colour, and correct words appear in the normal colour. But when I check my own data, even correct words are colour-marked, so I am not able to distinguish between wrong and correct words by looking at the colours. The documentation says the "Correct" folder contains correct pages, so does this folder contain manually corrected samples? What is needed if I want to run the same process on my data and get the same results as shown with yours?

Please guide. Thank you.

rohitsaluja22 commented 5 years ago

Which two OCR systems do you use? The quality of the colour coding depends on how different the models and training data of the two OCR systems are. The more different they are, the better the quality.
The samples were corrected using our software. For the demo, we cannot keep them in the folder "Corrected", so we just moved them from "Corrected" to "Correct".
Everything is explained in the Readme. Read it carefully. I agree that it's tedious, but once understood it saves a lot of time.
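The point about colour coding depending on how much the two OCR systems differ can be illustrated with a small word-level comparison. This is only a sketch using Python's standard `difflib`; it is not how OpenOCRCorrect computes its colours, and the function name `flag_disagreements` is hypothetical.

```python
from difflib import SequenceMatcher


def flag_disagreements(ocr_a_text, ocr_b_text):
    """Return (word, agreed) pairs for every word of the first OCR output.

    agreed is True when the aligned word in the second output is identical.
    """
    words_a = ocr_a_text.split()
    words_b = ocr_b_text.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # "insert" opcodes cover no words of the first text, so they add nothing.
        for word in words_a[a_start:a_end]:
            flagged.append((word, tag == "equal"))
    return flagged
```

If both engines make the same mistake on a word, the word is reported as agreed and would not stand out, which is why two closely related engines give little useful signal.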

NikhilKhuje2797 commented 5 years ago

So sir, you have used Indsenz and Google Doc OCR outputs for the quality difference?

rohitsaluja22 commented 5 years ago

Yes, for Sanskrit. Which OCR systems are you using, and what language are you working on?

NikhilKhuje2797 commented 5 years ago

And for Hindi?

On Tue 18 Dec, 2018, 3:57 PM rohitsaluja22, notifications@github.com wrote:

yes, for Sanskrit. what OCR systems you are using?

NikhilKhuje2797 commented 5 years ago

I am using Tesseract OCR.

NikhilKhuje2797 commented 5 years ago

So, for better spell-checking quality, it is mandatory for me to use two different OCR systems. I am working only with the Hindi language. So the folder Book3hindi contains outputs from two different OCR systems?

NikhilKhuje2797 commented 5 years ago

I am using Tesseract OCR and Google Doc OCR.

NikhilKhuje2797 commented 5 years ago

Sir, I have combined my dictionary with yours, taken sample pages converted by Google Doc and Tesseract OCR, and loaded them into the system. The same issue occurs: wrong words are not shown in colour. I have also created the IEOCR and GEOCR data folders. Please guide. Thank you.

rohitsaluja22 commented 5 years ago

Yes, you should try Indsenz and Tesseract, or Indsenz and Google Doc. Tesseract and Google Doc are both from Google; that is probably the reason you are not getting good results.

Or send me your folder structure via mail. I can check if something else is wrong.

NikhilKhuje2797 commented 5 years ago

Indsenz offers only a premium version, which is not affordable for me. Can you suggest another OCR to use in combination with Tesseract? Thank you.

NikhilKhuje2797 commented 5 years ago

Sir, I have correct words in my file such as पडे़ , लडे़ , पडे,लडे but when I click the spellcheck button they automatically become पड़ए,लड़ए, even though my dictionary doesn't contain these words (पड़ए,लड़ए). What should I do to correct this? Thank you.

NikhilKhuje2797 commented 5 years ago

Sir, the tool is working well now. I have set up my data according to the standard names. Thank you.

rohitsaluja22 commented 5 years ago

Cool, all the best. Please reply with which OCR engines you are using and then close the issue.

NikhilKhuje2797 commented 5 years ago

Tesseract-OCR and Google Doc OCR. Thank you.

rohitsaluja22 / OpenOCRCorrect

Regarding OpenOCRCorrect Operations #4