How to always read left to right?

Yenthe666 commented 8 years ago

Hi guys,

I've been developing a bit with Ocropy, but it sometimes seems to read from top to bottom. I'd like it to always read from left to right, no matter what. Does anybody have any idea how to do this?

P.S.: my apologies for creating an issue for this.

amitdo commented 8 years ago

It's not clear what you mean by '...seems to read from top to bottom'. Could you post an image + the command(s) you used to demonstrate the issue?

Yenthe666 commented 8 years ago

@amitdo thanks for your response. The image: test-000

The raw result:

ABC Painting & Renovators
Sample Customer Pty
12 Woodridge Rd
Sunbury 3320
Vic, Australia
Dear Mr. William
Description Of Work
Thannk you for the opportunity to quote. We are pleased to quote as follows :
Painting of office unit at 12 Woodridge Rd. Price includes
- All surface preparation
- 1 undercoat and 2 finishing coats to the color of your choice
- Supply of paint and labourl workmanship
Remarks
PAYMENT TERM4S : 30%4 deposit required to start work. Balance 70% on
completion
VALIDITY ; 90 days from the date of this quote
Wre trust that you will find our quote satisfactory and look forward to working with
you. Please contact us should you have any questions at all.
for ABC Painting and Renovations
to accept, please sign and fax back
OUOTATION
Biz Reg No : 1234-9876
GQuotation No. OT10000
Date
Our Ref.
Cust Ref.
Terms
17RO32008
Amount
$12,500.00
Tax $1,136.36
Total $12,500.00

As you can see, it seems to go from top to bottom for a first 'row', then again for the next part, and so on. Ocropy should always read from left to right so that I know the order in which the text appears in the image.

The commands (to give you a rough idea):

ocropus-nlbin -n big-000.jpg -o book
ocropus-gpageseg -n --maxcolseps 0 book/0001.bin.png
ocropus-rpred -m en-default.pyrnn.gz book/0001/*.png
All text from all files into one file:
cat book/????/??????.txt > ocr.txt

Also, performance is quite bad. It takes 55 seconds to process one A4 image and get all the text out. Am I missing anything, or can I optimize this?

amitdo commented 8 years ago

Well, the layout analysis phase (ocropus-gpageseg) does not do a very good job with this document. You can try playing with parameters related to column recognition, but it might not help.

In general, the current layout analysis in ocropy is too basic.

Did you try Tesseract? https://github.com/tesseract-ocr/tesseract https://groups.google.com/d/forum/tesseract-ocr

Yenthe666 commented 8 years ago

@amitdo I did try Tesseract before but I seemed to get very bad results. I've given it another try and results seem a lot better though.

So the results seem to be good, but I have the exact same problem: the text is scrambled and I have no idea about the order (at least not programmatically). So is Tesseract better than Ocropy? It does seem a lot faster. Is there anything I can do to strictly order the text from left to right? Can I also get the position of every line of text in the image?

Thanks for the help!

zuphilip commented 8 years ago

You can get the position of every line (word or maybe even character) in the hocr output, which both programs provide.
AFAIK tesseract uses dictionaries for the recognition, whereas ocropus does not use any dictionary. Thus, there can be differences for dictionary words and for other words such as names or numbers.
I don't know much about the efficiency, but I think there is an option in ocropus to use several cores and therefore run some calculations in parallel.

Yenthe666 commented 8 years ago

How exactly can I get the hocr output? Can I access it just by passing a parameter?
Is there a positive side to using a dictionary? I'd like to understand the deeper ideas and/or approach behind this. I'm still not sure if I should use tesseract or ocropus.
@zuphilip for the CPUs I'm assuming you're talking about the following: https://github.com/tmbdev/ocropy/blob/master/ocropus-gpageseg#L83-L84 This is indeed something that I should try. So, by default, Ocropus uses just one core?

zuphilip commented 8 years ago

For ocropus you can try ocropus-hocr, and in tesseract I think that hocr is a configuration script you can add as another parameter.
For all commands normally you can see the options and also the default values by calling with the -h or --help option, e.g. ocropus-gpageseg -h which shows that by default 3 cores are used.
The use of a dictionary can help you with all the words which are normally in a dictionary. Let us assume that the algorithm recognizes hovse. By comparison with a dictionary this can be corrected to the English word house. However, if you are dealing with really old texts, then the spelling might have been different in those times. Moreover, a dictionary normally will not help you with names like "William" or "Springvalue", abbreviations like "Biz Reg No", or numbers. This is as far as I understand it, but I am no expert in these topics.
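
As a rough illustration of the dictionary idea described above (and not of how Tesseract actually implements it), the sketch below corrects a recognized word against a small word list using Python's standard difflib module. The word list, the cutoff value and the function name `correct_word` are assumptions made for this example.

```python
import difflib

# Hypothetical mini-dictionary; a real system would use a full word list.
VOCABULARY = ["house", "painting", "quote", "work"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest dictionary word, or the input if nothing is close."""
    if word.lower() in vocabulary:
        return word
    # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0).
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("hovse"))      # a near miss of a dictionary word
print(correct_word("William"))    # a name that is not in the dictionary
print(correct_word("1234-9876"))  # a number
```

With this word list, hovse is mapped to house, while William and 1234-9876 come back unchanged. That mirrors the point made above: a dictionary helps with ordinary words but not with names or numbers.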

amitdo commented 8 years ago

About the use of a dictionary. You can search for the phrase 'ocr language model' in google. Here is an interesting paper written by Ray Smith, the lead developer of Tesseract. Limits on the Application of Frequency-based Language Models to OCR http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36984.pdf

tmbdev commented 8 years ago

@Yenthe666 Yes, ocropy is written in NumPy and only uses one core, because the image processing and numerical libraries in NumPy are generally single core only and because Python itself has very limited threading.

What's been happening is:

The LSTM portion of ocropy exists as a C++ project now (CLSTM)
I'm integrating CLSTM into TensorFlow, which gives you several options for parallel and GPU training of LSTM and image processing

tmbdev commented 8 years ago

@amitdo We have had good experiences with LSTM-based language models for OCR correction; CLSTM allows you to implement those.

Yenthe666 commented 8 years ago

@zuphilip I can't seem to find anything regarding hocr output that you can add. I tried --print-parameters but this literally prints the params. The ocropy ocropus-hocr command seems to give such an output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>OCR Results</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta name="Description" content="OCRopus Output" />
<meta name="ocr-system" content="ocropus-0.4" />
<meta name="ocr-capabilities" content="ocr_line ocr_page" />
</head>
<body>

<div class='ocr_page' title='file 0001.bin.png'>
<span class='ocr_line' title='bbox 210 3261 1074 3334'>ABC Painting \&amp; Renovators</span><br />
<p />
<span class='ocr_line' title='bbox 212 2898 638 2937'>Sample Customer Pty</span><br />
<span class='ocr_line' title='bbox 215 2837 522 2876'>12 Woodridge Rd</span><br />
<span class='ocr_line' title='bbox 212 2786 460 2826'>Sunbury 3320</span><br />
<span class='ocr_line' title='bbox 211 2739 453 2776'>Vic, Australia</span><br />
<span class='ocr_line' title='bbox 213 2645 512 2676'>Dear Mr. William</span><br />
<p />
<span class='ocr_line' title='bbox 213 2484 605 2524'>Description Of Work</span><br />
<span class='ocr_line' title='bbox 211 2397 1579 2437'>Thannk you for the opportunity to quote. We are pleased to quote as follows :</span><br />
<span class='ocr_line' title='bbox 213 2319 1239 2359'>Painting of office unit at 12 Woodridge Rd. Price includes</span><br />
<span class='ocr_line' title='bbox 212 2241 638 2281'>- All surface preparation</span><br />
<span class='ocr_line' title='bbox 212 2167 1315 2206'>- 1 undercoat and 2 finishing coats to the color of your choice</span><br />
<span class='ocr_line' title='bbox 212 2089 967 2129'>- Supply of paint and labourl workmanship</span><br />
<p />
<span class='ocr_line' title='bbox 212 1932 382 1963'>Remarks</span><br />
<span class='ocr_line' title='bbox 213 1876 1432 1912'>PAYMENT TERM4S : 30%4 deposit required to start work. Balance 70% on</span><br />
<span class='ocr_line' title='bbox 213 1829 400 1864'>completion</span><br />
<span class='ocr_line' title='bbox 216 1735 995 1770'>VALIDITY ; 90 days from the date of this quote</span><br />
<span class='ocr_line' title='bbox 217 1642 1550 1678'>Wre trust that you will find our quote satisfactory and look forward to working with</span><br />
<span class='ocr_line' title='bbox 212 1595 1226 1630'>you. Please contact us should you have any questions at all.</span><br />
<p />
<span class='ocr_line' title='bbox 214 1399 792 1433'>for ABC Painting and Renovations</span><br />
<p />
<span class='ocr_line' title='bbox 214 1118 813 1153'>to accept, please sign and fax back</span><br />
<span class='ocr_line' title='bbox 1752 3123 2188 3180'>OUOTATION</span><br />
<span class='ocr_line' title='bbox 1751 3066 2153 3102'>Biz Reg No : 1234-9876</span><br />
<p />
<span class='ocr_line' title='bbox 1751 2902 2204 2939'>GQuotation No. OT10000</span><br />
<span class='ocr_line' title='bbox 1752 2808 1833 2839'>Date</span><br />
<span class='ocr_line' title='bbox 1751 2745 1898 2777'>Our Ref.</span><br />
<span class='ocr_line' title='bbox 1751 2683 1915 2714'>Cust Ref.</span><br />
<span class='ocr_line' title='bbox 1751 2621 1862 2652'>Terms</span><br />
<span class='ocr_line' title='bbox 2017 2808 2212 2839'>17RO32008</span><br />
<p />
<span class='ocr_line' title='bbox 2191 2493 2344 2524'>Amount</span><br />
<p />
<span class='ocr_line' title='bbox 2145 2322 2344 2361'>$12,500.00</span><br />
<p />
<span class='ocr_line' title='bbox 1980 1862 2344 1900'>Tax $1,136.36</span><br />
<span class='ocr_line' title='bbox 1945 1798 2345 1837'>Total $12,500.00</span><br />
</div>
</body>
</html>

I assume that the bbox values are the position? I saw the following line in the code: info += "bbox %d %d %d %d"%(x0,y0,x1,y1)
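
The bbox values can be pulled out of the hOCR with a few lines of Python and used to impose a reading order. What follows is only a sketch under stated assumptions: it expects the single-quoted `ocr_line` markup shown above, it assumes that y grows upward (judging from the output above, where the letterhead line has the largest y values), and it uses a hypothetical `row_tolerance` in pixels to decide when two lines sit on the same row. The function names are made up for this example and are not part of ocropy.

```python
import re
from html import unescape

# Matches the ocr_line spans as they appear in the ocropus-hocr output above.
LINE_RE = re.compile(
    r"<span class='ocr_line' title='bbox (\d+) (\d+) (\d+) (\d+)'>(.*?)</span>"
)


def parse_hocr_lines(hocr):
    """Return a list of (x0, y0, x1, y1, text) tuples, one per ocr_line span."""
    return [
        (int(x0), int(y0), int(x1), int(y1), unescape(text))
        for x0, y0, x1, y1, text in LINE_RE.findall(hocr)
    ]


def reading_order(lines, row_tolerance=20, y_grows_upward=True):
    """Group lines into rows by vertical centre, then sort each row by x0."""
    def centre(line):
        return (line[1] + line[3]) / 2.0

    # Top of the page first: descending y if y grows upward (assumption).
    ordered = sorted(lines, key=centre, reverse=y_grows_upward)
    rows = []
    for line in ordered:
        # Same row if the centre is close to that of the row's first line.
        if rows and abs(centre(rows[-1][0]) - centre(line)) <= row_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    # Within each row, read from left to right.
    return [line for row in rows for line in sorted(row, key=lambda l: l[0])]


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 212 2898 638 2937'>Sample Customer Pty</span>\n"
        "<span class='ocr_line' title='bbox 1751 2902 2204 2939'>GQuotation No. OT10000</span>\n"
    )
    for line in reading_order(parse_hocr_lines(sample)):
        print(line[4])
```

The tolerance needs tuning per document: if it is too large, neighbouring rows merge; if it is too small, text on the same visual row is split.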

@amitdo thanks for that link, I'll be reading that for sure!

@tmbdev I honestly think that is a major minus for this library then. The ability to use multiple cores would speed it up a lot! At this point this library is approx. 10 times slower than tesseract. With this library I need approx. 50-60 seconds per page. So, could I use your newer LSTM C++ project on Linux, and what kind of improvements / differences would I see compared to the Python version? I really like the library that you've written, a big :+1: for this. My major downside is performance at this point.

Yenthe666 commented 8 years ago

@tmbdev I was wondering, have you ever considered building the HTML exporter to re-create the same layout as the original PDFs? At this point I'm looking to convert a whole PDF document full of images to text with ocropy. I've found the ability to export data to HTML with ocropus-hocr, but it doesn't offer any layout options. Could you give me some details / guidance regarding the values in bbox? Two example items: Date  17032008 

Looking at the code, they correspond to x0, y0, x1 and y1, but could we somehow map this into an HTML document that has (approximately) the same layout as the original image / PDF? I'd love some input to better understand the x0, y0, x1 and y1 values so we can build something for this. :smile: CC @zuphilip and @amitdo
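
One possible way to approximate the original layout is to turn each bbox into an absolutely positioned HTML element. The following is a minimal sketch under stated assumptions: the boxes are (x0, y0, x1, y1) in pixels with y measured from the bottom of the page (which is what the hOCR output earlier in this thread suggests), the page height is known (3500 below is a made-up value), and `scale` and the function name are hypothetical choices for this example, not part of ocropy.

```python
from html import escape


def lines_to_positioned_html(lines, page_height, scale=0.25):
    """Render (x0, y0, x1, y1, text) tuples as absolutely positioned divs.

    Assumes y is measured from the bottom of the page, so the CSS `top`
    offset of a line is page_height - y1.
    """
    parts = [
        '<div style="position:relative;height:%dpx">' % round(page_height * scale)
    ]
    for x0, y0, x1, y1, text in lines:
        left = round(x0 * scale)
        top = round((page_height - y1) * scale)
        # Use the box height as a rough font size.
        font = max(1, round((y1 - y0) * scale))
        parts.append(
            '<div style="position:absolute;left:%dpx;top:%dpx;'
            'font-size:%dpx;white-space:nowrap">%s</div>'
            % (left, top, font, escape(text))
        )
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    example = [
        (1752, 2808, 1833, 2839, "Date"),
        (2017, 2808, 2212, 2839, "17RO32008"),
    ]
    print(lines_to_positioned_html(example, page_height=3500))
```

Because both example boxes share the same y range, they end up at the same `top` offset and therefore on the same visual row, with the value to the right of its label. The text passed in should be plain text; entities such as `&amp;` from the hOCR would need to be unescaped first to avoid double escaping.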

avr248 commented 8 years ago

HOCR document with two columns (https://github.com/tmbdev/ocropy/blob/master/tests/testpage.png): link to result

http://www.visualinfo.co.il/book.html

wanghaisheng commented 8 years ago

@avr248 how do you transform the hocr HTML into a page with the original layout?

manhcuogntin4 commented 7 years ago

@tmbdev I am trying to use ocropus-gpageseg to segment some images in order to prepare a dataset for LSTM training, but I found that the line images it outputs are not always correct. Sometimes information is lost in the output line image. For example, in the original image the line is "15 September 2010" and the text is underlined, but in the output of ocropus-gpageseg the image shows "15 Se tember 2010" and "Se tember" is not underlined. Is this problem due to ocropus-gpageseg? Is there any solution? Thank you! 010008 bin

zuphilip commented 6 years ago

This issue is now diverging into several very different angles:

For visualization of hocr files there is now an interesting project: https://github.com/kba/hocrjs
@manhcuogntin4 AFAIK ocropus tries to delete small lines, and I guess that the underlining in September triggered such a line removal, which wrongly deleted the whole component including the p.

I close this issue here. If you want to continue any of the discussions from here, then please open a new issue.

ocropus-archive / DUP-ocropy

How to always read left to right? #80