goncalopp / simple-ocr-opencv

A simple python OCR engine using opencv
GNU Affero General Public License v3.0

Different Number of lines #27

Closed: MarkRatFelt closed this issue 6 years ago

MarkRatFelt commented 6 years ago

Hi!

I really appreciate this project. I tested it with the examples and it works pretty well. I'm actually quite interested in getting the lines it shows (the line starts and ends, while waiting for input), because with the lines I could experiment with LSTMs, which I already use.

The problem is that when I use another image, I get an error:

Exception: different number of lines

I was debugging, and somehow

tops = self._guess_lines(segment_tops) 
bottoms = self._guess_lines(segment_bottoms)

are not the same length. But I don't know why, or how to fix it.
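
For anyone debugging the same thing: a quick way to see which side picks up an extra line is to cluster the segment tops and bottoms separately and print both lists. This is only a rough standalone stand-in for the idea, not the library's _guess_lines, and the gap threshold is a made-up value:

    import numpy as np

    def cluster_1d(values, max_gap=5):
        # Group sorted y-coordinates into clusters separated by more than
        # max_gap pixels. Rough stand-in, NOT the library's _guess_lines.
        values = np.sort(np.asarray(values))
        clusters = [[values[0]]]
        for v in values[1:]:
            if v - clusters[-1][-1] > max_gap:
                clusters.append([v])
            else:
                clusters[-1].append(v)
        return [int(np.mean(c)) for c in clusters]

    # segments is assumed to be an (N, 4) array of (x, y, width, height)
    # bounding boxes, which is what I see in the debugger.
    def compare_line_guesses(segments, max_gap=5):
        tops = cluster_1d(segments[:, 1], max_gap)
        bottoms = cluster_1d(segments[:, 1] + segments[:, 3], max_gap)
        print(len(tops), "top lines:", tops)
        print(len(bottoms), "bottom lines:", bottoms)

Whichever list comes out longer points at where the extra line is being guessed.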

Maybe it's not a bug, but I need to do something else.

The pic I'm trying to work with is here:

(attached image)

It'd be OK if there are some errors; if the lines are not so good, I could try to fix that and make a pull request. But it'd be nice if it at least tried :D

MarkRatFelt commented 6 years ago

With a png image it works "fine."

So, I'm closing the issue. However, the lines were not good.

(attached image: false_lines)

But I can try to think of a way to fix it :) However, with pics like that, the job is pretty hard.

RedFantom commented 6 years ago

While the error message is not very descriptive, as the library itself can't determine exactly how it was caused, I had this issue when using images that were not as high-quality as the example ones.

The library depends on (if I remember correctly, it's been a while) OpenCV Feature Matching. This means that a few things are important for the images you're trying to use:

In practice, these requirements mean that if you try to perform OCR on receipts using this library, it would be better to scan them. Perhaps photographing them with Office Lens (or the FOSS alternative Open Note Scanner, which I haven't tried yet) would work too.
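
If scanning is not an option, a quick pre-processing pass can already get a phone photo much closer to what the example images look like. This is just a sketch using plain OpenCV with guessed parameter values, not something the library does for you:

    import cv2

    # Rough "fake scan" of a phone photo: grayscale, light blur, adaptive
    # threshold. Kernel size and threshold parameters are guesses; tune them
    # for your own images.
    image = cv2.imread("receipt.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    cv2.imwrite("receipt_clean.png", binary)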

This library is not really as good as, say, Tesseract, and it's still more of a learning tool than a real library. It is useful in a limited number of use cases, but if yours is not one of them, looking at Tesseract or another library is probably a better idea.

MarkRatFelt commented 6 years ago

Hi! Thanks for your answer.

Yeah, sure. I use the alpha version of Tesseract, which implements the LSTM from Ocropus, and the result is actually not bad. However, it could be improved if the lines or blocks of text were found and deskewed first, for example.
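
For reference, this is roughly how I call it, assuming the pytesseract wrapper around a Tesseract 4 alpha build; --oem 1 selects the LSTM engine, --psm 6 treats the image as a single block of text, and "deu" is the German language data for this receipt:

    import cv2
    import pytesseract

    image = cv2.imread("receipt.png")
    # --oem 1: LSTM engine only; --psm 6: assume a single uniform block of text
    text = pytesseract.image_to_string(image, lang="deu",
                                       config="--oem 1 --psm 6")
    print(text)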

On the same picture, if I run "plain Tesseract" I get this result:

©
Leipziger Strasse 76
01127 Dresden
MONTAG — SAMSTAG 8 — 21 Uhr

EUR
Milchdrink Banane 0,55 A
Kü.kernbrötchen 0,598 A 6

2 x 0,29 %
\ Mini—Calz. Pros 1,29 A
Ziegen—Rahm Kräuter 1,99 A
\ Milchr. Kokos—Schoko 0,39 A
'zu zahlen 4,80
Karte 4,80
MWST % MWST + Netto E Brutto
Summe , 37 es 49 —

W i ll i m”

1675 | 073767/02
— UST—ID—NR:; Del e

which is fine.

But if I get this:

(attached image: tmp)

and fix the rotation (and then of course the binarization, etc.), the result should be better, in theory :)

I was just wondering what happens if I get the lines, fix the rotation per line, and then apply the LSTM on just that line.
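
In case it's useful, this is the kind of per-line deskew I have in mind. It's just a sketch built on the usual cv2.minAreaRect trick, with a hypothetical deskew() helper; the angle convention changed in newer OpenCV releases, and this follows the older one where minAreaRect returns angles in [-90, 0):

    import cv2
    import numpy as np

    def deskew(line_image):
        # Estimate the skew of one cropped text line and rotate it back.
        # Hypothetical helper, not part of this repo.
        gray = cv2.cvtColor(line_image, cv2.COLOR_BGR2GRAY)
        binary = cv2.threshold(gray, 0, 255,
                               cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
        coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:   # pre-4.5 OpenCV angle convention
            angle = -(90 + angle)
        else:
            angle = -angle
        h, w = line_image.shape[:2]
        matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        return cv2.warpAffine(line_image, matrix, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

Each deskewed (and then binarized) line crop would go to the LSTM separately.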

And as this project detects some lines, I was taking a look to see if it could be optimized.

goncalopp commented 6 years ago

Hi MarkRatFelt,

I was just wondering what happens if I get the lines, fix the rotation per line, and then apply the LSTM on just that line.

It's not doable at all, I'm afraid.

RedFantom already mentioned the most important point: this is an educational tool. It's meant to be easy to understand and tinker with, not to perform OCR with high accuracy.

If I recall correctly, the current line detection algorithm makes the following assumptions:

Your image breaks all of these assumptions. There's no way it will ever be able to do what you want.

If you're interested in developing a second line finding algorithm, I'd love to merge it in. It sounds like you already found the appropriate code locations. But I'd probably start looking into unwarping first.
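
For example, the classic starting point would be a horizontal projection profile: binarize the image, sum the ink per row, and cut wherever the profile stays at zero for a while. A minimal sketch, assuming a binarized image with text as non-zero pixels and roughly horizontal lines (this is not what the current segmenter does, and the gap parameter is a made-up value):

    import numpy as np

    def find_lines_by_projection(binary, min_gap=2):
        # Sum ink per row and split on runs of at least min_gap empty rows.
        # Returns a list of (top, bottom) row indices, bottom exclusive.
        row_ink = (binary > 0).sum(axis=1)
        lines, start, empty_rows = [], None, 0
        for y, ink in enumerate(row_ink):
            if ink > 0:
                if start is None:
                    start = y
                empty_rows = 0
            elif start is not None:
                empty_rows += 1
                if empty_rows >= min_gap:
                    lines.append((start, y - empty_rows + 1))
                    start = None
        if start is not None:
            lines.append((start, len(row_ink)))
        return lines

Something like that would still fall over on warped or rotated input, though, which is why I'd look into unwarping first.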

Personally I don't think there's any way a hand-coded algorithm is going to beat the unwarping the LSTM is already implicitly performing with all the domain knowledge it has. Your current results are really good already.