meh / ruby-tesseract-ocr

A Ruby wrapper library to the tesseract-ocr API.

Strange behavior of Tesseract #55

Closed pienkowskip closed 8 years ago

pienkowskip commented 9 years ago

I suppose this is more of a Tesseract engine issue than an issue with this gem, but I hope you can help me explain what is happening. And it's funny. ;)

The problem is that when I reuse a Tesseract engine within a script, I get different text for the same image (especially the white part): 20151012T011003.png. The attached script walks through the sample files, a few of which are duplicates:

pablo@pablo-workstation:~/samples$ md5sum *
886bf62b449ddc301ec393703f65b665  201500000000000.png
f9819fb59eae9df920e61375d5b34390  20151010T122754.png
b71eba9b5e53e172dfadcd48a3b91943  20151010T123607.png
f9819fb59eae9df920e61375d5b34390  20151012T005143.png
886bf62b449ddc301ec393703f65b665  20151012T010407.png
baf8518efc384c2f0dfd5d7f8576a647  20151012T010817.png
886bf62b449ddc301ec393703f65b665  20151012T011003.png

At first, for the file 201500000000000.png, I get Hpisafi: Collect Call, which is close to the correct result. But later, for the same image under the filename 20151012T010407.png, I get Hpisafi: Dniiect Dali, and the same for the last one, 20151012T011003.png. If I reverse the order (uncomment line 12), things get more complicated, because then I get Hpisafi: Collect Dali for 201500000000000.png. Isn't it funny?

Of course, if I create a new instance of the Tesseract engine for each file, I get proper results: Hpisafi: Collect Call. And if I switch to Polish, the results are the same for the same image, but not as good :(

So do you have any idea how this can happen? Can you reproduce it on your machine? I thought that OCR results should be "repeatable".
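
One way to check repeatability across the duplicated samples is to group the files by MD5 digest and compare the text recognized for each copy. This is only a minimal sketch, assuming Ruby 2.4 or newer; the helper name `inconsistent_duplicates` is hypothetical, and the block stands in for whatever OCR call is used (for example `e.text_for(path).strip` from the script).

```ruby
require 'digest'

# Hypothetical helper: runs the given block (the OCR call) on every path in
# order, then returns the groups of byte-identical files whose text differs.
def inconsistent_duplicates(paths)
  # Recognize in the given order, since the order appears to matter here.
  texts = paths.map { |path| [path, yield(path)] }.to_h

  texts.group_by { |path, _text| Digest::MD5.file(path).hexdigest }
       .transform_values(&:to_h)
       .select { |_digest, group| group.values.uniq.size > 1 }
end
```

With the engine from the script, `inconsistent_duplicates(paths) { |f| e.text_for(f).strip }` would return a hash keyed by digest; an empty hash would mean every duplicate produced the same text.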

Attachments

Samples + script

https://drive.google.com/file/d/0BxJIH-bPcJwFZnJuaEZoaGEyVGc/view

Script
#!/usr/bin/env ruby

require 'tesseract'

e = Tesseract::Engine.new {|e|
  e.language  = :en
  e.blacklist = '|'
}

paths = Dir.glob('2015*.png')
paths.sort!
paths.reverse!
paths.each do |fname|
  puts 'file %s: %s' % [fname, e.text_for(fname).strip]
end

exit 0
Results over files in alphabetical order
file 201500000000000.png: Hpisafi: Collect Call
file 20151010T122754.png: Hpisafi: Diving Board
file 20151010T123607.png: Hpisafi: Birdie Putt
file 20151012T005143.png: Hpisafi: Diving Board
file 20151012T010407.png: Hpisafi: Dniiect Dali
file 20151012T010817.png: Hpisafi: Spring Training
file 20151012T011003.png: Hpisafi: Dniiect Dali
Results over files in reversed alphabetical order
file 20151012T011003.png: Hpisafi: Collect Call
file 20151012T010817.png: Hpisafi: Spring Training
file 20151012T010407.png: Hpisafi: Collect Call
file 20151012T005143.png: Hpisafi: Diving Board
file 20151010T123607.png: Hpisafi: Birdie Putt
file 20151010T122754.png: Hpisafi: Diving Board
file 201500000000000.png: Hpisafi: Collect Dali
meh commented 9 years ago

Want to have even more fun? Shuffle the paths.

From empirical evidence it's Diving Board that breaks the subsequent extractions.

meh commented 9 years ago

Also to "fix" the problem and degrade the performance into nothingness, you can create an engine for every OCR.
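
A minimal sketch of that workaround, for illustration only: the helper name `text_with_fresh_engines` is hypothetical, and the block is expected to return a new engine on every call. It trades speed for isolation, since each file pays the engine start-up cost.

```ruby
# Hypothetical helper: calls the block once per file to build a fresh engine,
# so no state is carried over from one recognition to the next.
def text_with_fresh_engines(paths)
  paths.map do |path|
    engine = yield # e.g. Tesseract::Engine.new { |e| e.language = :en }
    [path, engine.text_for(path).strip]
  end.to_h
end

# Usage with the gem (requires 'tesseract' to be installed):
#   require 'tesseract'
#   results = text_with_fresh_engines(Dir.glob('2015*.png').sort) do
#     Tesseract::Engine.new { |e| e.language = :en; e.blacklist = '|' }
#   end
```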

pienkowskip commented 8 years ago

Ok, this seems funny & unfixable. Closing issue.