Strange behavior of Tesseract

pienkowskip commented 9 years ago

I suppose this is an issue with the Tesseract engine itself rather than with this gem, but I hope you can help me explain what is happening. And it's funny. ;)

The problem is that when I reuse a Tesseract engine within a script, I get different text for this image (especially the white part): 20151012T011003.png. The attached script walks through the sample files, a few of which are duplicates:

pablo@pablo-workstation:~/samples$ md5sum *
886bf62b449ddc301ec393703f65b665  201500000000000.png
f9819fb59eae9df920e61375d5b34390  20151010T122754.png
b71eba9b5e53e172dfadcd48a3b91943  20151010T123607.png
f9819fb59eae9df920e61375d5b34390  20151012T005143.png
886bf62b449ddc301ec393703f65b665  20151012T010407.png
baf8518efc384c2f0dfd5d7f8576a647  20151012T010817.png
886bf62b449ddc301ec393703f65b665  20151012T011003.png

At the beginning, for the file 201500000000000.png, I get Hpisaﬁ: Collect Call, which is close to the correct result. But later on, for the identical image under the filename 20151012T010407.png, I get Hpisaﬁ: Dniiect Dali, and the same for the last one, 20151012T011003.png. If I reverse the order (uncomment line 12 of the script, paths.reverse!), things get more complicated, because then I get Hpisaﬁ: Collect Dali for 201500000000000.png. Isn't it funny?

Of course, if I create a new instance of the Tesseract engine for each file, I get the proper result: Hpisaﬁ: Collect Call. And if I switch to Polish, the results are the same for the same image, but not so good :(
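The per-file workaround can be sketched roughly as follows. This is only a minimal sketch, not part of the gem: ocr_with_fresh_engines is a hypothetical helper name, and the engine is built by a caller-supplied block, so the only gem calls assumed are the ones already used in the script (Tesseract::Engine.new and text_for).

```ruby
# Hypothetical helper (not part of the gem): build a brand-new engine for
# every file so that no state can carry over from one recognition to the next.
# The block must return an object that responds to text_for(path).
def ocr_with_fresh_engines(paths)
  paths.sort.each_with_object({}) do |fname, results|
    engine = yield                                  # fresh engine for this file only
    results[fname] = engine.text_for(fname).strip
  end
end

# Usage with the gem, mirroring the configuration from the script:
#
#   require 'tesseract'
#   results = ocr_with_fresh_engines(Dir.glob('2015*.png')) do
#     Tesseract::Engine.new { |e| e.language = :en; e.blacklist = '|' }
#   end
#   results.each { |fname, text| puts 'file %s: %s' % [fname, text] }
```

The cost is one engine initialization per image, which is likely to be noticeably slower than reusing a single engine.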

So do you have any idea how this can happen? Can you reproduce it on your machine? I thought that OCR results should be "repeatable".

Attachments

Samples + script

https://drive.google.com/file/d/0BxJIH-bPcJwFZnJuaEZoaGEyVGc/view

Script

#!/usr/bin/env ruby

require 'tesseract'

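e = Tesseract::Engine.new { |engine|   # one engine, reused for every file
  engine.language  = :en
  engine.blacklist = '|'               # exclude the pipe character from results
}

paths = Dir.glob('2015*.png')
paths.sort!
paths.reverse!                         # line 12: comment out for alphabetical order
paths.each do |fname|
  puts 'file %s: %s' % [fname, e.text_for(fname).strip]
end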

exit 0

Results over files in alphabetical order

file 201500000000000.png: Hpisaﬁ: Collect Call
file 20151010T122754.png: Hpisaﬁ: Diving Board
file 20151010T123607.png: Hpisaﬁ: Birdie Putt
file 20151012T005143.png: Hpisaﬁ: Diving Board
file 20151012T010407.png: Hpisaﬁ: Dniiect Dali
file 20151012T010817.png: Hpisaﬁ: Spring Training
file 20151012T011003.png: Hpisaﬁ: Dniiect Dali

Results over files in reversed alphabetical order

file 20151012T011003.png: Hpisaﬁ: Collect Call
file 20151012T010817.png: Hpisaﬁ: Spring Training
file 20151012T010407.png: Hpisaﬁ: Collect Call
file 20151012T005143.png: Hpisaﬁ: Diving Board
file 20151010T123607.png: Hpisaﬁ: Birdie Putt
file 20151010T122754.png: Hpisaﬁ: Diving Board
file 201500000000000.png: Hpisaﬁ: Collect Dali
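Spotting the inconsistency in listings like these by eye gets tedious, so the check could be automated along these lines. This is a hedged sketch: inconsistent_duplicates is a hypothetical helper name, and it assumes the recognized text has already been collected into a hash of path => text (the script above only prints it). It uses Ruby's standard Digest::MD5 to reproduce the md5sum grouping.

```ruby
require 'digest/md5'

# Hypothetical helper: given { path => recognized_text }, group the paths by
# the MD5 digest of their file contents and keep only the groups in which
# byte-identical files produced different text.
# Returns { md5_hex => [paths...] }.
def inconsistent_duplicates(results)
  results.keys
         .group_by { |path| Digest::MD5.file(path).hexdigest }
         .select { |_digest, paths| paths.map { |p| results[p] }.uniq.size > 1 }
end

# Usage, assuming `results` was filled while looping over the files:
#
#   inconsistent_duplicates(results).each do |digest, paths|
#     puts digest
#     paths.each { |p| puts '  %s: %s' % [p, results[p]] }
#   end
```

An empty hash means every set of identical files produced identical text, which is what a "repeatable" OCR run would be expected to give.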

meh commented 9 years ago

Want to have even more fun? Shuffle the paths.

From empirical evidence it's Diving Board that breaks the subsequent extractions.
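One way to turn that empirical observation into a repeatable check is sketched below. It is an assumption-laden sketch, not something from the gem: files_that_disturb is a hypothetical helper name, and the engine is supplied by a block so that only text_for is assumed. It tests single predecessors only, so an effect that needs several files processed beforehand would go unnoticed.

```ruby
# Hypothetical helper: find the files that change the OCR result of `target`
# when they are processed first on the same engine.
# The block must return a new engine that responds to text_for(path).
def files_that_disturb(target, candidates)
  baseline = yield.text_for(target).strip    # result on an untouched engine
  candidates.select do |candidate|
    engine = yield                           # fresh engine per experiment
    engine.text_for(candidate)               # process the suspect file first
    engine.text_for(target).strip != baseline
  end
end

# Usage with the gem, mirroring the configuration from the script:
#
#   require 'tesseract'
#   suspects = Dir.glob('2015*.png') - ['201500000000000.png']
#   culprits = files_that_disturb('201500000000000.png', suspects) do
#     Tesseract::Engine.new { |e| e.language = :en; e.blacklist = '|' }
#   end
#   puts culprits
```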

meh commented 9 years ago

Also, to "fix" the problem (and degrade the performance into nothingness), you can create a new engine for every OCR run.

pienkowskip commented 8 years ago

Ok, this seems funny & unfixable. Closing issue.

meh / ruby-tesseract-ocr