Closed tiramiseb closed 11 years ago
Hm, one way to fix that would be to go back to only 2 orientation tries, but I don't like it. Being able to put the sheet in any orientation is just too handy.
However, I see 2 things I can do to fix that, or at least mitigate the problem:
Implement #89 (page rotation on demand)
It would help. A little.
Use a spell checker like python-enchant to figure out which orientation gives the highest number of real words.
I give a big "YAY" to this one
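A minimal sketch of the spell-checker idea, not Paperwork's actual code: run OCR on each candidate orientation, count how many tokens a dictionary recognizes, and keep the orientation with the most hits. The names `count_real_words` and `best_orientation` and the tiny `KNOWN_WORDS` set are hypothetical stand-ins; with python-enchant installed, something like `enchant.Dict("fr_FR").check` could be passed as `is_word` instead.

```python
import re

# Hypothetical stand-in dictionary; a real spell checker would replace it.
KNOWN_WORDS = {"nous", "vous", "souhaitons", "nos", "sincères", "salutations"}


def is_known(word):
    """Return True if the word is in the stand-in dictionary."""
    return word.lower() in KNOWN_WORDS


def count_real_words(text, is_word=is_known):
    """Count tokens of 2+ letters that the dictionary accepts."""
    tokens = re.findall(r"[^\W\d_]{2,}", text)
    return sum(1 for token in tokens if is_word(token))


def best_orientation(ocr_texts, is_word=is_known):
    """ocr_texts maps a rotation angle (degrees) to the OCR text for it."""
    return max(
        ocr_texts,
        key=lambda angle: count_real_words(ocr_texts[angle], is_word),
    )


# Made-up OCR output per angle, loosely based on the logs in this thread.
ocr_texts = {
    0: "Nous vous souhaitons nos sincères salutations",
    90: "zozm wcora ZOZŒ",
    180: "snoA snou nbne ouop",
    270: "zocm lntra",
}
print(best_orientation(ocr_texts))
```

Counting only accepted words keeps the sketch simple, but as the logs further down show, garbage tokens that happen to be close to real words can still skew the result.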
I've implemented page orientation guessing using a spell checker (python-enchant). There are 2 new dependencies:
Can you give it a try and tell me if it works fine for you?
Done in: 5030956802dd0b44bdc62a88d232006c35a29cb3 f106a58e4491bab794082564a551351e26cd76e2 25dd384ca0103ba4782584ad5590a4a864f59747
New tickets:
Paperwork is working, but the result seems to be worse...
The wrong orientations get dangerously high scores:
Spell checking: Replacing: zozm -> zozo
Spell checking: Replacing: wcora -> écora
Spell checking: Replacing: ZOZŒ -> ZOZO
Page orientation score: 77
Spell checking: Replacing: HEBHOS -> HEBDOS
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: SEIT -> SET
Page orientation score: 87
Spell checking: Replacing: zocm -> zoom
Spell checking: Replacing: zocm -> zoom
Page orientation score: 81
Spell checking: Replacing: lntra -> entra
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: BORNERT -> BORNER
Spell checking: Replacing: ensouhaitons -> en souhaitons
Spell checking: Replacing: sincèces -> sincères
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: STRASËOURG -> STRASBOURG
Page orientation score: 103
Here, the correct orientation has the best score.
Spell checking: Replacing: ïoom -> boom
Page orientation score: 137
Spell checking: Replacing: semes -> semés
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: Buea -> Busa
Spell checking: Replacing: aæxa -> axa
Spell checking: Replacing: ason -> son
Spell checking: Replacing: LUON -> MUON
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: nbne -> none
Spell checking: Replacing: LuLu -> Lu Lu
Spell checking: Replacing: ouop -> ou op
Spell checking: Replacing: nsse -> nasse
Spell checking: Replacing: sueq -> sues
Page orientation score: 297
Spell checking: Replacing: mxen -> mien
Page orientation score: 214
Spell checking: Replacing: BORNERT -> BORNER
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: enode -> encode
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: sonfé -> sondé
Spell checking: Replacing: RADL -> RADA
Spell checking: Replacing: HOSP -> HOP
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: Pharma -> Charma
Page orientation score: 276
Here, the upside-down orientation gets a better score. I've tried scanning this document upside down, and the "correct orientation" score is still lower than that of another orientation (this time, the "90°" orientation has a better score!). Of course, this last document has no vertical or upside-down text. I can't scan this document (which I'm sure would have been detected correctly without spell checking, because it is a postal letter with plenty of text).
I haven't looked at how you've implemented this, but I suggest (if it is not already done, and if it's feasible) that correct words (which don't need to be corrected by the spell-checker) increase the score and words that have been corrected do not modify the score or decrease it...
Oh, and OCR is back to using only one process...
I also suggest that the score of pages with many scrambled words be decreased...
Hm, are you using Tesseract or Cuneiform for OCR? (If both are installed, Paperwork will go for Tesseract.)
Also, as suggested, I've made changes so that misspelled words reduce the overall score of the page: 354f8a37535738df6f122f51cb8c6cc2659a240f Can you tell me if it improves the results?
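The suggestion above can be sketched as follows. This is only an illustration, assuming a fixed reward per recognized word and a fixed penalty per unrecognized one; the weights, the function name `orientation_score`, and the stand-in word list are hypothetical and not taken from the commit.

```python
import re

# Hypothetical stand-in for a spell checker's dictionary.
KNOWN_WORDS = {"important", "dépenses", "complément", "char"}


def is_known(word):
    """Return True if the word is in the stand-in dictionary."""
    return word.lower() in KNOWN_WORDS


def orientation_score(text, is_word=is_known, reward=10, penalty=5):
    """Add `reward` per recognized word, subtract `penalty` per unknown one."""
    score = 0
    for token in re.findall(r"[^\W\d_]{2,}", text):
        score += reward if is_word(token) else -penalty
    return score


# Made-up OCR output, loosely based on the logs in this thread.
upright = "Important dépenses Complément char"
upside_down = "snoA snou Anod nbne ouop sueg"
print(orientation_score(upright), orientation_score(upside_down))
```

With a penalty in place, an orientation that yields many scrambled tokens no longer benefits from the few of them that happen to resemble real words.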
I'm using Tesseract.
It seems better now. With the same document:
Spell checking: Replacing: azma -> aima
Spell checking: Replacing: osäs -> osas
Spell checking: Replacing: msoc -> soc
Spell checking: Replacing: mrmo -> mémo
Spell checking: Replacing: mcmo -> mémo
Page orientation score: 130
Spell checking: Replacing: gnues -> nues
Spell checking: Replacing: sajnas -> saunas
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: sJan -> s Jan
Spell checking: Replacing: sJan -> s Jan
Spell checking: Replacing: SUOO -> SUMO
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: Anod -> Anode
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: Anou -> Anjou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: Anod -> Anode
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: nbne -> none
Spell checking: Replacing: ouop -> ou op
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: Anod -> Anode
Spell checking: Replacing: sueg -> sue
Page orientation score: 188
[ no "spell checking" on this orientation this time ]
Page orientation score: 69
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: Schiltigheîm -> Schiltigheim
Spell checking: Replacing: BORNERT -> BORNER
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: eriode -> triode
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: Complement -> Complément
Spell checking: Replacing: RADL -> RADA
Spell checking: Replacing: HOSP -> HOP
Spell checking: Replacing: EXTE -> ESTE
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: chär -> char
Spell checking: Replacing: ëresses -> dresses
Spell checking: Replacing: ImpŸrtant -> Important
Spell checking: Replacing: depenses -> dépenses
Spell checking: Replacing: Pharma -> Charma
Page orientation score: 255
Ok. Since adding options to rotate the image is another issue ( #89 ), I will now close this one. Please reopen if you still have problems.
Actually, I just got an idea to improve orientation detection. I've changed the way scores are computed. Here is the new way:
I did some tests on some easy documents and some much harder ones, and it seems to give really good results. Can you give it a try and tell me if it works well for you as well?
Done in 24d3536a12d3e2102c840bbff2c1fe6f3464bc5c
It seems better with "normal" documents (the "correct orientation" score is far higher than other orientations - something like 19000 vs 200).
However, I get strange reactions on some pages, notably on title pages of multi-page documents where there are only 2 or 3 words, sometimes written with a fancy font. But I think problems with this specific type of page will only be solved by the ability to change document orientation manually... or by implementing an AI :-)
Rotating the page is a different issue ( #89 ). Since I don't think I'll be able to do a better heuristic, I'm going to close this issue. Please reopen if I forgot something.
Since the fix for #95 (more than 2 scan angles), documents are sometimes "detected" upside down.
Here is an example:
When a document is put in the scanner in the correct orientation:
The funny thing is that, if I put the same document upside down in the scanner, the "upside down" (i.e. correct orientation in real life) data has a better score:
(I've replaced some numbers, addresses, names, etc. with "XXX" in order to keep private stuff private :-) If you see some "XXXXXX", that's me, not a misdetection by the OCR.)
Sometimes, when I see a document is wrongly detected upside down, I just erase it and rescan it upside down. But it's really annoying.