openpaperwork / paperwork

Personal document manager (Linux/Windows) -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/paperwork

OCR incorrectly detects upside down scan as having a better score #101

Closed tiramiseb closed 11 years ago

tiramiseb commented 11 years ago

Since the correction of #95 (more than 2 scan angles), sometimes documents are "detected" upside-down.

Here is an example:

When a document is put in the scanner in the correct orientation:

JfBJnrflnlJñë-SSJQIUBQTMMM
Z H76 EIdV

‘7L000 LVOOLODSL 19J!S
anqnd wawassn e

Souväsz ‘909 sa Ilos “f 6179 1mm‘ 0073 EQVÔNVHäËMRWËŒ°N
3 €8‘86S s . sir 1300 9sww°PuI 51733179 ï “qmd”! “N
sq 0g“) sgnbav goïoo ‘en 0L L9 es L0 ï x23 1H18

HËIÀVJ V LEIN SLIOHG

SIUGUELUJSd BIBMUJBSSV

yepos aôçgs

suopnuasqo

ŒVISVSOJWI JÆIN
EIEIÀVJ V LEIN

(‘F95 1a .5) SJÆIN S210 "IVLOL
À 15m LNVLNOW

D8 + SEIIINCILEIH "IVLOJ.

uopespog
SCŒD JEDIV/

9€‘17Z0 E QIQHÔÛPG’? “ou 9S3
176‘Z€6 E

(3)1) axgmueçfcîtuo 9113 â
2102/80/19 - ZIOZ/SO/IO IOIdWELI E m0193 ‘W 913W

ZIOZ/LO/IE - ZIOZ/LO/IO . ISIIIHFHHV Iêddw
z1oz/9o/os - z1oz/9o/z1 umf 311v Iêddvl

ZIOZ/60/OIî LNŒIWŒHDŒRI zuva CHHHIOWO NVHI/DIH

1s 690110 2L 8o v8 z= ss oN IOZOVWOOO À 91H

ZI EIZ 7 19550!’ 9P oN ZIOZ 1U°V ï NOILVSINWEIGNLG EIGOIHŒIJ

 LNŒIWŒIIVJ EICI SIAV

LVLSHÎHS O09L9
LHHNHOH EIHH 17K

NOÉIVW HIAVÎH 911W

EIONVEH
EIHHJÎIÛOIHDVÆ] SBHSWVHO

SEIHIOLIHHEIL?
SEIHÛLÎHOIHQE

V

VDcIV

---
Got score of 15

---
.c.m._:::u_._mm-mm._nEmnu.>>>>>>
N ËË ma?

:95 S0302: 52m
52a ËmEmmm_ .m

oo✠umäuîœumäæuæîwäêoz

mïvo m mïäo m u_nsnonê_ «al
monmoo me E B æ S W Ê 2cm

En 3c 2.38m
mi. Œoo wmEEoeE
m: omwo 2:34.

MH><A < HHZ WËOŒQ

Ëaäa 6% mm :8.
w mæäm n

Ëmb mmLgEmco mmu
ÆcmcmELmm mEnEmmm<

_œ_uom cmæm

Ëosnîomno

määoæë Ëz

«M55. < E2
c; s a mËz wË 450e
Î Ëz s24e20:

0M + mHDZHHHÆ AËHOH

n59?‘ \ casæsao

wûMU
sessän son omo
oœnm _ assgaæov oœo

Il a 5.52.:

mm5<zomæ<m
mZOËÀwMHOU

t1 A05 Ëâaoaflmieag oämbwm‘
Nage: m . Nssæäo xäam; a 2.80m â 25..
Ëomëoäm . «S325 . po=sfimm< «âme
Soäoam . «:552 E3 m5.. woaae

HZËHZOË NDJÊ mœæm mnoEË mËmEq

Sosäsî HZHËHAUHM ËÊ E mäæozo zäîoä

a a3 :0 æ ä a à æ oz E Hoääoævoe ÎËËQ Ë.
N_ ÜN u hämmofi ou oZ NZVN pco< n ZOmH<wuZËHQZËQ HQOmMflm

è ËHEEË Ë m5,...

Ëämäœ comme
HŒŒZŒOŒ max â

zom<2 md><qm e52

äzË
mœ=sao_œu<_g œämëæo

mmϙcmmmpa
mwmafisflzœom

Î

<9?

---
Got score of 3

---
APCA

A

BGRICULTURES
&TERRITOIRES
CHAMBRES D'AGRICULTURE
FRANCE
XXXXXX XXXXX
XXXXXXXXXXX
XXXXXXXXXXXX
AVIS DE PAIEMENT 
PERIODE D'INDEMNISATION :Août 2012 No de dossier XXXXX
BIC/IBAN DATE REGLEMENT : XX/XX/XXXX

rappel ARE juin XX/XX/XXXX - XX/XX/XXXX
rappel ARE juillet ' XX/XX/XXXX - XX/XX/XXXX

Aide au Retour à l'Emploi XX/XX/XXXX - XX/XX/XXXX

***

traite complémentaire (RC)

PATRONALES CSG déductible l
CSG non déductible (a)

CRDS

/\
C"
sa

Cotisation / AREF

TOTAL RETENUES + RC

MONTANT NET ‘

TOTAL DES NETS (* et **)
NET A PAYER

NET IMPOSABLE

Observations

Siège social

Assemblée Permanente

DROITS NET A PAYER

Brut Fax  X X XX XXX,XX Acquis XXX Jrs
X XXX,XX indemnisé XXXX Jrs X XXX,XX €

Net imposable
soit XX XXX, XX flancs

NOmWËÆgiŒyÊFRANÇAISE XX ,00 Restant XXX Jrs

a Issement public
Siret 180070047 0001A

APE 9411 Z
www.chambres-agriculture.fr

---
Got score of 13

---
>23

X

momacñczmm

mqmmzäzmmœ

ozzsœmmœ osmæoczcmm
Ëzom

Ëzo mh><=m Ë>ŒOZ

Ë wcm œowzmwa
S25 mmrmmïä

>Sm un wänämza _.
EËQË cgzunzzæïäz räa 85 2o â noua: 5:5
w_o:w>2 0399B E ses xmnrmämza Logääs
Ëärrm «Œzovm wàm Ëcx ËOZHÈAH

32è 2cm E: 55.305 . äseäs
æôvï 2cm ËEQ _ ÊoSoS . ÊoSos
>50 m: W085 m. EWBES Êoœsos . w zoœsos
Wnîmzo ooävaäosnmwno 90V t;

Il à cs3

OOHËËËOZŒ
W>HWOZ>FHŒ

HOËËL WWCH A:

8o amasomza
nmo son amaäîo
96m

æxrx
UN
szx/

Communes \ >wmm

HOH>F WHHHZŒHŒ + N0

ËOZËVZH 25 ,
HOËËL 9mm 25m î a i; T:
2.3, > n52?

25 ESŒOM>WFH

Ocmîäszozm

mämm menu.
Ïmmäïmm nmwämamaflm
amm ozϊawmm .35

UWOËŒ 25 > Œ><HW

\ . ä✠.1. 05
m2: m; M m: 8 S a ow oombm ËÊ: ode ï
2a _a..ê._vs w aäxæ w Ëä Ëmnäium os: î u æææ m
Zoägmäwmämnmänamm œroo W835 En Ëm mon Nm me.» mœævmxna

m. mmmâmnfl ucczn
mi: 250363 08E

Eum É: N
E22.n:m3c_.mm-mm_._n:::_.m.:.

---
Got score of 3
Best: 15.000000 -> /home/sebastien/Documents perso/Papiers/20121122_1000_56/rotated.2.bmp

The funny thing is that if I put the same document upside down in the scanner, the "upside-down" (i.e. correctly oriented in real life) output gets a better score:

APCA

A

aGRICULTURES

&TERR|TO|RES

CHAMBRES D'AGR|CULTURE
FRANCE

XXXXXXXX XXXXX

XXXXXXXXXXXXX
XXXXXXXXXXXXXXX

AVIS DE PAIEMENT 

PERIODE D'INDEMNISATION : Août XXXX No de dossier ; XXXXX

BIC/IBAN CMCIFRPP DATE REGLEMENT :XX/XX/XXXX

LIBELLE PERIODE BASE TAUX MONTANT

rappel ARE juin XX/XX/XXXX - XX/XX/XXXX
rappel ARE juillet A XX/XX/XXXX — XX/XX/XXXX
Aide au Retour à 1'Emp1oi XX/XX/XXXX - XX/XX/XXXX
l Retraite complémentaire (RC) ***

TOTAL BRUT (

PATRONALES CSG déductible X XXX,XX
l)
CSG non déductible (a) X XXX,XX
CRDS (b) ***

Cotisation / AREF

***

***

TOTAL RETENUES + RC

MONTANT NET ‘
TOTAL DES NETS (* et **) (4)
NET A PAYER (3)+(4)
NET IMPOSABLE

X XXX,XX
X XXX,XX

Observations

Siège social
Assemblée Permanente
des Chambres d'aricu

Fax Q 1 5510 XXX,XX X XXX,XX Acquis XXXX Jrs
Net imposable X XXX,XX X XXX,XX Indemnisé XXXX Jrs X XXXXX €
Nomqæpfl/ËÉWËFRANÇAISE XX,XX Restant XXX Jrs son XX XXX’ 78fiamS

a issement public
Siret180070047 00014

APE 9411 Z
www.chambres-agriculture.fr

---
Got score of 22

---
. wâæa Àooäîmäosnmwna 90V

>æn>

X

mofizncücxmw
mamzmäoæmœ
Ëzmmmœ o_>mm_ocrämm
232cm
2E0 mîæsm ZËŒOZ
Ë wcm œowzmwa
38e mmwmmëä
W >Sm cm wämzmza 
wmwôcm UAZUHËZŒNwHmOZ n >93 85 2o â .522 n m: 5
wHoEwË  E 95m wmornîmza n :3652»

ËOZHÈAH

Ewmrrn æmwëcm

53eme: - woaâuoa.
ÊoSoS . Ëîsos
Êoœsos - Ëoœsos

32.3..

5mm

83.3 2mm Ë»:
æñvï 2cm ËEQ
>50 m: W085 w emääâ

GOHŒÏËOZM
w>awoz>rmm

omo amasêza
Cm0 ses amêsæa Ê
95m Ê

w oærä

***

Ooammao: \ Ewmm

***

***

HOH>E WHHHZCMŒ + W0

ËOZHËAfl 25 .
HOËËL E5 25m Q. o» ÏQ T3
25 > wËmw A813
2.5 ESWOWPWFH

u mafia
m Îœku

Ovuonäñoî

mñmm monË
>mmm3ïmm nmwîmamswm
a3 orwäcwmm næmînc

. .,. 9m % 
Ex m m: 8 S a à e58 a 08.8 >nn=ï ode Ha
2o. mäcomwîm w îœä w 93.3 5.33.13 oc“: Ëm w mwœüu m
Zoäîælæmämmzänzmm œroo .35... _ mè 5m me: Nm mou. wœääznæ

m. Œmmämâ ucczn
wïmfl zsouaoä 25:.

>_um Ê: N
Eäsrnzmäcnmmfimmwäczcnœä.

---
Got score of 0

---
35.0..JSJUTËMuWŒLDEŒÆUÇSËË
N Cg mn_<

:25 29:99: äEm
uznzn. «cwficwmm. .m

ooJœ mmäoîmmæbædæîwäsoz

æàë m æàë m ozäcâä aoz
moäoo me S B 8 S m Ë

Ê. 9è 2 E381
m: _œoo ËËEËE
m: cm8 2=3<

950m9

Ëäxmu 6% R :8
w nϾam m

M555 < HHZ

sucâb mïnEmco mœv
œëœcmctmm ÆÆQEœmm/x

item «me»...

Ëorœîouno

mäâoäâ Ëz

MËÊÆ < Èz
o; 3 a mæmz ma: A<HOH
_ Ëz ezîzoz

ntæe m
mϊm m
ce...

U1 + mmazæeæm AËHOH

n52 \ sêæsao

œomo
sâosvmc 5.. omo
ozânnæu omo âäzoœîæ

n: HDMŒ 4490H mZOCZwuHOD

1*.  .  ‘oämumoïæfimflou uxmhom _
Nsmæozm . 22522:0 xzaäm; a E332 â “a2
20285 . «S22E22 2 8:2; mp2 mage

2o2osom - 2o2oo2 sa 52 3&8

HZËHZOË Ëqmmä

msmaoaî æzmämqomz ËÊ H Èäôäo 25:55
58 :0 23g 2 mfz E zävmomoeoîäägmæ

22m u .32... ë oz 2cm xa< u zoäkæzämnzrn mmoËË

ï

azmzäfi. mû Œ>< W

Ëemmäm 08S
HŒŒZŒOŒ ŒDM ä

ZOm<Ë m5252 2:2

äzäm
äääfizœwsa Ëœëäo

mmœätœœmha
mmœîäflgœom

Y

<om<

---
Got score of 1

---
JfôJHInDDQJÜE-SBJQUJBHTMMM
Z L176 ÉldV

7m00 LVOÜLOÜBL 13J!S
anqnd wawassx e

0618 aswônvuagæmlwgmwm

{V8179 E {V8179 E °l‘l”5°d“’! 3°N
go‘g00 a0 0L L9 es Lu f X93

s11‘ 5179 À 1081598
S11‘ [800 çsguwapu]
Sïf OSLO Slnbûv

SLIOHG

0900451 ‘909 sz nos
a €8‘86S s

HEIÀVcI V JÆIN

e1uaueuJJed SISMUJBSSV
papas añçgs

suopeuasqo

znavsoawl mm

zmwa v MIN
(#55 1a I5) SJJHN S210 "W101
v un mvmnow

OH + SEIIINŒILEIH 1V10l.

HEDIV / n°939103

sans
(e) 9IQII°T1P9P uou 9S0
“qmnpçp 9S3 SEITVNOHLVJ

(1) 1H88 1V.LO.I. SNOILVSILOD

/\
.0
æz

50500 v --

H;  1199101119 ‘ U

ZIOZ/SO/IE - 2102/80/10 10141111310 m0198 r10 apw
ZIOZ/LO/IE - ZIOZ/LO/IO 1 1911m! Emv Iêddvl
z1oz/9o/os - 2102/90/21 umf 311v Iêddw

LNVLNOW XflVl. EISVH EICIOIHEIJ 511151811

ZIOZ/60/0Iï LNEIWEHDEIH zuva ddHHIOWO NVHI/DIH
19690110 9L 80178 z= ssoN 10500505000031“

ZISIZ ï M5500 0P oN ZIOZ 100V = NOILVSINWEIGNLCI EIGOIHŒIJ

 LNCËIWEIVJ HG SIAV ‘

LVLSËHËIS 0091.9
LHEINÈIOEI HHH 172

NOEIVW EIIAV1:I 911W

BÛNVHA
EIHHIIHOIHÛWCI SEIHEIWVHO

SEIEIIOLIHHBJË
SHHHJÏIÜOIHOË

V

VOcW

---
Got score of 12
Best: 22.000000 -> /home/sebastien/Documents perso/Papiers/20121122_1008_26/rotated.2.bmp

(I've replaced some numbers, addresses, names, etc. with "XXX" to keep private stuff private :-) If you see "XXXXXX", that's me, not a misdetection by the OCR.)

Sometimes, when I see a document is wrongly detected upside down, I just erase it and rescan it upside down. But it's really annoying.

jflesch commented 11 years ago

Hm, one way to fix that would be to go back to only 2 orientation tries, but I don't like it. Being able to put the sheet in any orientation is just too handy.

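However, I see 2 things I can do to fix this problem, or at least mitigate it:

1. Implement #89 (page rotation on demand)

2. Use a spell checker like python-enchant to figure out which orientation gives the highest number of real words.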

tiramiseb commented 11 years ago

Implement #89 (page rotation on demand)

It would help. A little.

Use a spell checker like python-enchant to figure out which orientation gives the highest number of real words.

I give a big "YAY" to this one
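
The spell-checker idea can be sketched roughly as follows. This is only an illustration, not Paperwork's actual code: `KNOWN_WORDS` is a tiny stand-in for a real dictionary (with python-enchant one would query a dictionary object instead), `count_real_words` and `best_orientation` are hypothetical names, and the angle labels and sample strings are illustrative snippets in the style of the OCR output above.

```python
import re

# Tiny stand-in vocabulary (assumption); a real implementation would ask a
# spell checker such as python-enchant whether each word exists.
KNOWN_WORDS = {"avis", "de", "paiement", "net", "a", "payer", "total", "observations"}


def count_real_words(text, known_words=KNOWN_WORDS):
    """Count the tokens of the OCR output that are found in the dictionary."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for token in tokens if token in known_words)


def best_orientation(ocr_by_angle):
    """Return the angle whose OCR text contains the most dictionary words."""
    return max(ocr_by_angle, key=lambda angle: count_real_words(ocr_by_angle[angle]))


# Hypothetical OCR results for two orientations of the same page.
ocr_by_angle = {
    0: "AVIS DE PAIEMENT\nNET A PAYER\nObservations",
    180: "LNVLNOW XflVl. EISVH\nsuopeuasqo",
}
print(best_orientation(ocr_by_angle))
```

Counting only words the dictionary already accepts means garbage from a wrong orientation contributes nothing to the score, however much text the OCR engine produces.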

jflesch commented 11 years ago

I've implemented page orientation guessing using a spell checker (python-enchant). There are 2 new dependencies:

Can you give it a try and tell me if it works fine for you?

Done in: 5030956802dd0b44bdc62a88d232006c35a29cb3 f106a58e4491bab794082564a551351e26cd76e2 25dd384ca0103ba4782584ad5590a4a864f59747

New tickets: #105, #106

tiramiseb commented 11 years ago

Paperwork is working, but the result seems to be worse...

The wrong orientations get dangerously high scores:

Spell checking: Replacing: zozm -> zozo
Spell checking: Replacing: wcora -> écora
Spell checking: Replacing: ZOZŒ -> ZOZO
Page orientation score: 77

Spell checking: Replacing: HEBHOS -> HEBDOS
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: SEIT -> SET
Page orientation score: 87

Spell checking: Replacing: zocm -> zoom
Spell checking: Replacing: zocm -> zoom
Page orientation score: 81

Spell checking: Replacing: lntra -> entra
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: BORNERT -> BORNER
Spell checking: Replacing: ensouhaitons -> en souhaitons
Spell checking: Replacing: sincèces -> sincères
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: STRASËOURG -> STRASBOURG
Page orientation score: 103

Here, the correct orientation has the better score.

Spell checking: Replacing: ïoom -> boom
Page orientation score: 137

Spell checking: Replacing: semes -> semés
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: Buea -> Busa
Spell checking: Replacing: aæxa -> axa
Spell checking: Replacing: ason -> son
Spell checking: Replacing: LUON -> MUON
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: nbne -> none
Spell checking: Replacing: LuLu -> Lu Lu
Spell checking: Replacing: ouop -> ou op
Spell checking: Replacing: nsse -> nasse
Spell checking: Replacing: sueq -> sues
Page orientation score: 297

Spell checking: Replacing: mxen -> mien
Page orientation score: 214

Spell checking: Replacing: BORNERT -> BORNER
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: enode -> encode
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: sonfé -> sondé
Spell checking: Replacing: RADL -> RADA
Spell checking: Replacing: HOSP -> HOP
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: Pharma -> Charma
Page orientation score: 276

Here, the upside-down orientation gets a better score. I've tried scanning this document upside down, and the "correct orientation" score is still lower than that of another orientation (this time, the "90°" orientation has a better score!). Of course, this last document has neither vertical nor upside-down text. I can't scan this document (which I'm sure would have been detected correctly without spell checking, because it is a postal letter with plenty of text).

I haven't looked at how you've implemented this, but I suggest (if it is not already done, and if it's feasible) that correct words (which don't need to be corrected by the spell-checker) increase the score and words that have been corrected do not modify the score or decrease it...
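
A minimal sketch of this suggestion, under stated assumptions: `score_page`, the `reward` and `penalty` weights, and the small vocabulary are all hypothetical, and a word is treated as "needing correction" simply when it is absent from the vocabulary. It is not how Paperwork computes its score.

```python
import re

# Stand-in vocabulary (assumption); a real spell checker would be used instead.
KNOWN_WORDS = {"nous", "vous", "en", "souhaitons", "sincères", "salutations"}


def score_page(text, known_words=KNOWN_WORDS, reward=1, penalty=1):
    """Reward words the dictionary accepts as-is; penalize all other words."""
    score = 0
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        if token in known_words:
            score += reward   # correct word: raises the score
        else:
            score -= penalty  # word needing correction: lowers the score
    return score


good = "Nous vous souhaitons sincères salutations"
garbled = "snoA snoA snou ouop nbne"  # tokens taken from the log above
print(score_page(good), score_page(garbled))
```

With `penalty=0`, corrected words leave the score unchanged; with a positive penalty, an orientation that yields many correctable-looking fragments such as "snoA" is pushed below one that yields fewer but genuine words.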

Oh, and OCR is back to using only one process...

tiramiseb commented 11 years ago

I also suggest that the score of pages with many scrambled words be decreased...

jflesch commented 11 years ago

Hm, are you using Tesseract or Cuneiform for OCR? (If both are installed, Paperwork will go for Tesseract.)

Also, as suggested, I've made changes so that misspelled words reduce the overall score of the page (354f8a37535738df6f122f51cb8c6cc2659a240f). Can you tell me if it improves the results?

tiramiseb commented 11 years ago

I'm using Tesseract.

It seems better now. With the same document:

Spell checking: Replacing: azma -> aima
Spell checking: Replacing: osäs -> osas
Spell checking: Replacing: msoc -> soc
Spell checking: Replacing: mrmo -> mémo
Spell checking: Replacing: mcmo -> mémo
Page orientation score: 130

Spell checking: Replacing: gnues -> nues
Spell checking: Replacing: sajnas -> saunas
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: æoot -> foot
Spell checking: Replacing: sJan -> s Jan
Spell checking: Replacing: sJan -> s Jan
Spell checking: Replacing: SUOO -> SUMO
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snou -> sou
Spell checking: Replacing: Anod -> Anode
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: Anou -> Anjou
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: Anod -> Anode
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: nbne -> none
Spell checking: Replacing: ouop -> ou op
Spell checking: Replacing: snoA -> snob
Spell checking: Replacing: Anod -> Anode
Spell checking: Replacing: sueg -> sue
Page orientation score: 188

[ no "spell checking" for this orientation this time ]
Page orientation score: 69

Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: Schiltigheîm -> Schiltigheim
Spell checking: Replacing: BORNERT -> BORNER
Spell checking: Replacing: Groupama -> Groupa ma
Spell checking: Replacing: eriode -> triode
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: Complement -> Complément
Spell checking: Replacing: RADL -> RADA
Spell checking: Replacing: HOSP -> HOP
Spell checking: Replacing: EXTE -> ESTE
Spell checking: Replacing: SEBASTIEN -> SÉBASTIEN
Spell checking: Replacing: chär -> char
Spell checking: Replacing: ëresses -> dresses
Spell checking: Replacing: ImpŸrtant -> Important
Spell checking: Replacing: depenses -> dépenses
Spell checking: Replacing: Pharma -> Charma
Page orientation score: 255

jflesch commented 11 years ago

OK. Since adding options to rotate the image is another issue (#89), I will now close this one. Please reopen if you still have problems.

jflesch commented 11 years ago

Actually, I just got an idea to improve orientation detection. I've changed the way scores are computed. Here is the new way:

I ran some tests on some easy documents and some much harder ones, and it seems to give really good results. Can you give it a try and tell me if it works well for you too?

Done in 24d3536a12d3e2102c840bbff2c1fe6f3464bc5c

tiramiseb commented 11 years ago

It seems better with "normal" documents (the "correct orientation" score is far higher than other orientations - something like 19000 vs 200).

However, I get strange reactions on some pages, notably on title pages of multi-page documents where there are only 2 or 3 words, sometimes written with a fancy font. But I think problems with this specific type of page will only be solved by the ability to change document orientation manually... or by implementing an AI :-)

jflesch commented 11 years ago

Rotating the page is a different issue (#89). Since I don't think I'll be able to come up with a better heuristic, I'm going to close this issue. Please reopen if I forgot something.