Many non-words in OCR results in work.idx_keyword

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Get a more or less random set of works, e.g. based on similarity (see
the query below).
2. Check their idx_keywords.
3. Notice that many of the most frequent keywords are non-words. This could
be correct for some works, but the ratio of words to non-words seems
quite small:

vvv 17 
iii 16 
777 15 
vvvv 15 
7777  13 
vvvvv 13 
444 9 
aaa 9 
aaaa 9
all 9

What is the expected output? What do you see instead?
I would expect words. Instead I see many sequences of letters and many
sequences of digits, see above.

Of course the results of OCR will always be very dependent on the
quality of the original works. Still, at least some filtering to remove
non-words could improve the value of OCR for GAMA, either in the DB
adapters or by other ways of enhancing OCR in the OCR indexer. (Viliam, can
you add a new component category, Component Indexers or something like
that, so that issues can also be assigned to Indexing?)
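
The kind of filtering suggested above could look roughly like the sketch below. It is only an illustration on the sample keywords from this report, not the filter used in GAMA; the function names and the minimum length of 3 are assumptions.

```python
import re

# Hypothetical heuristic: a token made of one character repeated ("vvv", "7777").
_REPEATED_CHAR = re.compile(r"^(.)\1+$")

def looks_like_non_word(token: str) -> bool:
    """Return True for tokens that are very unlikely to be real words."""
    token = token.strip().lower()
    if len(token) < 3:          # assumed minimum keyword length
        return True
    if token.isdigit():         # pure digit sequences such as "777"
        return True
    if _REPEATED_CHAR.match(token):
        return True
    return False

def filter_keywords(tokens):
    """Keep only the tokens that pass the heuristic."""
    return [t for t in tokens if not looks_like_non_word(t)]

sample = ["vvv", "iii", "777", "vvvv", "7777", "vvvvv", "444", "aaa", "aaaa", "all"]
print(filter_keywords(sample))  # ['all']
```

Note that such a rule also drops tokens like "iii" that may be legitimate Roman numerals in some works, so it trades a little recall for much less noise.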

Please use labels and text to provide additional information.

-----------
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX gama: <http://gama-gateway.eu/schema/>
SELECT
  ?work_uri ?bestmatch
WHERE {
  ?manif_uri a gama:Manifestation.
  ?similar_manif a gama:Manifestation.
  ?work_uri rdf:type gama:Work;
            gama:has_manifestation ?similar_manif.
  FILTER gama:similar_media( ?manif_uri, 8, ?similar_manif, ?weight, ?bestmatch).
  FILTER (?manif_uri = <http://www.gama-gateway.eu/montevideo/main/Manifestation_3367> ).
}
GROUP BY ?work_uri
ORDER BY DESC(?bestmatch)
LIMIT 20

Original issue reported on code.google.com by toon...@gmail.com on 27 Jan 2009 at 3:10

GoogleCodeExporter commented 9 years ago

Added a new component to the issue tracker: Component-Indexing

Original comment by viliam.s...@gmail.com on 28 Jan 2009 at 7:53

Added labels: Component-Indexing

GoogleCodeExporter commented 9 years ago

IDX keywords are still very polluted. What is the status of the work on the IDX
keywords? Is this the final version, or are there improvements planned?
Judging by feeling (not very objective, I know) I even get the impression that
it has gotten worse.

gama:montevideo:main:Work:733:
"SIVDID", "VVV", "KDM", "MVDISSMSDGY", "MAIMA", "DEDFDDMAMCE", "DMKDDAKVMF", 
"MADCH", 
"IDVISDADIISIEVAM", "MADIMA", "ADDAMDVI.", "IIII", "AMD", "IGID", "VVVVV", 
"VVVV", "GASIEDEM", "MVM", "III"

gama:argos:main:Manifestation:2671:
"AMDDIM", "KIMDS", "CDDDM", "MADIIM", "MADIAMME", "VDVK", "DDDDVCED", "AAA", 
"IEDEED", "CFD", 
"VEDSVS", "MADC", "CIMEMA", "DEDAIIICE", "VAIIDMS", "DDDDIICED", "IEDDV", 
"SACDE", "AVDEIIE", 
"IMDVCIIDM", "SAVADDGD", "DEMDII", "DASIIEM", "CIMEMAIDGDADHV", "GEMI", 
"DESIGM", "SDDS", 
"FDAMCAISE", "CAIDVSSE", "CDMMVMAVIE", "MAIIHIAS", "EDIC", "DDDDVCIIDM", 
"ICDIAS", "HEMSGEMS", 
"GDDS", "MVDIEI", "EVE", "ADME", "IIIIII", "DDMIMIDVE", "SGIIIFI", "EDIIDD", 
"DDDMCDADI", "EMMAMVEIIESDEDS", 
"IDDEMSA", "SADAH", "DAIDICK", "DDDVDSI", "CED", "DISSDI", "AIIK", "IIME", 
"CIIIES", "CEMI", "IHE", "DES", 
"VVVVV", "DDDDVCEDS", "MADIDM", "IIIII", "IIV", "HEMDI", "DIIIECIIMG", "CDCA", 
"DECIEDCK", "MADCEI", 
"EKECVIIVE", "VVVV", "CIME", "III", "VVV", "VIM", "ACCDVMIAMI", "IIMED", "VII", 
"EMGIMEED", "MICDIAS", 
"EDIIIMG", "KDDISIE", "HAMSEI", "VICIDDIA", "IEMAEDIS", "DIIVIED", "ADIISI", 
"MACHIEM", "IHAMKS", "DIDECIDD", 
"VAMMICD", "CEMIDE", "ASSISIAMI", "VVVIVEDSVSDDDDVCIIDMIDE", "CDIDD", "SDVMD", 
"VIIH", "SVDDDDI", 
"DEIED", "DDDMCKADI", "IEIEDISIDIDVIEVDS", "ISSAKA", "CDIA", "IEDDME", "EVMAVDI"

Original comment by charles....@kmt.hku.nl on 8 Jul 2009 at 8:56

GoogleCodeExporter commented 9 years ago

The latest version is not yet in. For me this means reanalysing everything, so
my current plan is to do this after switching to the 9.04 machine, as this also
serves as a stress test for the new installation. I hope that, regarding the
memleak issue and also the quality of recognition, the newer tesseract version
shipped with 9.04 will bring some improvement. So expect new data in ~2 weeks ...

Original comment by alu...@gmail.com on 8 Jul 2009 at 1:29

GoogleCodeExporter commented 9 years ago

Just to add...

We've just found that the version of the filter previously delivered to
Andree needs some updates to adapt it to the current version of the
indexing engine. I've just asked my colleague, Andrzej Głowacz (AGH), to adapt
the script. This seems to be a simple thing, so I believe you can expect this soon.

Anyway, some of the words found by Charles do not look problematic (IMO).
Examples: "AMD", "DES" or "EVE".

Some of them are also definitely not easy to filter out, as they really "look
like" regular words. The only way to filter them out is to use a dictionary
(which we didn't want to do, in order to keep all the "own-names"). Examples:
"SIVDID", "KIMDS", "CIMEMA", "GEMI", "ISSAKA", "HAMSEI", "CIME", "EDIC", "CED".

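To illustrate the trade-off described above, here is a minimal sketch of a dictionary check combined with a separate list of names to keep. Both word sets are tiny hypothetical stand-ins (a real setup would load much larger lists), and this is not the filter delivered for GAMA.

```python
# Hypothetical data: a tiny dictionary and a list of names that must survive.
DICTIONARY = {"cinema", "design", "editor", "sound"}
OWN_NAMES = {"bronckart", "hensgens"}

def keep(token, dictionary=DICTIONARY, names=OWN_NAMES):
    """Keep a token if it is a dictionary word or a known name."""
    t = token.lower()
    return t in dictionary or t in names

tokens = ["CIMEMA", "CINEMA", "GEMI", "HENSGENS", "EDIC"]

# With the name list, the name survives next to the dictionary word.
print([t for t in tokens if keep(t)])               # ['CINEMA', 'HENSGENS']

# With the dictionary alone, the name is lost as well.
print([t for t in tokens if keep(t, names=set())])  # ['CINEMA']
```

The second call shows why a plain dictionary was not wanted: word-like noise such as "CIMEMA" is removed, but so is every name that is not in the dictionary.
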
Original comment by mikolaj....@gmail.com on 9 Jul 2009 at 6:31

GoogleCodeExporter commented 9 years ago

I've integrated the new version of filtering scripts.

Old output for the argos video above (gama:argos:main:Manifestation:2671) was:
111
2005
444
aaa
accountant
alix
andrin
arne
artist
assistant
aurelie
benoj1
bronckart
catrvsse
ceb
cent
centre
cfb
cilles
cine
cinema
cinematographv
coca
codon
cola
color
communaute
declerck
derattice
des
design
diiiecting
director
dominioue
editing
editor
emmanueliespers
engineer
eric
eve
evnaud1
evnaudi
executive
francaise
gdos
gent
hansel
henri
hensgens
icdlas
iii
iiiii
iiiiii
iiv
induction
issaka
jerome
kinds
kortste
lebeer
lenaerts
lerov
line
lorenza
marc
marcel
marianne
marion
martin
matthias
muriel
nachten
nlcolas
olivier
patrick
peter
prodlicer
produced
producer
producers
production
provost
rastien
rissot
rroncrart
sacre
sarah
sawadogd
sgiiift
sound
support
teledistributeurs
thanks
the
timer
vannicr
versus
victoria
vii
vovk
vvv
vvvv
vvvvv
wallons
wim
with
www
www1versusproduction1be

New output is:
0
1
11
13
14
1ean—francois
1r
1·
1‘
2
2·
3
4
41
44
4·
4»
4‘
5
6
7
71
7·
9
acc¤u»m1m
ag
alix
andrin
arne
artist
arto
assistant
aud
aurelie
av
a·
b
bastien
belcioue
belcique
benoit
bissot
boutet
bronckart
c
catrvsse
catrysse
cent
centre
cfb
cine
cinema
cinematography
coca
codon
cola
colasprovost
color
communaute
d
debattice
declerck
directing
director
dominioue
dominique
editing
editor
ei
emmanuel
engineer
eric
ev
eve
executive
eynaudi
e·
f
ff
fg
fi
fj
fl
fr
francaise
fw
f¤
f·
f‘
g
gg
gi
gilles
goos
gy
g·
h
hansei
hansel
henri
hensgens
i1
i4
ia
iean
iespers
ig
ih
ii
iie
ij
iq
issaka
iv
iy
i·
i·i
i»
i‘
i’
j
jacque5—henrl
jean
jerome
jespers
jf
jg
jl
jq
jr
j‘
k
kinds
kortste
l
l4
lebeer
leiieer
lenaerts
leroy
lf
lg
li
line
ll
ln
lorenza
lr
lv
m
marc
marcel
marianne
marion
martin
matthias
michel
muriel
n
nachten
nf
ng
nicdlas
nicolas
nicolasprovost
nr
olivier
p
patrick
peter
produced
producer
producers
production
provost
q
qi
qq
qt
qv
q·
q‘
r
rf
ri
rr
r·
r»
r‘
s
sacre
sarah
sawadogo
schoffeniels
script
sound
support
s·
t
teledistributeurs
tellin
thanks
ti
timer
tk
tv
tw
t·
ues
up
v
v7
vannick
versus
versusproduct
versusproduction
vf
vg
vi
victoria
vii
viv
vj
vl
vovk
vr
vt
vv
vw
v·
v»
wallons
wg
wi
wig
wim
wl
wr
ww
w·
x
y
yi
yn
yr
yy
¢
¢·
¢‘
£
¤
¤f
¤r
¤¤
¤·
¤»
¤—
¤‘
¤•
¥
¥·
§
§·
§‘
°
·
·1
·4
·a
·e
·f
·g
·i
·l
·r
·s
·t
·v
·w
·y
·¢
·¤
·¥
·§
··
··¤
··»
··—
··‘
·»
·»·
·—
·‘
·’
·•
·•·
»
»i
»v
»¢
»¤
»§
»·
»··
»»
»—
»‘
—
—i
—¤
—·
——
—’
‘
‘i
‘r
‘v
‘vv
‘¤
‘§
‘°
‘·
‘»
‘•
’
’v
’y
’·
“
“‘
”
•
•1
•¤
•·
•»
•‘
•’
••
€
€·

I actually have the feeling that in the new output lots of meaningless special
characters occur that could easily be filtered ...
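
As a rough illustration of how such special characters could be filtered, the sketch below keeps only tokens made entirely of letters and digits. The function name and the minimum length are assumptions, and this is not part of the delivered filtering scripts.

```python
def is_clean(token: str, min_len: int = 3) -> bool:
    """Accept tokens that are long enough, purely alphanumeric, and not only digits."""
    # str.isalnum() is True only if every character is a letter or a digit,
    # so tokens containing characters such as '¤', '·' or '»' are rejected.
    return len(token) >= min_len and token.isalnum() and not token.isdigit()

sample = ["acc¤u»m1m", "alix", "a·", "·»·", "cinema", "44", "€·"]
print([t for t in sample if is_clean(t)])  # ['alix', 'cinema']
```

A rule this strict would also reject hyphenated names, so in practice a few separators might need to be allowed explicitly.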

@Toon&Charles: What is your feeling, as you've opened this issue?

Original comment by alu...@gmail.com on 20 Jul 2009 at 8:31

GoogleCodeExporter commented 9 years ago

I think we are dealing with three versions: the one I reported, which is not very
good; the current one, which seems pretty OK; and the last one, which seems OK but
worse than the current one. At least, from this sample.

It would be good to determine how this happens. The way I see it, the process has
roughly two parts: indexing and filtering out the bad ones. I can't say anything
about the indexing from this sample, but the filtering of the bad words
seems worse. In other words, is it worth checking the new results with the
current filtering and taking a look at the differences then?
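
One simple way to look at such differences is a set comparison of two keyword lists for the same video. This is only a sketch; the two short lists are hand-picked from the outputs posted in this thread, and the function name is an assumption.

```python
def compare_keywords(old, new):
    """Return which keywords were removed, added, or kept between two runs."""
    old_set, new_set = set(old), set(new)
    return {
        "removed": sorted(old_set - new_set),
        "added": sorted(new_set - old_set),
        "kept": sorted(old_set & new_set),
    }

old = ["aaa", "cinema", "cinematographv", "lerov", "vvv"]
new = ["cinema", "cinematography", "leroy", "·»·"]

diff = compare_keywords(old, new)
print(diff["removed"])  # ['aaa', 'cinematographv', 'lerov', 'vvv']
print(diff["added"])    # ['cinematography', 'leroy', '·»·']
print(diff["kept"])     # ['cinema']
```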

Original comment by charles....@kmt.hku.nl on 20 Jul 2009 at 9:46

GoogleCodeExporter commented 9 years ago

To proceed further, we first need Andree's indexing output from _all_ the GAMA
content _saved_ separately in *.txt files: one directory per movie and one txt
file per frame.

This is the step where we have the most integration problems.

Once we have the OCR output, we can run the newest version of the filtering and
compare the results.
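
A minimal sketch of how the layout described above (one directory per movie, one txt file per frame) could be read back so that a filter can be re-run without reanalysing the videos. The function names, the UTF-8 encoding, and the whitespace tokenisation are assumptions, not the actual GAMA scripts.

```python
from collections import defaultdict
from pathlib import Path

def load_raw_ocr(root):
    """Collect raw OCR tokens per movie from <root>/<movie>/<frame>.txt."""
    tokens_by_movie = defaultdict(list)
    for frame_file in sorted(Path(root).glob("*/*.txt")):
        movie = frame_file.parent.name
        # errors="replace" keeps going even if a frame file has broken bytes
        text = frame_file.read_text(encoding="utf-8", errors="replace")
        tokens_by_movie[movie].extend(text.split())
    return dict(tokens_by_movie)

def keywords(tokens, keep):
    """Apply a filter predicate and return the sorted, lower-cased keyword set."""
    return sorted({t.lower() for t in tokens if keep(t)})
```

Any filter version can then be passed in as `keep`, and the resulting lists compared per movie.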

Original comment by anc2...@gmail.com on 21 Jul 2009 at 7:16

GoogleCodeExporter commented 9 years ago

Sure, the latest version saves the raw output per frame, so the filtering can
then be touched without reanalysing. This will be installed on the new 9.04 machine
at the beginning of next week and then I'll reanalyse. In early August you can then
start a deeper analysis of this issue. For now I've integrated the update you sent
yesterday. Here's the output on the video above from the new version:

1ean—francois
44_
4_4
acc¤u»m1m
ag_
alix
andrin
arne
artist
assistant
aurelie
bastien
belcioue
belcique
benoit
bissot
boutet
bronckart
catrvsse
catrysse
cent
centre
cfb
cine
cinema
cinematography
coca
codon
cola
color
communaute
debattice
declerck
directing
director
dominioue
dominique
d_arto_s
editing
editor
ei_
emmanuel
engineer
eric
eve
executive
eynaudi
fi_
francaise
gg_
gilles
gi_
goos
hansei_
hansel
hensgens
ia_
iean_michel
iespers
if_
ii_
ii__
issaka
iv_
i_i
jacque5—henrl
jean_michel
jerome
jespers
jq_
kinds
kortste
lebeer
leiieer
lenaerts
leroy
lf_
line
ln_
lorenza
make_up
marc
marcel
marianne
marion
martin
matthias
muriel
nachten
nicdlas
nicolas
olivier
patrick
peter
produced
producer
producers
production
provost
rf_
sacre
sarah
sawadogo
schoffeniels
script
sound
support
teledistributeurs
tellin
thanks
timer
vannick
versus
ve_
victoria
vii_
viv
vi_
vovk
vv_
vv__
vw_
v_v
wallons
wim
wi_
yr_
_11
_44
_ii
_vv
_vv_
__44

Original comment by alu...@gmail.com on 21 Jul 2009 at 8:19

GoogleCodeExporter commented 9 years ago

The results are better; however, filtering is less restrictive in some cases for
shapex functionality. The level of detail can be adjusted further, but let us not
expect perfect matching against dictionary words. Remember that the input obtained
from OCR is _very_ messy for low-quality video.

Original comment by anc2...@gmail.com on 21 Jul 2009 at 11:25

GoogleCodeExporter commented 9 years ago

Nobody expects perfect matching, and it isn't needed, but the previous results
were very messy and caused unnecessary bloat. The results above look very good.
Maybe add some filtering on the _ and it will look very close to perfect, if there
is no reason to keep it in, of course. Nice work!
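
The filtering on the _ could be sketched as below. It assumes that inner underscores stand for a separator worth keeping (as in "jean_michel" or "make_up") and that only leading or trailing ones mark noise; the function name and the threshold of 4 are assumptions.

```python
def clean_underscores(tokens, min_len=4):
    """Strip edge underscores and drop short remainders that had them."""
    cleaned = []
    for tok in tokens:
        stripped = tok.strip("_")
        had_edge_underscore = stripped != tok
        if had_edge_underscore and len(stripped) < min_len:
            continue  # e.g. "ag_", "_vv_", "__44"
        if stripped:
            cleaned.append(stripped)
    return cleaned

sample = ["ag_", "hansei_", "jean_michel", "make_up", "_vv_", "__44", "vii_", "eve"]
print(clean_underscores(sample))
# ['hansei', 'jean_michel', 'make_up', 'eve']
```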

Original comment by charles....@kmt.hku.nl on 21 Jul 2009 at 12:27

GoogleCodeExporter commented 9 years ago

Below you will find new results. Some additional filtering on the _ was carried out.

accountant
alix
andrin
arne
artist
assistant
aurelie
benoj1
bronckart
catrvsse
ceb
cent
centre
cfb
cilles
cine
cinema
cinematographv
coca
codon
cola
color
commlinaute
communaute
d_arto1s
d_artois
declerck
derattice
director
dominioue
editing
editor
emmanueliespers
engineer
eric
eve
evnaud1
evnaudi
executive
francaise
gdos
gent
hansel
henri
hensgens
icdlas
iiv
induction
issaka
jacques_
jacques_henri
jean_francois
jean_michel
jerome
kinds
kortste
l_audiovisuel
lebeer
lenaert5
lenaerts
lerov
line
lorenza
maiie_up
marc
marcel
marianne
marion
martin
matthias
muriel
nachten
nlcolas
olivier
patrick
peter
prodlicer
produced
producer
producers
production
provost
rastien
rissot
rroncrart
sacre
sarah
sawadogd
schoffeniels
sound
support
teledistributeurs
tellir
thanks
timer
vannicr
versus
victoria
vii
vovk
wallons
wim

Original comment by anc2...@gmail.com on 23 Jul 2009 at 1:33

GoogleCodeExporter commented 9 years ago

Toon, Colleagues,
Have you got any further comments on the OCR filtering?
If not, I'd kindly ask Toon to close the issue.
Kind regards,
Mikołaj

Original comment by mikolaj....@gmail.com on 23 Jul 2009 at 10:08

GoogleCodeExporter commented 9 years ago

Colleagues,
Can we close this bug?
Toon?
Regards,
Mikołaj

Original comment by mikolaj....@gmail.com on 28 Jul 2009 at 9:51

GoogleCodeExporter commented 9 years ago

@mikolaj, 
Charles or I will look into this issue again once back from holidays and will
close it if all is well.
regards
Toon

Original comment by toon...@gmail.com on 3 Aug 2009 at 7:34

GoogleCodeExporter commented 9 years ago

Original comment by alu...@gmail.com on 19 Aug 2009 at 10:32

Changed state: Fixed

vsimko / gama-gateway

Many non-words in OCR results in work.idx_keyword #7