Closed GoogleCodeExporter closed 9 years ago
Added new component to the issue tracker Component-Indexing
Original comment by viliam.s...@gmail.com
on 28 Jan 2009 at 7:53
IDX keywords are still very polluted. What is the status of the work on the IDX
keywords? Is this the final version, or are there improvements planned?
Judging by feeling (not very objective, I know), I even get the impression that
it has gotten worse.
gama:montevideo:main:Work:733:
"SIVDID", "VVV", "KDM", "MVDISSMSDGY", "MAIMA", "DEDFDDMAMCE", "DMKDDAKVMF",
"MADCH",
"IDVISDADIISIEVAM", "MADIMA", "ADDAMDVI.", "IIII", "AMD", "IGID", "VVVVV",
"VVVV", "GASIEDEM", "MVM", "III"
gama:argos:main:Manifestation:2671:
"AMDDIM", "KIMDS", "CDDDM", "MADIIM", "MADIAMME", "VDVK", "DDDDVCED", "AAA",
"IEDEED", "CFD",
"VEDSVS", "MADC", "CIMEMA", "DEDAIIICE", "VAIIDMS", "DDDDIICED", "IEDDV",
"SACDE", "AVDEIIE",
"IMDVCIIDM", "SAVADDGD", "DEMDII", "DASIIEM", "CIMEMAIDGDADHV", "GEMI",
"DESIGM", "SDDS",
"FDAMCAISE", "CAIDVSSE", "CDMMVMAVIE", "MAIIHIAS", "EDIC", "DDDDVCIIDM",
"ICDIAS", "HEMSGEMS",
"GDDS", "MVDIEI", "EVE", "ADME", "IIIIII", "DDMIMIDVE", "SGIIIFI", "EDIIDD",
"DDDMCDADI", "EMMAMVEIIESDEDS",
"IDDEMSA", "SADAH", "DAIDICK", "DDDVDSI", "CED", "DISSDI", "AIIK", "IIME",
"CIIIES", "CEMI", "IHE", "DES",
"VVVVV", "DDDDVCEDS", "MADIDM", "IIIII", "IIV", "HEMDI", "DIIIECIIMG", "CDCA",
"DECIEDCK", "MADCEI",
"EKECVIIVE", "VVVV", "CIME", "III", "VVV", "VIM", "ACCDVMIAMI", "IIMED", "VII",
"EMGIMEED", "MICDIAS",
"EDIIIMG", "KDDISIE", "HAMSEI", "VICIDDIA", "IEMAEDIS", "DIIVIED", "ADIISI",
"MACHIEM", "IHAMKS", "DIDECIDD",
"VAMMICD", "CEMIDE", "ASSISIAMI", "VVVIVEDSVSDDDDVCIIDMIDE", "CDIDD", "SDVMD",
"VIIH", "SVDDDDI",
"DEIED", "DDDMCKADI", "IEIEDISIDIDVIEVDS", "ISSAKA", "CDIA", "IEDDME", "EVMAVDI"
Original comment by charles....@kmt.hku.nl
on 8 Jul 2009 at 8:56
The latest version is not yet in. Bringing it in means a complete reanalysis on
my side, so my current plan is to do this after switching to the 9.04 machine,
as this also serves as a stress test for the new installation. I also hope that
the newer tesseract version shipped with 9.04 will bring some improvement
regarding the memleak issue and the quality of recognition. So expect new data
in ~2 weeks ...
Original comment by alu...@gmail.com
on 8 Jul 2009 at 1:29
Just to add...
We've just found that the version of the filter previously delivered to Andree
needs some updates to adapt it to the current version of the indexing engine.
I've just asked my colleague, Andrzej Głowacz (AGH), to adapt the script. This
seems to be a simple thing, so I believe you can expect it soon.
Anyway, some of the words found by Charles do not look problematic (IMO).
Examples: "AMD", "DES" or "EVE".
Some of them are also definitely not easy to filter out, as they really "look
like" regular words. The only way to filter them out is to use a dictionary
(which we didn't want to do, in order to keep all the proper names). Examples:
"SIVDID", "KIMDS", "CIMEMA", "GEMI", "ISSAKA", "HAMSEI", "CIME", "EDIC", "CED".
Original comment by mikolaj....@gmail.com
on 9 Jul 2009 at 6:31
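A minimal sketch of what dictionary-free filtering, as discussed in the comment above, could look like. This is not the project's actual filter script, which is not shown in this thread; the function name `looks_like_noise`, the `min_len` threshold and the three rules are assumptions chosen for illustration.

```python
import re


def looks_like_noise(token: str, min_len: int = 3) -> bool:
    """Heuristic, dictionary-free check for OCR noise (illustrative only)."""
    t = token.strip().lower()
    if len(t) < min_len:
        return True   # too short to be a useful keyword
    if len(set(t)) == 1:
        return True   # a single repeated character, e.g. "iiii" or "vvv"
    if not re.search(r"[aeiouy]", t):
        return True   # no ASCII vowel at all, e.g. "kdm"
    return False


samples = ["IIII", "VVV", "KDM", "AMD", "CIMEMA", "ISSAKA"]
kept = [s for s in samples if not looks_like_noise(s)]
print(kept)  # ['AMD', 'CIMEMA', 'ISSAKA']
```

Note that "CIMEMA" and "ISSAKA" pass these rules, which matches the point made above: tokens that look like regular words cannot be told apart from names without a dictionary. The vowel rule only knows ASCII vowels, so it would need extending for accented text.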
I've integrated the new version of filtering scripts.
Old output for the argos video above (gama:argos:main:Manifestation:2671) was:
111
2005
444
aaa
accountant
alix
andrin
arne
artist
assistant
aurelie
benoj1
bronckart
catrvsse
ceb
cent
centre
cfb
cilles
cine
cinema
cinematographv
coca
codon
cola
color
communaute
declerck
derattice
des
design
diiiecting
director
dominioue
editing
editor
emmanueliespers
engineer
eric
eve
evnaud1
evnaudi
executive
francaise
gdos
gent
hansel
henri
hensgens
icdlas
iii
iiiii
iiiiii
iiv
induction
issaka
jerome
kinds
kortste
lebeer
lenaerts
lerov
line
lorenza
marc
marcel
marianne
marion
martin
matthias
muriel
nachten
nlcolas
olivier
patrick
peter
prodlicer
produced
producer
producers
production
provost
rastien
rissot
rroncrart
sacre
sarah
sawadogd
sgiiift
sound
support
teledistributeurs
thanks
the
timer
vannicr
versus
victoria
vii
vovk
vvv
vvvv
vvvvv
wallons
wim
with
www
www1versusproduction1be
New output is:
0
1
11
13
14
1ean—francois
1r
1·
1‘
2
2·
3
4
41
44
4·
4»
4‘
5
6
7
71
7·
9
acc¤u»m1m
ag
alix
andrin
arne
artist
arto
assistant
aud
aurelie
av
a·
b
bastien
belcioue
belcique
benoit
bissot
boutet
bronckart
c
catrvsse
catrysse
cent
centre
cfb
cine
cinema
cinematography
coca
codon
cola
colasprovost
color
communaute
d
debattice
declerck
directing
director
dominioue
dominique
editing
editor
ei
emmanuel
engineer
eric
ev
eve
executive
eynaudi
e·
f
ff
fg
fi
fj
fl
fr
francaise
fw
f¤
f·
f‘
g
gg
gi
gilles
goos
gy
g·
h
hansei
hansel
henri
hensgens
i1
i4
ia
iean
iespers
ig
ih
ii
iie
ij
iq
issaka
iv
iy
i·
i·i
i»
i‘
i’
j
jacque5—henrl
jean
jerome
jespers
jf
jg
jl
jq
jr
j‘
k
kinds
kortste
l
l4
lebeer
leiieer
lenaerts
leroy
lf
lg
li
line
ll
ln
lorenza
lr
lv
m
marc
marcel
marianne
marion
martin
matthias
michel
muriel
n
nachten
nf
ng
nicdlas
nicolas
nicolasprovost
nr
olivier
p
patrick
peter
produced
producer
producers
production
provost
q
qi
qq
qt
qv
q·
q‘
r
rf
ri
rr
r·
r»
r‘
s
sacre
sarah
sawadogo
schoffeniels
script
sound
support
s·
t
teledistributeurs
tellin
thanks
ti
timer
tk
tv
tw
t·
ues
up
v
v7
vannick
versus
versusproduct
versusproduction
vf
vg
vi
victoria
vii
viv
vj
vl
vovk
vr
vt
vv
vw
v·
v»
wallons
wg
wi
wig
wim
wl
wr
ww
w·
x
y
yi
yn
yr
yy
¢
¢·
¢‘
£
¤
¤f
¤r
¤¤
¤·
¤»
¤—
¤‘
¤•
¥
¥·
§
§·
§‘
°
·
·1
·4
·a
·e
·f
·g
·i
·l
·r
·s
·t
·v
·w
·y
·¢
·¤
·¥
·§
··
··¤
··»
··—
··‘
·»
·»·
·—
·‘
·’
·•
·•·
»
»i
»v
Ȣ
»¤
Ȥ
»·
»··
»»
»—
»‘
—
—i
—¤
—·
——
—’
‘
‘i
‘r
‘v
‘vv
‘¤
Ԥ
‘°
‘·
‘»
‘•
’
’v
’y
’·
“
“‘
”
•
•1
•¤
•·
•»
•‘
•’
••
€
€·
I actually have the feeling that the new output contains lots of meaningless
special characters that could easily be filtered ...
@Toon&Charles: What is your feeling, since you opened this issue?
Original comment by alu...@gmail.com
on 20 Jul 2009 at 8:31
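As a rough illustration of how the special-character tokens mentioned above could be dropped, here is a sketch that keeps only purely alphabetic tokens of a minimum length. The helper name `is_clean_token` and the `min_len` default are assumptions, not part of the filtering scripts discussed in this thread.

```python
def is_clean_token(token: str, min_len: int = 2) -> bool:
    """Keep tokens made only of letters (illustrative rule, not the real filter)."""
    # str.isalpha() is Unicode-aware: accented letters pass, while digits,
    # punctuation and symbols such as "¤", "·" or "»" make it return False.
    return len(token) >= min_len and token.isalpha()


raw = ["cinema", "acc¤u»m1m", "·", "¤·", "4‘", "jean", "b", "7"]
print([t for t in raw if is_clean_token(t)])  # ['cinema', 'jean']
```

A rule this strict also discards tokens that contain a dash or a digit, so misread compound names would be lost unless they are split or normalised first.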
I think we are dealing with three versions: the one I reported, which is not
very good; the current one, which seems pretty OK; and the latest one, which
seems OK but worse than the current one. At least, judging from this sample.
It would be good to determine how this happens. The way I see it, the process
has roughly two parts: indexing, and filtering out the bad words. I can't say
anything about the filtering from this sample, but the filtering of the bad
words seems worse. In other words, is it worth checking the new results with
the current filtering and then looking at the differences?
Original comment by charles....@kmt.hku.nl
on 20 Jul 2009 at 9:46
To proceed further, we first need Andree's indexing output from _all_ the GAMA
content, _saved_ separately in *.txt files: one directory per movie and one txt
file per frame.
This is the step where we have the most integration problems.
Once we have the OCR output, we can run the newest version of the filtering and
compare the results.
Original comment by anc2...@gmail.com
on 21 Jul 2009 at 7:16
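The workflow proposed above (one directory per movie, one txt file per frame, then run and compare filters) could be sketched roughly as follows. The directory and file names, the helpers `load_movie_tokens` and `compare_filters`, and the two toy filters are all hypothetical; the real filtering scripts are not shown in this thread.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Set


def load_movie_tokens(root: Path) -> Dict[str, Set[str]]:
    """Collect lower-cased tokens per movie from root/<movie>/<frame>.txt."""
    tokens: Dict[str, Set[str]] = {}
    for movie_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        words: Set[str] = set()
        for frame_file in sorted(movie_dir.glob("*.txt")):
            text = frame_file.read_text(encoding="utf-8", errors="replace")
            words.update(w.lower() for w in text.split())
        tokens[movie_dir.name] = words
    return tokens


def compare_filters(
    words: Set[str],
    old_filter: Callable[[str], bool],
    new_filter: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Show which tokens only one of the two filters would keep."""
    old_kept = {w for w in words if old_filter(w)}
    new_kept = {w for w in words if new_filter(w)}
    return {
        "only_old": sorted(old_kept - new_kept),
        "only_new": sorted(new_kept - old_kept),
    }


# Tiny demo on a throwaway directory tree (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    movie = Path(tmp) / "movie_2671"
    movie.mkdir()
    (movie / "frame_0001.txt").write_text("Cinema VVV", encoding="utf-8")
    (movie / "frame_0002.txt").write_text("director iii ·", encoding="utf-8")
    tokens = load_movie_tokens(Path(tmp))

diff = compare_filters(
    tokens["movie_2671"],
    old_filter=lambda w: len(w) >= 3,
    new_filter=lambda w: len(w) >= 3 and len(set(w)) > 1,
)
print(diff)  # {'only_old': ['iii', 'vvv'], 'only_new': []}
```

Keeping the raw per-frame text on disk is what makes this comparison cheap: the filters can be changed and re-run without repeating the OCR step.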
Sure, the latest version saves the raw output per frame, so the filtering can
then be adjusted without reanalysing. This will be installed on the new 9.04
machine at the beginning of next week, and then I'll reanalyse. In early August
you can then start a deeper analysis of this issue. For now I've integrated the
update you sent yesterday. Here's the output for the video above from the new
version:
1ean—francois
44_
4_4
acc¤u»m1m
ag_
alix
andrin
arne
artist
assistant
aurelie
bastien
belcioue
belcique
benoit
bissot
boutet
bronckart
catrvsse
catrysse
cent
centre
cfb
cine
cinema
cinematography
coca
codon
cola
color
communaute
debattice
declerck
directing
director
dominioue
dominique
d_arto_s
editing
editor
ei_
emmanuel
engineer
eric
eve
executive
eynaudi
fi_
francaise
gg_
gilles
gi_
goos
hansei_
hansel
hensgens
ia_
iean_michel
iespers
if_
ii_
ii__
issaka
iv_
i_i
jacque5—henrl
jean_michel
jerome
jespers
jq_
kinds
kortste
lebeer
leiieer
lenaerts
leroy
lf_
line
ln_
lorenza
make_up
marc
marcel
marianne
marion
martin
matthias
muriel
nachten
nicdlas
nicolas
olivier
patrick
peter
produced
producer
producers
production
provost
rf_
sacre
sarah
sawadogo
schoffeniels
script
sound
support
teledistributeurs
tellin
thanks
timer
vannick
versus
ve_
victoria
vii_
viv
vi_
vovk
vv_
vv__
vw_
v_v
wallons
wim
wi_
yr_
_11
_44
_ii
_vv
_vv_
__44
Original comment by alu...@gmail.com
on 21 Jul 2009 at 8:19
The results are better; however, filtering is less restrictive in some cases for
shapex functionality. The level of detail can be adjusted further, but let us
not expect perfect matching against dictionary words. Remember that the input
obtained from OCR is _very_ messy for low-quality video.
Original comment by anc2...@gmail.com
on 21 Jul 2009 at 11:25
Nobody expects perfect matching, and it isn't needed, but the previous results
were very messy and caused unnecessary bloat. The results above look very good.
Maybe add some filtering on the _ and it would be very close to perfect, unless
there is a reason to keep it in, of course. Nice work!
Original comment by charles....@kmt.hku.nl
on 21 Jul 2009 at 12:27
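One possible way to filter on the _ as suggested above is to strip underscores at the edges, keep single inner ones (they seem to stand for separators in compound names), and drop whatever is too short afterwards. This is only a sketch under those assumptions; the helper `clean_underscores` and the `min_letters` threshold are hypothetical, and the filtering actually applied in this thread is not described.

```python
import re
from typing import Optional


def clean_underscores(token: str, min_letters: int = 3) -> Optional[str]:
    """Strip edge underscores; return None if too little is left."""
    stripped = token.strip("_")
    # Collapse runs of inner underscores into one: "a__b" -> "a_b".
    stripped = re.sub(r"_+", "_", stripped)
    if len(stripped.replace("_", "")) < min_letters:
        return None
    return stripped


samples = ["hansei_", "jean_michel", "vv__", "_44", "make_up", "i_i"]
cleaned = []
for s in samples:
    result = clean_underscores(s)
    if result is not None:
        cleaned.append(result)
print(cleaned)  # ['hansei', 'jean_michel', 'make_up']
```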
Below you will find new results. Some additional filtering on the _ was carried out.
accountant
alix
andrin
arne
artist
assistant
aurelie
benoj1
bronckart
catrvsse
ceb
cent
centre
cfb
cilles
cine
cinema
cinematographv
coca
codon
cola
color
commlinaute
communaute
d_arto1s
d_artois
declerck
derattice
director
dominioue
editing
editor
emmanueliespers
engineer
eric
eve
evnaud1
evnaudi
executive
francaise
gdos
gent
hansel
henri
hensgens
icdlas
iiv
induction
issaka
jacques_
jacques_henri
jean_francois
jean_michel
jerome
kinds
kortste
l_audiovisuel
lebeer
lenaert5
lenaerts
lerov
line
lorenza
maiie_up
marc
marcel
marianne
marion
martin
matthias
muriel
nachten
nlcolas
olivier
patrick
peter
prodlicer
produced
producer
producers
production
provost
rastien
rissot
rroncrart
sacre
sarah
sawadogd
schoffeniels
sound
support
teledistributeurs
tellir
thanks
timer
vannicr
versus
victoria
vii
vovk
wallons
wim
Original comment by anc2...@gmail.com
on 23 Jul 2009 at 1:33
Toon, Colleagues,
Do you have any further comments on the OCR filtering?
If not, I'd kindly ask Toon to close the issue.
Kind regards,
Mikołaj
Original comment by mikolaj....@gmail.com
on 23 Jul 2009 at 10:08
Colleagues,
Can we close this bug?
Toon?
Regards,
Mikołaj
Original comment by mikolaj....@gmail.com
on 28 Jul 2009 at 9:51
@mikolaj,
Charles or I will look into this issue again once back from holidays and will
close it if all is well.
regards
Toon
Original comment by toon...@gmail.com
on 3 Aug 2009 at 7:34
Original comment by alu...@gmail.com
on 19 Aug 2009 at 10:32
Original issue reported on code.google.com by
toon...@gmail.com
on 27 Jan 2009 at 3:10