Closed tgalery closed 6 years ago
What is this, description, connected issue?
Connects to https://github.com/idio/json-wikipedia/issues/45 I think?
I still haven't gone through it all but to start with...
de/tudarmstadt/ukp/wikipedia/
, when that is gone from the source?
@heather some replies:
So, diffing the pairCounts output obtained from a previous jsonpedia dump against the one this PR produced, we get a 5 megabyte file. In 80% of cases the differences are either counts that differ by +1 or -1 for the same line, or pairs with exactly 1 count, which we remove anyway. Looking at the diff, these cases seem interesting:
Interesting diff cases:
5181c5180,5181 [Why are we not removing initial colon]
< :07 Seconds or Less My Season on the Bench with the Runnin' and Gunnin' Phoenix Suns http://dbpedia.org/resource/07_Seconds_or_Less 2
---
> :07 Seconds or Less My Season on the Bench with the Runnin' and Gunnin' Phoenix Suns http://dbpedia.org/resource/07_Seconds_or_Less 1
> 07 Seconds or Less My Season on the Bench with the Runnin' and Gunnin' Phoenix Suns http://dbpedia.org/resource/07_Seconds_or_Less 1
5926c5926
6920d6919
< 10,000 BCE http://dbpedia.org/resource/10th_millennium_BC 1
7422c7421
< 1000 Lakes Rally http://dbpedia.org/resource/Rally_Finland 62
---
> 1000 Lakes Rally http://dbpedia.org/resource/Rally_Finland 60
Some removed cases, might be good to understand
< 1000 Pillar Temple http://dbpedia.org/resource/Saavira_Kambada_Basadi 1
7816d7813
< 1,000-rupiah banknote http://dbpedia.org/resource/Indonesian_rupiah 1
< 1040 ST http://dbpedia.org/resource/Atari_ST 1
15998d15990
< 10412 class patrol boat http://dbpedia.org/resource/Project_10412-class_patrol_boat 1
17945d17936
< 1064 nm laser http://dbpedia.org/resource/Nd:YAG_laser 1
18531,18532c18522
< 107-123 Muswell Hill Road http://dbpedia.org/resource/107–123_Muswell_Hill_Road 1
< 107–123 Muswell Hill Road http://dbpedia.org/resource/107–123_Muswell_Hill_Road 2
< 1861 Lincoln "solferino" china http://dbpedia.org/resource/China_service_of_the_Lincoln_administration 1
Language name spaces thing
< 133 départements http://dbpedia.org/resource/130_departments_of_the_First_French_Empire 1
< भैरव http://dbpedia.org/resource/Bhairava 1
14395288d14383789
< शिव http://dbpedia.org/resource/Shiva 1
Double space in sf
< 16th arrondissement http://dbpedia.org/resource/16th_arrondissement_of_Paris 135
< 16th arrondissement http://dbpedia.org/resource/16th_arrondissement_of_Paris 135
---
> 16th arrondissement http://dbpedia.org/resource/16th_arrondissement_of_Paris 134
> 16th arrondissement http://dbpedia.org/resource/16th_arrondissement_of_Paris 134
Weird Spaces
< Zurich, Switzerland http://dbpedia.org/resource/Zürich 188
< Zurich,Switzerland http://dbpedia.org/resource/Zürich 188
< Zürich, Switzerland http://dbpedia.org/resource/Zürich 587
---
> Zurich, Switzerland http://dbpedia.org/resource/Zürich 187
> Zurich,Switzerland http://dbpedia.org/resource/Zürich 187
> Zürich, Switzerland http://dbpedia.org/resource/Zürich 586
Trailing punctuation
> А. Koroviakov. http://dbpedia.org/resource/Alexander_Koroviakov 1
14392992a14381495
> А. Naumov. http://dbpedia.org/resource/Alexander_Naumov 3
14393307d14381809
< ГАЗ-233034 http://dbpedia.org/resource/GAZ_Tigr 2
14393449a14381952
> Е. Kostenko. http://dbpedia.org/resource/Elena_Kostenko 1
14393728c14382231
< К. Rumiantseva. http://dbpedia.org/resource/Kapitolina_Rumiantseva 1
---
> К. Rumiantseva. http://dbpedia.org/resource/Kapitolina_Rumiantseva 3
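To separate the noise from the interesting cases, the comparison described at the top of this comment could be automated roughly as follows. This is a minimal sketch, not the tooling actually used here: the class `PairCountsDiff`, its `Kind` buckets, and the assumption that pairCounts lines are tab-separated (`surface form<TAB>URI<TAB>count`) are all hypothetical.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: buckets the differences between two pairCounts files.
final class PairCountsDiff {

    // OFF_BY_ONE: pair present in both files, counts differ by exactly 1.
    // SINGLETON:  pair present in only one file, with a count of 1.
    // OTHER:      everything else, i.e. the cases worth reading by hand.
    enum Kind { OFF_BY_ONE, SINGLETON, OTHER }

    private PairCountsDiff() {}

    // Assumes "surfaceForm<TAB>uri<TAB>count" lines; the key is everything before the last tab.
    static Map<String, Integer> parse(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            int lastTab = line.lastIndexOf('\t');
            if (lastTab < 0) {
                continue; // skip lines without a count column
            }
            String key = line.substring(0, lastTab);
            int count = Integer.parseInt(line.substring(lastTab + 1).trim());
            counts.merge(key, count, Integer::sum);
        }
        return counts;
    }

    // Counts how many changed pairs fall into each bucket; unchanged pairs are ignored.
    static Map<Kind, Integer> classify(Map<String, Integer> before, Map<String, Integer> after) {
        Map<Kind, Integer> summary = new EnumMap<>(Kind.class);
        Set<String> keys = new HashSet<>(before.keySet());
        keys.addAll(after.keySet());
        for (String key : keys) {
            int b = before.getOrDefault(key, 0);
            int a = after.getOrDefault(key, 0);
            if (a == b) {
                continue;
            }
            Kind kind;
            if (a > 0 && b > 0) {
                kind = Math.abs(a - b) == 1 ? Kind.OFF_BY_ONE : Kind.OTHER;
            } else {
                kind = Math.max(a, b) == 1 ? Kind.SINGLETON : Kind.OTHER;
            }
            summary.merge(kind, 1, Integer::sum);
        }
        return summary;
    }
}
```

Only the pairs that land in `OTHER` (for example a pair that disappears with a count of 2, or a count that moves by more than 1) would then need manual inspection.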
I have the feeling that in our port of the Darmstadt lib we might be doing some normalisation (or else that is done in the proper lib now). It would be important to determine that, and if no normalisation is done, we should unify that somewhere (maybe in wikistats).
Other than that, we are generating the files to create a full model for benchmarking.
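If the normalisation does end up living in one place, it could look something like the sketch below. Everything here is an assumption for illustration: the class `SurfaceFormNormalizer` does not exist in either library, and whether each rule (dropping a leading colon, collapsing unusual whitespace, dropping trailing punctuation) is actually desirable is exactly the open question raised by the diff cases above.

```java
import java.util.regex.Pattern;

// Hypothetical single home for surface form normalisation.
final class SurfaceFormNormalizer {

    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("[\\s\\p{Z}]+");
    private static final Pattern LEADING_COLON = Pattern.compile("^:+");
    private static final Pattern TRAILING_PUNCTUATION = Pattern.compile("[.,;:]+$");

    private SurfaceFormNormalizer() {}

    static String normalize(String surfaceForm) {
        // Collapse runs of whitespace, including non-breaking spaces, into one plain space.
        String result = WHITESPACE_RUN.matcher(surfaceForm).replaceAll(" ").trim();
        // Drop a leading colon, as in ":07 Seconds or Less".
        result = LEADING_COLON.matcher(result).replaceFirst("");
        // Drop trailing punctuation, as in "A. Naumov."; this rule also strips the
        // final period of abbreviations, so it may be too aggressive.
        result = TRAILING_PUNCTUATION.matcher(result).replaceFirst("");
        return result.trim();
    }
}
```

With these rules, `16th  arrondissement` (double space) and `16th arrondissement` collapse into one surface form, so their counts would be merged instead of being reported twice.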
.... for all the nitpicking.
What IDE are you using? In Eclipse, nearly everything I've mentioned shows up as a warning. I'm a big fan of Ctrl+Alt+O to "organise" imports (remove unused and alphabetise the remainder) - your IDE should have an equivalent.
I am loving all the red though! Definitely worth doing, and hopefully a step towards getting more in line with upstream too 👍
@hmcc I fixed some of the notes, will do the rest after lunch :) Nitpicking is good, it helps me learn the code better!
Can we/should we be using Maven Central?
EDIT: sorry, I see we are using Maven Central for de.tudarmstadt.ukp.wikipedia itself, the additional repo is for dependencies of de.tudarmstadt.ukp.wikipedia, right?
We need the de.tudarmstadt.ukp.wikipedia repository to get the 1.2.0-SNAPSHOT version, otherwise we get 1.1.0.
Ah, OK... in that case why SNAPSHOT and not the latest stable version (and the version that the upstream version uses)?
Much better!
Still to do:
new HashSet<> in Namespaces.java
@hmcc I can answer the first question: 1.2.0-SNAPSHOT is the most recent and supported version of the library. I think that for some, 1.1.0 might not even be available in the repositories anymore.
1.2.0-SNAPSHOT is the most recent and supported version of the library
I don't know this library, or the developers, but that's not how it appears to me. 1.2.0 isn't mentioned on the releases page. 1.1.0 is reasonably recent (2016) and all the 1.1.0 dependencies we use are in Maven Central.
Unless you know something I don't, or we need a feature in one of the 35 commits since this release, I'd be very wary of depending on what looks to be a nightly build. Not saying we shouldn't do it, but I'd like to understand the reason a bit better.
Nearly there @stathischaritos with the nitpicks:
static final for the regexes
precision: 0.45492908358573914
recall: 0.5367404474836542
fScore: 0.49246010338508955
precision: 0.4490344524383545
recall: 0.5411468330134357
fScore: 0.4908062299724581
precision: 0.6596273183822632
recall: 0.32491968793024323
fScore: 0.4353797250204602
precision: 0.662124514579773
recall: 0.32706134312375706
fScore: 0.4378455906281876
precision: 0.4449195861816406
recall: 0.5491842610364683
fScore: 0.49158412341242286
precision: 0.657878577709198
recall: 0.3282851460914793
fScore: 0.4380038725443337
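The fScore values above are consistent with the usual F1 definition, the harmonic mean of precision and recall. That is an inference from the numbers, since the thread does not say how the benchmark computes them. A minimal sketch for re-checking a row by hand (the class name `FScore` is hypothetical):

```java
// Hypothetical helper for sanity-checking the benchmark rows above.
final class FScore {

    private FScore() {}

    // F1 = 2PR / (P + R); defined as 0 when both inputs are 0 to avoid dividing by zero.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // First row of the results above.
        System.out.println(f1(0.45492908358573914, 0.5367404474836542));
    }
}
```

Comparing with a small tolerance rather than for exact equality is safer, since the reported precision values look like single-precision floats widened to doubles.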
I still see some jpgs left in the surface forms; grep counted 160 of them :(. For the previous PR model that count was ~8.5K, and for the dev model it's around 100.
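For a repeatable version of that grep, something like the following could be turned into a regression check. It is only a sketch under assumptions: the class `ImageSurfaceFormCounter` is hypothetical, the input is assumed to be tab-separated with the surface form in the first column, and matching `.jpg`/`.jpeg` case-insensitively may count slightly differently from the grep used above.

```java
import java.util.regex.Pattern;

// Hypothetical check: counts lines whose surface form still contains a JPEG file name.
final class ImageSurfaceFormCounter {

    private static final Pattern IMAGE_EXTENSION =
            Pattern.compile("\\.jpe?g\\b", Pattern.CASE_INSENSITIVE);

    private ImageSurfaceFormCounter() {}

    static long countImageSurfaceForms(Iterable<String> lines) {
        long matches = 0;
        for (String line : lines) {
            // Only look at the first column so that URIs containing ".jpg" are not counted.
            int tab = line.indexOf('\t');
            String surfaceForm = tab < 0 ? line : line.substring(0, tab);
            if (IMAGE_EXTENSION.matcher(surfaceForm).find()) {
                matches++;
            }
        }
        return matches;
    }
}
```

Fed with the lines of a pairCounts file (for example via `java.nio.file.Files.readAllLines`), this gives a single number that can be compared across the PR, previous PR, and dev models.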
squash please
Connects to #47