idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

Removing built-in dkpro dep and using official dep #49

Closed tgalery closed 6 years ago

tgalery commented 6 years ago

Connects to #47

tinychaos42 commented 6 years ago

What is this? Can you add a description and link the connected issue?

hmcc commented 6 years ago

Connects to https://github.com/idio/json-wikipedia/issues/45 I think?

hmcc commented 6 years ago

I still haven't gone through it all but to start with...

tgalery commented 6 years ago

@hmcc some replies:

tgalery commented 6 years ago

Diffing the pairCounts output from a previous jsonpedia dump against the one this branch produces gives a 5 MB file. About 80% of the differences are counts that are off by one (+1 or -1) for the same line, or pairs with a count of exactly 1, which we remove anyway. Looking at the diff, these cases seem interesting:

6920d6919 
< 10,000&nbsp;BCE   http://dbpedia.org/resource/10th_millennium_BC  1
7422c7421
< 1000 Lakes Rally  http://dbpedia.org/resource/Rally_Finland   62
---
> 1000 Lakes Rally  http://dbpedia.org/resource/Rally_Finland   60
> А. Koroviakov.    http://dbpedia.org/resource/Alexander_Koroviakov    1
14392992a14381495
> А. Naumov.    http://dbpedia.org/resource/Alexander_Naumov    3
14393307d14381809
< ГАЗ-233034    http://dbpedia.org/resource/GAZ_Tigr    2
14393449a14381952
> Е. Kostenko.  http://dbpedia.org/resource/Elena_Kostenko  1
14393728c14382231
< К. Rumiantseva.   http://dbpedia.org/resource/Kapitolina_Rumiantseva  1
---
> К. Rumiantseva.   http://dbpedia.org/resource/Kapitolina_Rumiantseva  3

I have the feeling that in our port of the darmstadt lib we might be doing some normalisation (or else that is now done in the official lib). It would be important to determine that, and if no normalisation is done, we should unify it somewhere (maybe in wikistats).
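
A minimal sketch (class name made up, not project code) of why those pairs diverge: the initials in the diff above start with Cyrillic letters that are homoglyphs of Latin ones, so naive string comparison treats them as different surface forms, and NFKC normalisation alone would not merge them:

```java
import java.text.Normalizer;

// Sketch only: flag surface forms whose leading character is Cyrillic,
// as in the "А. Naumov." / "К. Rumiantseva." lines in the diff above.
public class HomoglyphCheck {
    static boolean startsWithCyrillic(String surfaceForm) {
        if (surfaceForm.isEmpty()) {
            return false;
        }
        int cp = surfaceForm.codePointAt(0);
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC;
    }

    public static void main(String[] args) {
        String cyrillic = "\u0410. Naumov."; // Cyrillic А (U+0410)
        String latin = "A. Naumov.";         // Latin A (U+0041)
        System.out.println(startsWithCyrillic(cyrillic)); // true
        System.out.println(startsWithCyrillic(latin));    // false
        // NFKC does not fold homoglyphs across scripts, so Unicode
        // normalisation by itself would not unify these two pairs.
        String folded = Normalizer.normalize(cyrillic, Normalizer.Form.NFKC);
        System.out.println(folded.equals(latin)); // false
    }
}
```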

Other than that, we are generating the files to create a full model for benchmarking.

hmcc commented 6 years ago

Sorry for all the nitpicking.

What IDE are you using? In Eclipse, nearly everything I've mentioned shows up as a warning. I'm a big fan of Ctrl+Alt+O to "organise" imports (remove unused and alphabetise the remainder) - your IDE should have an equivalent.

I am loving all the red though! Definitely worth doing, and hopefully a step towards getting more in line with upstream too 👍

stathischaritos commented 6 years ago

@hmcc I fixed some of the notes, will do the rest after lunch :) Nitpicking is good, it helps me learn the code better!

hmcc commented 6 years ago

Can we/should we be using Maven Central?

EDIT: sorry, I see we are using Maven Central for de.tudarmstadt.ukp.wikipedia itself, the additional repo is for dependencies of de.tudarmstadt.ukp.wikipedia, right?

stathischaritos commented 6 years ago

We need the de.tudarmstadt.ukp.wikipedia repository to get the 1.2.0-SNAPSHOT version; otherwise we get 1.1.0.
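
For anyone following along, a minimal sketch of the pom.xml entries being discussed; the repository id and URL are placeholders, and the JWPL parser module is used as an example artifact:

```xml
<!-- Sketch only: the repository id and URL below are placeholders,
     not copied from this PR's actual pom.xml -->
<repositories>
  <repository>
    <id>ukp-oss-snapshots</id>
    <url>https://example.org/ukp-oss-snapshots</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<!-- Maven Central only serves the 1.1.0 release; the extra snapshot
     repository is what resolves 1.2.0-SNAPSHOT -->
<dependency>
  <groupId>de.tudarmstadt.ukp.wikipedia</groupId>
  <artifactId>de.tudarmstadt.ukp.wikipedia.parser</artifactId>
  <version>1.2.0-SNAPSHOT</version>
</dependency>
```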

hmcc commented 6 years ago

Ah, OK... in that case why SNAPSHOT and not the latest stable version (the version that upstream uses)?

hmcc commented 6 years ago

Much better!

Still to do:

tgalery commented 6 years ago

@hmcc I can answer the first question: 1.2.0-SNAPSHOT is the most recent and supported version of the library. I think for some of them 1.1.0 might not even be available in the repositories anymore.

hmcc commented 6 years ago

1.2.0-SNAPSHOT is the most recent and supported version of the library

I don't know this library, or the developers, but that's not how it appears to me. 1.2.0 isn't mentioned on the releases page. 1.1.0 is reasonably recent (2016) and all the 1.1.0 dependencies we use are in Maven Central.

Unless you know something I don't, or we need a feature in one of the 35 commits since this release, I'd be very wary of depending on what looks to be a nightly build. Not saying we shouldn't do it, but I'd like to understand the reason a bit better.

hmcc commented 6 years ago

Nearly there @stathischaritos with the nitpicks:

stathischaritos commented 6 years ago

CONLL Dataset on PR branch

precision: 0.45492908358573914
recall: 0.5367404474836542
fScore: 0.49246010338508955

CONLL Dataset on Dev branch

precision: 0.4490344524383545
recall: 0.5411468330134357
fScore: 0.4908062299724581

CSAW Dataset on PR branch

precision: 0.6596273183822632
recall: 0.32491968793024323
fScore: 0.4353797250204602

CSAW Dataset on Dev branch

precision: 0.662124514579773
recall: 0.32706134312375706
fScore: 0.4378455906281876
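
For reference, the fScore lines above are the harmonic mean of the corresponding precision and recall; a minimal sketch (class name made up) that reproduces the first one:

```java
// Sketch only: fScore as the harmonic mean of precision and recall,
// F1 = 2PR / (P + R).
public class FScoreCheck {
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // CONLL / PR branch numbers from above; prints ~0.4924601,
        // matching the reported fScore
        System.out.println(f1(0.45492908358573914, 0.5367404474836542));
    }
}
```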

stathischaritos commented 6 years ago

CONLL Dataset on updated PR branch

precision: 0.4449195861816406
recall: 0.5491842610364683
fScore: 0.49158412341242286

CSAW Dataset on updated PR branch

precision: 0.657878577709198
recall: 0.3282851460914793
fScore: 0.4380038725443337

I still see some jpgs left in the surface forms; grep counted 160 of them :(. For the previous PR model that count was ~8.5K, and for the dev model it's around 100.
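
A minimal sketch of that kind of count (the pairCounts file name and tab-separated column layout are assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Sketch only: count lines whose surface-form column still mentions ".jpg".
public class JpgLeftovers {
    public static void main(String[] args) throws IOException {
        long count = Files.lines(Paths.get("pairCounts"))
                .map(line -> line.split("\t")[0]) // surface form column
                .filter(sf -> sf.toLowerCase(Locale.ROOT).contains(".jpg"))
                .count();
        System.out.println(count + " surface forms still mention .jpg");
    }
}
```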

hmcc commented 6 years ago

squash please