idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

Removing built-in dkpro dep and using official dep #49

Closed tgalery closed 6 years ago

tgalery commented 6 years ago

Connects to #47

tinychaos42 commented 6 years ago

What is this? Can you add a description and link the connected issue?

hmcc commented 6 years ago

Connects to https://github.com/idio/json-wikipedia/issues/45 I think?

hmcc commented 6 years ago

I still haven't gone through it all but to start with...

tgalery commented 6 years ago

@hmcc some replies:

tgalery commented 6 years ago

Diffing the pairCounts output from a previous jsonpedia dump against the one this branch produces gives a 5 MB file. About 80% of the differences are counts that are off by one (+1 or -1) for the same line, or pairs with a count of exactly 1, which we remove anyway. Looking at the diff, these cases seem interesting:

6920d6919 
< 10,000&nbsp;BCE   http://dbpedia.org/resource/10th_millennium_BC  1
7422c7421
< 1000 Lakes Rally  http://dbpedia.org/resource/Rally_Finland   62
---
> 1000 Lakes Rally  http://dbpedia.org/resource/Rally_Finland   60
> А. Koroviakov.    http://dbpedia.org/resource/Alexander_Koroviakov    1
14392992a14381495
> А. Naumov.    http://dbpedia.org/resource/Alexander_Naumov    3
14393307d14381809
< ГАЗ-233034    http://dbpedia.org/resource/GAZ_Tigr    2
14393449a14381952
> Е. Kostenko.  http://dbpedia.org/resource/Elena_Kostenko  1
14393728c14382231
< К. Rumiantseva.   http://dbpedia.org/resource/Kapitolina_Rumiantseva  1
---
> К. Rumiantseva.   http://dbpedia.org/resource/Kapitolina_Rumiantseva  3

I have the feeling that in our port of the darmstadt lib we might be doing some normalisation (or else that is now done in the official lib). It would be important to determine that, and if no normalisation is done, we should unify it somewhere (maybe in wikistats).
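
A minimal sketch (class name made up, not project code) of why those pairs diverge: the initials in the diff above start with Cyrillic letters that are homoglyphs of Latin ones, so naive string comparison treats them as different surface forms, and NFKC normalisation alone would not merge them:

```java
import java.text.Normalizer;

// Sketch only: flag surface forms whose leading character is Cyrillic,
// as in the "А. Naumov." / "К. Rumiantseva." lines in the diff above.
public class HomoglyphCheck {
    static boolean startsWithCyrillic(String surfaceForm) {
        if (surfaceForm.isEmpty()) {
            return false;
        }
        int cp = surfaceForm.codePointAt(0);
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC;
    }

    public static void main(String[] args) {
        String cyrillic = "\u0410. Naumov."; // Cyrillic А (U+0410)
        String latin = "A. Naumov.";         // Latin A (U+0041)
        System.out.println(startsWithCyrillic(cyrillic)); // true
        System.out.println(startsWithCyrillic(latin));    // false
        // NFKC does not fold homoglyphs across scripts, so Unicode
        // normalisation by itself would not unify these two pairs.
        String folded = Normalizer.normalize(cyrillic, Normalizer.Form.NFKC);
        System.out.println(folded.equals(latin)); // false
    }
}
```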

Other than that, we are generating the files to create a full model for benchmarking.

hmcc commented 6 years ago

Sorry for all the nitpicking.

What IDE are you using? In Eclipse, nearly everything I've mentioned shows up as a warning. I'm a big fan of Ctrl+Alt+O to "organise" imports (remove unused and alphabetise the remainder) - your IDE should have an equivalent.

I am loving all the red though! Definitely worth doing, and hopefully a step towards getting more in line with upstream too 👍

stathischaritos commented 6 years ago

@hmcc I fixed some of the notes, will do the rest after lunch :) Nitpicking is good, it helps me learn the code better!

hmcc commented 6 years ago

Can we/should we be using Maven Central?

EDIT: sorry, I see we are using Maven Central for de.tudarmstadt.ukp.wikipedia itself, the additional repo is for dependencies of de.tudarmstadt.ukp.wikipedia, right?

stathischaritos commented 6 years ago

We need the de.tudarmstadt.ukp.wikipedia repository to get the 1.2.0-SNAPSHOT version; otherwise we get 1.1.0.
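
For anyone following along, a minimal sketch of the pom.xml entries being discussed; the repository id and URL are placeholders, and the JWPL parser module is used as an example artifact:

```xml
<!-- Sketch only: the repository id and URL below are placeholders,
     not copied from this PR's actual pom.xml -->
<repositories>
  <repository>
    <id>ukp-oss-snapshots</id>
    <url>https://example.org/ukp-oss-snapshots</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<!-- Maven Central only serves the 1.1.0 release; the extra snapshot
     repository is what resolves 1.2.0-SNAPSHOT -->
<dependency>
  <groupId>de.tudarmstadt.ukp.wikipedia</groupId>
  <artifactId>de.tudarmstadt.ukp.wikipedia.parser</artifactId>
  <version>1.2.0-SNAPSHOT</version>
</dependency>
```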

hmcc commented 6 years ago

Ah, OK... in that case why SNAPSHOT and not the latest stable version (the version that upstream uses)?

hmcc commented 6 years ago

Much better!

Still to do:

tgalery commented 6 years ago

@hmcc I can answer the first question: 1.2.0-SNAPSHOT is the most recent and supported version of the library. I think for some of them 1.1.0 might not even be available in the repositories anymore.

hmcc commented 6 years ago

1.2.0-SNAPSHOT is the most recent and supported version of the library

I don't know this library, or the developers, but that's not how it appears to me. 1.2.0 isn't mentioned on the releases page. 1.1.0 is reasonably recent (2016) and all the 1.1.0 dependencies we use are in Maven Central.

Unless you know something I don't, or we need a feature in one of the 35 commits since this release, I'd be very wary of depending on what looks to be a nightly build. Not saying we shouldn't do it, but I'd like to understand the reason a bit better.

hmcc commented 6 years ago

Nearly there @stathischaritos with the nitpicks:

stathischaritos commented 6 years ago

CONLL Dataset on PR branch

precision: 0.45492908358573914
recall: 0.5367404474836542
fScore: 0.49246010338508955

CONLL Dataset on Dev branch

precision: 0.4490344524383545
recall: 0.5411468330134357
fScore: 0.4908062299724581

CSAW Dataset on PR branch

precision: 0.6596273183822632
recall: 0.32491968793024323
fScore: 0.4353797250204602

CSAW Dataset on Dev branch

precision: 0.662124514579773
recall: 0.32706134312375706
fScore: 0.4378455906281876
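
For reference, the fScore lines above are the harmonic mean of the corresponding precision and recall; a minimal sketch (class name made up) that reproduces the first one:

```java
// Sketch only: fScore as the harmonic mean of precision and recall,
// F1 = 2PR / (P + R).
public class FScoreCheck {
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // CONLL / PR branch numbers from above; prints ~0.4924601,
        // matching the reported fScore
        System.out.println(f1(0.45492908358573914, 0.5367404474836542));
    }
}
```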

stathischaritos commented 6 years ago

CONLL Dataset on updated PR branch

precision: 0.4449195861816406
recall: 0.5491842610364683
fScore: 0.49158412341242286

CSAW Dataset on updated PR branch

precision: 0.657878577709198
recall: 0.3282851460914793
fScore: 0.4380038725443337

I still see some jpgs left in the surface forms; grep counted 160 of them :(. For the previous PR model that count was ~8.5K, and for the dev model it's around 100.
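
A minimal sketch of that kind of count (the pairCounts file name and tab-separated column layout are assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Sketch only: count lines whose surface-form column still mentions ".jpg".
public class JpgLeftovers {
    public static void main(String[] args) throws IOException {
        long count = Files.lines(Paths.get("pairCounts"))
                .map(line -> line.split("\t")[0]) // surface form column
                .filter(sf -> sf.toLowerCase(Locale.ROOT).contains(".jpg"))
                .count();
        System.out.println(count + " surface forms still mention .jpg");
    }
}
```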

hmcc commented 6 years ago

squash please