Improve wikidata matching

acka47 commented 6 years ago

Follow-up to #130. Improve the results of the wikidata matching.

Here is an overview over the current status (source):

Wievielen Titeln mit Ortsnennung konnten keine Wikidata-Id zugeordnet werden?
- 51721
Wieviele distinkte Orte sind das?
- 3128
- Liste
Wieviele Orte habe eine WD-Id mit einem score >4 bekommen?
- 259.456
Wieviele distinkte Orte sind das?
- 4975
- Liste (mit Kontext)
Bei wievielen gab es eine erfolgreiche Fallback-Anfrage?
- 2980
Wieviele distinkte Orte sind das?
- 2966
- Liste

acka47 commented 6 years ago

As discussed offline:

[x] We won't query the Wikidata API anymore as fallback because all resources we need should be in the set covered by the SPARQL query. If they are not in there, something has to be corrected at Wikidata. (With this, we also won't have distinct Wikidata sets for the lobid data and the visualized classification entry.)
[x] We will add the label from the superordinated administrative (P131.rdfs:label""@de ) to the ES geo index and this will lead to matches for a lot of strings in currently not matched (e.g. "Ahaus Ottenstein" & ~Q615951~ Q2443 or Q1670041.

dr0i commented 6 years ago

Not having enough time left yesterday I did the following:

build a new geo-index with only SPARQL-query entities [1]
remove fallback
reduce minimum score threshold to a very low score

Resources are indexing at the moment.

[1] Here is a list of WD-entities lacking geo-coordinates: https://gist.github.com/dr0i/e9bb9e1083a56ebaef602ec60991c458#file-missinggeocoordswikidata

acka47 commented 6 years ago

[1] Here is a list of WD-entities lacking geo-coordinates: https://gist.github.com/dr0i/e9bb9e1083a56ebaef602ec60991c458#file-missinggeocoordswikidata

I assume this is the list of Wikidata entities without geo coordinates that are matched with NWBib location strings? Because if I adjust the SPARQL query we use to build the index to only deliver entries without coordinates I get much more items: http://tinyurl.com/ybmnpppc

dr0i commented 6 years ago

I assume this is the list of Wikidata entities without geo coordinates that are matched with NWBib location strings

~yes~ ups, no: I got these when creating the geo index, which is done as a preprocess for the matching. Your SPARQ got more hits because the preprocessing program enounters at some point a connection timeout when lookups of the entities are done, which I assume is a result of "over-heating" wikidata (will implement a little "sleep" between the lookups). However... Agreed offline with @acka47 to fill the geo index with entities even if these don't have any geo coordinates. Then we will be able to identify the resources without geo-coordinates which are indeed used by our dataset .

dr0i commented 6 years ago

Deployed to staging. @acka47 I am interested in your quick judgement about the new geo enrichment. This query reveals 289.672 hits, it's not so bad yes?: https://stage.lobid.org/resources/search?q=spatial.geo.lat:*

acka47 commented 6 years ago

Yes, this looks very good! I checked out around 15 random entries and only found one that wasn't good: https://stage.lobid.org/resources/HT013056531.json

[x] There, Münster (Westfalen) Hauptbahnhof appears under spatial which is strange as this wikidata item should not be part of the "geo" index at all.

Looking at the examples from https://github.com/hbz/lobid-resources/issues/130#issuecomment-315068388, most of the problems still exist:

[x] http://stage.lobid.org/resources/HT017034736.json: "Monschauer Land" now matched with "Rheindalen-Land"
[x] http://stage.lobid.org/resources/HT017068756.json: "Petershagen <Minden-Lübbecke>" matched to "Kreis Minden-Lübbecke"
[x] http://stage.lobid.org/resources/HT017125573.json: Wuppertal-Barmen" matched with "Barmen (Jülich)"
[x] http://stage.lobid.org/resources/HT016189278.json: Pyrmont <Grafschaft> matched with Grafschaft
[x] http://stage.lobid.org/resources/HT003751388.json: "Eslohe (Sauerland)" matched with "Sauerland"

These examples make clear that somehow there is still a fallback because we have wikidata items linkec that can not be part of the index built based on the SPARQL query from https://github.com/hbz/lobid-resources/issues/130#issuecomment-331440115.

dr0i commented 6 years ago

Note: this is on production now. We will improve it but this was a necessary fix for problems arising through the slowliness of the WD-fallback-lookup. Also, the old fallback-query filled the geo-nwib-index with data coming from the fallback-API-query while we want to have a database solely grounded on the SPARQ.

acka47 commented 6 years ago

Just took a look at staging.

Obviously, the SPARQL query for creating the lookup index hasn't changed as "Westfalen" is still in there, see http://staging.lobid.org/resources/HT008353259.json

Furthermore, what I said in a mail on 2017-10-16 still holds:

(es) sind bei den ersten zehn Treffern fast alle Matches inkorrekt. Der Score muss also wieder deutlich höher, siehe z.B. http://test.lobid.org/resources/HT006934472.json, wo zwei Strings auf "Düsseldorf" gemactcht werden, obwohl es mindestens für den einen (Wittlaer) einen eigenen Eintrag gibt, der auch bei der SPARQL-Query mit abgedeckt sein müsste (https://www.wikidata.org/wiki/Q881495).

dr0i commented 6 years ago

Uh, the staging index is an old one, of last week, see http://gaia.hbz-nrw.de:9200/_plugin/head/. Obviously the new index failed to be build. The SPARQ seems to be OK, though, because the geo-index doesn't hold the westfalen-entity anymore: curl http://gaia.hbz-nrw.de:9200/geo_nwbib/_search?q=Q8614

dr0i commented 6 years ago

Deployed to production. I made a tickable checklist of the problematic matches. Two are solved, four remain.

acka47 commented 6 years ago

We now manually created a test set with correct matches.

There are several entries in the "coverage" field that don't refer to administrative areas (which we only pull from Wikidata. Examples are "Monschauer Land", "Grafschaft Pyrmont", "Sternberg ", "Sternberg ". At least some of them should rather be included in the spatial classification under "Landschaften", "Historische Territorien". I will contact the NWBib editors regarding this as soon as we are satisfied with the general matching.

acka47 commented 6 years ago

Regarding the non-administrative entities, this is no surprise as we also put the Gliedernde schlagwörter from other notations but 96,97,99 into "coverage". To exclude non-administrative areas, we should probably just remove all the other notations from the morph (currently line 2655).

acka47 commented 6 years ago

I updated line 24 in http://etherpad.lobid.org/p/geo-testset. I couldn't find the test matches on GitHub otherwise I would have updated it myself. Can you put them on here somewhere so we don't need the etherpad anymore?

dr0i commented 6 years ago

Discussed offline with @acka47 that, as the geo-cache is only a temporary work and will be unnecessary when QIDs are cataloged into aleph db, there is no need to publish the scripts.

dr0i commented 6 years ago

@acka47 I tickedHT017068756.json: "Petershagen <Minden-Lübbecke>" and HT017125573.json: Wuppertal-Barmen" matched with "Barmen (Jülich)" as they are good now.

dr0i commented 6 years ago

See https://gist.github.com/dr0i/162c8ef05def3058d69b5d0762e40c50 for a CSV with coverage-QIDs-Score<4 .

acka47 commented 6 years ago

I added some comments to the csv file and adjusted a lot of entries in Wikidata, see https://gist.github.com/acka47/2cd92ecacd718f8e2c3b96fcf79de733. The results should be much better with after the next matching cycle..

acka47 commented 6 years ago

@dr0i Could you please update https://gist.github.com/dr0i/162c8ef05def3058d69b5d0762e40c50 with the new round data?

dr0i commented 6 years ago

http://stats.lobid.org/scoreCoverageCsv.csv

acka47 commented 6 years ago

I updated the https://gist.github.com/acka47/2cd92ecacd718f8e2c3b96fcf79de733 based on the new data. Will write an email to NWBib editors next week.

acka47 commented 6 years ago

I think the last thing to do to close this issue is #627.

acka47 commented 6 years ago

I now took a look at the current matching results at http://stats.lobid.org/scoreCoverageCsv_alleWerte.csv and found some issues we probably can solve with adjusting the boosting:

[ ] "Kreis Düsseldorf" is matched with Q1787267 (Kreis Düsseldorf-Mettmann) instead of Q1803134 (Landkreis Düsseldorf )
[x] "Steinfurt Westfalen" is matched with Q181609 (Lengerich in Kreis Steinfurt) instead of Q16018 (Steinfurt)
[x] "Rhein Erft Kreis" is matched to Q3989 (Bergheim) instead of Q6292 (Rhein-Erft-Kreis )
[ ] "St Augustin" is matched to Q2222154 ("Sankt Augustin-Ort") instead of Q4090
[x] "Bergisch Gladbach" is matched with Q15122735 (Neuenhaus (Bergisch Gladbach)) instead of Q3117 (Bergisch Gladbach) (War nicht im Geo-Index. Sollte nun gefixt sein.)

dr0i commented 6 years ago

"Steinfurt" has no spatial.type in the geo-nwbib-cache. This is unacceptable. Fix these: http://weywot5.hbz-nrw.de:9200/geo_nwbib/_search?q=NOT+_exists_:spatial.type.

dr0i commented 6 years ago

Improved geo-cache json building, e.g. Steinfurt now has a type, see http://weywot5.hbz-nrw.de:9200/geo_nwbib-20180730-1530/_search?q=Q16018. Using this to build full index, ready at thursday.

dr0i commented 6 years ago

Fixed "Steinfurt"-Syndrom, see e.g. http://lobid.org/resources/HT014058465.json. Also updated the statistics - you may want to have a look at it @acka47.

acka47 commented 5 years ago

There have been several cases where "ist ein Ortsteil" (Q253019) was deleted from Wikidata and instead "ist ein Stadtteil" (Q2983893) was added, e.g.:

Q2983893 might in fact be the better choice. To also get them into the geo index, I will update the query (and the matching configuration).

acka47 commented 5 years ago

Wesel was removed from the index because the type "Gemeinde in Deutschland" (Q262166) was removed (see diff) so that it only has type Q1548518 (Große kreisangehörige Stadt ) which is a subclass of a subclass (Q42744322) of Q262166. I will adjust the SPARQL query to get it back in.

acka47 commented 5 years ago

The matching is perfect by now. ;-) If there are any further adjustments to be made, it will probably be in Wikidata and otherwise we can open a new specific issue. Closing.

hbz / lobid-resources

Improve wikidata matching #585