Closed. acka47 closed this issue 5 years ago.
As discussed offline:
Add the P131 `rdfs:label`@de labels to the ES geo index; this will lead to matches for a lot of strings that are currently not matched (e.g. "Ahaus Ottenstein" → ~~Q615951~~ Q2443, or Q1670041). Not having enough time left yesterday, I did the following:
Resources are indexing at the moment.
[1] Here is a list of WD-entities lacking geo-coordinates: https://gist.github.com/dr0i/e9bb9e1083a56ebaef602ec60991c458#file-missinggeocoordswikidata
I assume this is the list of Wikidata entities without geo coordinates that are matched with NWBib location strings? Because if I adjust the SPARQL query we use to build the index to only deliver entries without coordinates I get much more items: http://tinyurl.com/ybmnpppc
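Conceptually, the list in question is the subset of indexed Wikidata items that carry no coordinate value. A minimal sketch of that filter step, using an illustrative item shape (the field names `qid` and `coords` are assumptions, not the actual index schema):

```python
# Sketch: keep only Wikidata items that lack geo coordinates.
# The fields "qid" and "coords" are illustrative, not the real schema.

def items_without_coords(items):
    """Return the items that have no coordinate value at all."""
    return [item for item in items if not item.get("coords")]

items = [
    {"qid": "Q2443", "coords": (52.08, 6.94)},  # has coordinates
    {"qid": "Q1670041", "coords": None},        # coordinate field empty
    {"qid": "Q615951"},                         # coordinate field missing
]

missing = items_without_coords(items)
print([item["qid"] for item in missing])  # → ['Q1670041', 'Q615951']
```

The same logic can run either in the SPARQL query itself (via `FILTER NOT EXISTS`) or as a post-processing step like this one; the discrepancy described above is about which of those two result sets the gist reflects.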
> I assume this is the list of Wikidata entities without geo coordinates that are matched with NWBib location strings
~~yes~~ oops, no: I got these when creating the geo index, which is done as a preprocessing step for the matching. Your SPARQL query got more hits because the preprocessing program encounters a connection timeout at some point while looking up the entities, which I assume is a result of "over-heating" Wikidata (I will implement a little "sleep" between the lookups). However... Agreed offline with @acka47 to fill the geo index with entities even if they don't have any geo coordinates. Then we will be able to identify which resources without geo coordinates are actually used by our dataset.
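The planned "sleep between lookups" is a simple client-side throttle. A sketch of what such a wrapper could look like (the delay value and the lookup callable are assumptions; a real lookup would call the Wikidata API):

```python
import time

def throttled(lookup, delay=0.5):
    """Wrap a lookup function so each call pauses afterwards,
    spacing out requests to avoid overloading the remote endpoint.
    The default delay of 0.5s is an assumed value."""
    def wrapper(qid):
        result = lookup(qid)
        time.sleep(delay)
        return result
    return wrapper

# Usage with a stand-in lookup that just records the requested QIDs:
fetched = []
lookup = throttled(lambda qid: fetched.append(qid) or qid, delay=0.01)
for qid in ["Q2443", "Q881495"]:
    lookup(qid)
print(fetched)  # → ['Q2443', 'Q881495']
```

A fixed sleep is the simplest option; exponential backoff on timeout would be a more robust variant of the same idea.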
Deployed to staging. @acka47 I am interested in your quick judgement of the new geo enrichment. This query yields 289,672 hits; not so bad, right?: https://stage.lobid.org/resources/search?q=spatial.geo.lat:*
Yes, this looks very good! I checked out around 15 random entries and only found one that wasn't good: https://stage.lobid.org/resources/HT013056531.json
Looking at the examples from https://github.com/hbz/lobid-resources/issues/130#issuecomment-315068388, most of the problems still exist:
Pyrmont <Grafschaft> matched with Grafschaft

These examples make clear that somehow there is still a fallback, because we have Wikidata items linked that cannot be part of the index built from the SPARQL query in https://github.com/hbz/lobid-resources/issues/130#issuecomment-331440115.
Note: this is on production now. We will improve it, but this was a necessary fix for problems arising from the slowness of the WD fallback lookup. Also, the old fallback query filled the geo-nwbib index with data coming from the fallback API query, while we want a database grounded solely on the SPARQL query.
Just took a look at staging.
Obviously, the SPARQL query for creating the lookup index hasn't changed, as "Westfalen" is still in there, see http://staging.lobid.org/resources/HT008353259.json
Furthermore, what I said in a mail on 2017-10-16 still holds:
> Almost all matches among the first ten hits are incorrect. The score threshold must therefore be significantly higher again; see e.g. http://test.lobid.org/resources/HT006934472.json, where two strings are matched to "Düsseldorf" although, at least for one of them (Wittlaer), there is a dedicated entry that should also be covered by the SPARQL query (https://www.wikidata.org/wiki/Q881495).
Uh, the staging index is an old one, from last week, see http://gaia.hbz-nrw.de:9200/_plugin/head/. Obviously the new index failed to be built.
The SPARQL query seems to be OK, though, because the geo index doesn't hold the Westfalen entity anymore: curl http://gaia.hbz-nrw.de:9200/geo_nwbib/_search?q=Q8614
Deployed to production. I made a tickable checklist of the problematic matches. Two are solved, four remain.
We now manually created a test set with correct matches.
There are several entries in the "coverage" field that don't refer to administrative areas (which are all we pull from Wikidata). Examples are "Monschauer Land", "Grafschaft Pyrmont", "Sternberg"
Regarding the non-administrative entities, this is no surprise, as we also put the gliedernde Schlagwörter (structuring subject headings) from notations other than 96, 97, 99 into "coverage". To exclude non-administrative areas, we should probably just remove all the other notations from the morph (currently line 2655).
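The proposed fix is a whitelist on the notation: only headings from notations 96, 97, and 99 would feed "coverage". A sketch of that filter (the heading data layout is illustrative; only the notation values come from the comment above):

```python
# Sketch: keep only subject headings whose NWBib notation marks a
# spatial/administrative group (96, 97, 99, per the comment above).
# The dict layout of a heading is an assumption for illustration.
SPATIAL_NOTATIONS = {"96", "97", "99"}

def spatial_headings(headings):
    """Drop headings from all other notations before filling 'coverage'."""
    return [h for h in headings if h["notation"] in SPATIAL_NOTATIONS]

headings = [
    {"label": "Düsseldorf", "notation": "97"},
    {"label": "Monschauer Land", "notation": "94"},
]
print(spatial_headings(headings))  # only the Düsseldorf entry remains
```

In the actual pipeline this would be expressed in the Metamorph definition rather than in code, but the selection logic is the same.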
I updated line 24 in http://etherpad.lobid.org/p/geo-testset. I couldn't find the test matches on GitHub, otherwise I would have updated them myself. Can you put them somewhere on here so we don't need the etherpad anymore?
Discussed offline with @acka47: as the geo cache is only a temporary workaround and will be unnecessary once QIDs are cataloged in the Aleph DB, there is no need to publish the scripts.
@acka47 I ticked

HT017068756.json: "Petershagen <Minden-Lübbecke>"

and

HT017125573.json: "Wuppertal-Barmen" matched with "Barmen (Jülich)"

as they are good now.
See https://gist.github.com/dr0i/162c8ef05def3058d69b5d0762e40c50 for a CSV of coverage/QID pairs with a score < 4.
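Filtering such a CSV down to the low-score rows is straightforward. A sketch, assuming column names `coverage`, `qid`, and `score` (the actual headers in the gist may differ):

```python
import csv
import io

# Sketch: extract the rows whose match score is below 4, mirroring
# the CSV linked above. Column names and sample data are assumptions.
raw = """coverage,qid,score
Wittlaer,Q881495,3.2
Düsseldorf,Q1718,8.5
"""

low = [row for row in csv.DictReader(io.StringIO(raw))
       if float(row["score"]) < 4]
print([row["coverage"] for row in low])  # → ['Wittlaer']
```

In practice the reader would be fed the downloaded gist file instead of an inline string.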
I added some comments to the CSV file and adjusted a lot of entries in Wikidata, see https://gist.github.com/acka47/2cd92ecacd718f8e2c3b96fcf79de733. The results should be much better after the next matching cycle.
@dr0i Could you please update https://gist.github.com/dr0i/162c8ef05def3058d69b5d0762e40c50 with the data from the new round?
I updated https://gist.github.com/acka47/2cd92ecacd718f8e2c3b96fcf79de733 based on the new data. Will write an email to the NWBib editors next week.
I think the last thing to do to close this issue is #627.
I now took a look at the current matching results at http://stats.lobid.org/scoreCoverageCsv_alleWerte.csv and found some issues we probably can solve with adjusting the boosting:
"Steinfurt" has no spatial.type in the geo-nwbib cache. This is unacceptable. Fix these: http://weywot5.hbz-nrw.de:9200/geo_nwbib/_search?q=NOT+_exists_:spatial.type.
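The `NOT _exists_:spatial.type` query asks Elasticsearch for cache documents missing that field. The same check done client-side looks like this (the document shape is illustrative, not the actual `geo_nwbib` mapping):

```python
# Sketch of the "NOT _exists_:spatial.type" check done client-side:
# flag cache entries whose spatial object carries no "type" value.
# The document shape is an assumption, not the real index mapping.

def missing_spatial_type(docs):
    return [d for d in docs if "type" not in d.get("spatial", {})]

docs = [
    {"label": "Steinfurt", "spatial": {"lat": 52.15}},               # broken
    {"label": "Ahaus", "spatial": {"lat": 52.08, "type": "Q262166"}},  # ok
]
print([d["label"] for d in missing_spatial_type(docs)])  # → ['Steinfurt']
```

Running the ES query directly is of course cheaper; the sketch just makes the predicate explicit.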
Improved the geo-cache JSON building; e.g. Steinfurt now has a type, see http://weywot5.hbz-nrw.de:9200/geo_nwbib-20180730-1530/_search?q=Q16018. Using this to build the full index, ready on Thursday.
Fixed the "Steinfurt" syndrome, see e.g. http://lobid.org/resources/HT014058465.json. Also updated the statistics; you may want to have a look, @acka47.
There have been several cases where "is an Ortsteil" (Q253019) was deleted from Wikidata and "is a Stadtteil" (Q2983893) was added instead, e.g.:
Q2983893 might in fact be the better choice. To also get them into the geo index, I will update the query (and the matching configuration).
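Accepting both item types when building the geo index amounts to matching against a set of QIDs instead of a single one. A sketch of that membership test (the item layout with an `instance_of` list is an assumption for illustration):

```python
# Sketch: accept items typed as either "Ortsteil" (Q253019) or
# "Stadtteil" (Q2983893) when building the geo index. The item
# layout with an "instance_of" list is illustrative.
ACCEPTED_TYPES = {"Q253019", "Q2983893"}

def in_geo_index(item):
    """True if any of the item's instance-of values is an accepted type."""
    return bool(ACCEPTED_TYPES & set(item.get("instance_of", [])))

print(in_geo_index({"qid": "Q163696", "instance_of": ["Q2983893"]}))  # → True
print(in_geo_index({"qid": "Q1", "instance_of": ["Q523"]}))           # → False
```

In the SPARQL query itself, the equivalent change would be a `VALUES` clause (or union) covering both QIDs.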
The matching is perfect by now. ;-) If there are any further adjustments to be made, it will probably be in Wikidata and otherwise we can open a new specific issue. Closing.
Follow-up to #130. Improve the results of the wikidata matching.
Here is an overview of the current status (source):