hbz / nwbib

Die Nordrhein-Westfälische Bibliographie
http://nwbib.de

Difference between Wikidata entries with NWBib ID and entries in nwbib-spatial #485

Closed: acka47 closed this issue 4 years ago

acka47 commented 4 years ago

As far as I can see, there are about 100 more entities in Wikidata with an NWBib ID (4421 as of now, see https://w.wiki/7aR) than entries in the NWBib spatial vocabulary ($ grep "skos:Concept" nwbib-spatial.ttl | wc -l returns 4324).
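One caveat with the grep count: a plain substring match on skos:Concept also counts any line that mentions skos:ConceptScheme, so the figure can be slightly too high. Below is a minimal Python sketch of a stricter count. It assumes each concept is declared on a single line containing the token skos:Concept; the sample lines and the function name are made up for illustration and are not taken from nwbib-spatial.ttl.

```python
import re

# Match "skos:Concept" only as a whole token, so that
# "skos:ConceptScheme" is not counted.
CONCEPT = re.compile(r"skos:Concept(?![A-Za-z0-9_])")


def count_concept_lines(lines):
    """Count lines that mention skos:Concept as a whole token."""
    return sum(1 for line in lines if CONCEPT.search(line))


# Hypothetical sample data, only for illustration.
sample = [
    "<https://nwbib.de/spatial> a skos:ConceptScheme .",
    "<https://nwbib.de/spatial#Q1> a skos:Concept ;",
    "<https://nwbib.de/spatial#Q2> a skos:Concept ;",
]

print(count_concept_lines(sample))  # 2
```

For the real file, the same function could be called on an open file handle, for example count_concept_lines(open("nwbib-spatial.ttl", encoding="utf-8")).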

We should investigate this. It should rather be the other way around, with more skos:Concepts in nwbib-spatial than entities with an NWBib ID in Wikidata, as there are a few concepts without a foaf:focus link to Wikidata.

Trying to compare the WD entities on the command line with sort, uniq and comm, I did not get good results. @fsteeg, please run your comparison script again.
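For reference, comm only gives meaningful output when both inputs are sorted the same way, and stray whitespace or duplicate lines can also skew the result. A set-based comparison sidesteps these problems. This is a minimal sketch, not the comparison script mentioned above; the function names and the sample IDs are hypothetical, and it assumes both sides have already been reduced to one Wikidata QID per line.

```python
def read_ids(lines):
    """Strip whitespace and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def compare_ids(wikidata_qids, skos_qids):
    """Return (only_in_wikidata, only_in_skos) as sorted lists."""
    wd, skos = set(wikidata_qids), set(skos_qids)
    return sorted(wd - skos), sorted(skos - wd)


# Hypothetical sample data with a trailing space and a duplicate.
wikidata = read_ids(["Q3 ", "Q1", "Q2", "Q2", ""])
skos = read_ids(["Q2", "Q4"])

only_in_wikidata, only_in_skos = compare_ids(wikidata, skos)
print(only_in_wikidata)  # ['Q1', 'Q3']
print(only_in_skos)      # ['Q4']
```

Using sets makes the result independent of input order and of duplicates, which are the usual stumbling blocks with sort, uniq and comm.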

acka47 commented 4 years ago

Looking at yesterday's efforts again, it seems that I actually was able to create a list of all Wikidata entries with P6814 (NWBib ID) that are not in the SKOS file, see https://gist.github.com/acka47/e6870164a0f7ef9d5772d0ee5f0e1827

fsteeg commented 4 years ago

Based on P6814 query: https://w.wiki/7c2

qid-p6814-missing-in-nwbib.csv.txt qid-p6814-missing-in-wiki.csv.txt

acka47 commented 4 years ago

I added the NWBib ID to the one Wikidata entry where it was missing. It was one place (Lippischer Wald) that we had just added to the matching process, after ingesting the NWBib IDs into Wikidata.

Furthermore, I looked at the 125 Wikidata entries with an NWBib ID that do not appear in the SKOS file. First, I checked whether they actually exist in lobid-resources. I tested 30 of them; here is the result:

  1. in lobid-resources: Q181825 Q249710 Q257888 Q258118 Q313969 Q815546 Q1158611 Q1326799 Q1368638 Q2567718 Q56398029 Q1803239 Q1787322 Q1778631 Q1628011 Q1301770 Q53577707 Q56398251 Q1368649 Q1231010 Q1786291 Q1368646 Q53907 Q1368642
  2. not in lobid-resources: Q151243 Q54803600 Q54803599 Q8249663 Q2559367 Q882647

All Wikidata entries from 2.) received their NWBib ID through the QuickStatements upload I did. So it does not look as if we have a problem with anyone arbitrarily adding NWBib IDs to Wikidata. I assume that NWBib editors removed the corresponding entries from the catalog.

acka47 commented 4 years ago

> I assume that NWBib editors removed the corresponding entries from the catalog.

There are also some Stadtbezirke in there that need to be part of the classification but have no hits. These are Q54803600 and Q54803599.

Another possibility is that our matching has problems. Looking at the examples, this seems to be the case for:

We should update the geo index and the SKOS file asap and take another look afterwards.

fsteeg commented 4 years ago

As discussed offline, most of the remaining missing items are caused by 0 hits in the catalog:

Q151243 Q1271768 ~Q1530668~ ~Q1607906~ Q11343 Q1821058 ~Q32054087~ Q1366743 Q882647 Q2559367 Q8249663

Two are part of a Gemarkung, which itself is not part of NWBib:

Q1301770
Q1760203

Finally one is a former Kreis:

Q1759911

acka47 commented 4 years ago

Re. Q1271768, which is Gangelt-Hastenrath: There exists another Hastenrath (Q1588790) which is already matched from coverage:hastenrath. Looking at the coverage entries, this mostly makes sense:

$ curl -s 'http://lobid.org/resources/search?q=coverage%3AHastenrath' | jq '.member[].coverage'
[
  "Hastenrath | 99"
]
[
  "Hastenrath <Eschweiler> | 99"
]
[
  "Eschweiler-Hastenrath | 99"
]
[
  "Süsterseel | 99",
  "Hastenrath | 99"
]
[
  "Eschweiler-Hastenrath | 99"
]
[
  "Hastenrath, Eschweiler | 99"
]
[
  "Eschweiler-Hastenrath | 99"
]
[
  "Hastenrath, Eschweiler | 99"
]
[
  "Hastenrath, Eschweiler | 99",
  "Scherpenseel, Eschweiler | 99"
]
[
  "Eschweiler-Scherpenseel | 99",
  "Eschweiler-Hastenrath | 99"
]
[
  "Scherpenseel, Eschweiler | 99",
  "Hastenrath, Eschweiler | 99"
]

Both entries with the coverage "Hastenrath | 99" refer to the other Hastenrath, though. I therefore added it to the manual matching list with https://github.com/hbz/lobid-resources/pull/1013/commits/bec0584debe59419d17b568855d973058ebadd5e.
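To see at a glance how the coverage labels are distributed, the jq output above can be tallied. This is a minimal Python sketch: the records are copied from the output shown above, the function name is hypothetical, and stripping everything after " | " is an assumption about the label format, not documented API behaviour.

```python
from collections import Counter


def tally_coverage(members):
    """Count coverage labels across records, ignoring the ' | 99' suffix."""
    counts = Counter()
    for record in members:
        for value in record.get("coverage", []):
            label = value.split(" | ")[0].strip()
            counts[label] += 1
    return counts


# Coverage values as returned by the query above, one dict per record.
members = [
    {"coverage": ["Hastenrath | 99"]},
    {"coverage": ["Hastenrath <Eschweiler> | 99"]},
    {"coverage": ["Eschweiler-Hastenrath | 99"]},
    {"coverage": ["Süsterseel | 99", "Hastenrath | 99"]},
    {"coverage": ["Eschweiler-Hastenrath | 99"]},
    {"coverage": ["Hastenrath, Eschweiler | 99"]},
    {"coverage": ["Eschweiler-Hastenrath | 99"]},
    {"coverage": ["Hastenrath, Eschweiler | 99"]},
    {"coverage": ["Hastenrath, Eschweiler | 99", "Scherpenseel, Eschweiler | 99"]},
    {"coverage": ["Eschweiler-Scherpenseel | 99", "Eschweiler-Hastenrath | 99"]},
    {"coverage": ["Scherpenseel, Eschweiler | 99", "Hastenrath, Eschweiler | 99"]},
]

counts = tally_coverage(members)
print(counts["Hastenrath"])  # 2, the two ambiguous entries
for label, n in counts.most_common():
    print(n, label)
```

Only the two records with the bare label "Hastenrath" are ambiguous; every other Hastenrath label names Eschweiler explicitly.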

acka47 commented 4 years ago

Re. Q1530668 (Wingenbach): I removed the NWBib ID from Wikidata because there is another Wingenbach (Q2584381), and the one entry with the respective coverage is already successfully matched to it: https://lobid.org/resources/HT016063885

acka47 commented 4 years ago

Re. Q1607906 (Herbeck): There is another Herbeck (Q55499627) that the entries with coverage:Herbeck are correctly linked to. Thus, I removed the NWBib ID from the Wikidata entry.

acka47 commented 4 years ago

Re. Q32054087: I also removed the NWBib ID because of https://www.wikidata.org/wiki/Q4082

acka47 commented 4 years ago

The other places from https://github.com/hbz/nwbib/issues/485#issuecomment-526158930 with 0 hits in the catalog will probably be fixed with https://github.com/hbz/lobid-resources/pull/1013.

fsteeg commented 4 years ago

Current state deployed to test: https://test.nwbib.de/spatial

acka47 commented 4 years ago

+1

fsteeg commented 4 years ago

Will redeploy to test after https://github.com/hbz/lobid-resources/pull/1013 is deployed.

acka47 commented 4 years ago

https://github.com/hbz/lobid-resources/pull/1013 is now deployed, please redeploy.

fsteeg commented 4 years ago

I still get a single missing entry with hits in the catalog: https://www.wikidata.org/wiki/Q11343

Is this due to a missing P131 (located in the administrative territorial entity)?

acka47 commented 4 years ago

> Is this due to a missing P131 (located in the administrative territorial entity)?

Added it: https://www.wikidata.org/w/index.php?title=Q11343&type=revision&diff=1007529025&oldid=993874056

fsteeg commented 4 years ago

Deployed to test: http://test.nwbib.de/spatial

Classification changes: https://github.com/hbz/lobid-vocabs/pull/97

It seems a previous workaround (for multiple P131, pick the last one) is not good enough: we're losing Regierungsbezirk Münster (https://www.wikidata.org/wiki/Q7920), probably due to its two P131 values. Note that it is missing in production too, probably because in master, multiple P131 are not yet handled at all (one of the original problems triggering this issue).
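The following Python sketch illustrates how a "pick the last P131" rule can drop an entry, and one conceivable alternative. It is not the project's actual code and not necessarily what happened to Q7920: the function names, the placeholder IDs QA and QB, and the idea of checking against a set of known classification entries are all assumptions made for illustration.

```python
def pick_last(p131_values, known):
    """Workaround as described above: always take the last P131 value.

    Returns None if that value is not part of the classification,
    which means the entry has no usable parent and gets lost.
    """
    parent = p131_values[-1] if p131_values else None
    return parent if parent in known else None


def pick_first_known(p131_values, known):
    """Hypothetical alternative: take the first P131 value that is
    itself part of the classification."""
    return next((qid for qid in p131_values if qid in known), None)


# Placeholder IDs: QA is in the classification, QB is not.
known = {"QA"}
p131_values = ["QA", "QB"]

print(pick_last(p131_values, known))         # None
print(pick_first_known(p131_values, known))  # QA
```

With two P131 values, the outcome of the last-value rule depends on the order in which the values are returned, whereas filtering against the known entries does not.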

fsteeg commented 4 years ago

Fixed missing Regierungsbezirke on test and production with https://github.com/hbz/lobid-vocabs/commit/d304bacd482ef5bae695e6bc72e6a95f32fa6994:

https://test.nwbib.de/spatial https://nwbib.de/spatial

Remaining tasks here:

acka47 commented 4 years ago

Re. Grafschaft Rietberg (Q457468): After starting to write an email to the NWBib editors, I decided it only makes sense to keep it under N74 instead of moving it to "48 Niederrheinisch-Westfälischer Reichskreis". Let's talk tomorrow about how to implement it.

fsteeg commented 4 years ago

@acka47 I think we're done here, see https://github.com/hbz/nwbib/issues/485#issuecomment-527439605

Grafschaft Rietberg is in too, by overriding the SKOS data (2 broader values, not actually incorrect) with the info from non-90s-qids.json (1 value) in the UI. The permanent real fix will come with #487.

See https://nwbib.de/spatial

fsteeg commented 4 years ago

As discussed offline, I rolled back overriding from non-90s-qids.json to avoid duplicates.

We're still done here, as we will resolve Grafschaft Rietberg in #487.

acka47 commented 4 years ago

Closing