hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Spatial indexing from QIDs and strings in parallel during the transition #998

dr0i closed this 4 years ago

dr0i commented 5 years ago

Result of hbz/nwbib#470:

New catalogued nwbib data will look like this:

https://nwbib.de/spatial#N05 Westfalen
https://nwbib.de/spatial#Q2742 Münster
https://nwbib.de/subjects#N844200 Denkmalpflege. Denkmalschutz

The spatial entries in the jsonld will be generated in two ways:

  1. the old way: if an entry in 7001n1 doesn't start with "https:", look up the literal in geo_nwbib and use the result to build the jsonld structure via ElasticsearchIndexer.java
  2. the new way: if it starts with "https:", look up the entry in the tsv map (consisting of all the entries in geo_nwbib) and build the data structure using the morph
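
The two branches can be sketched in Python as follows. This is only an illustration: the real logic lives in ElasticsearchIndexer.java and the morph, and the function name `resolve_spatial`, the two dictionaries, and their contents are hypothetical stand-ins for geo_nwbib and the tsv map. It also assumes that an entry may carry a label after the URI, as in the snippet above.

```python
from typing import Optional

# Hypothetical stand-ins for the real lookups (geo_nwbib and the tsv map).
GEO_NWBIB_BY_LABEL = {
    "Münster": {"id": "https://nwbib.de/spatial#Q2742", "label": "Münster"},
}
TSV_MAP_BY_URI = {
    "https://nwbib.de/spatial#Q2742": {
        "id": "https://nwbib.de/spatial#Q2742",
        "label": "Münster",
    },
}


def resolve_spatial(entry: str) -> Optional[dict]:
    """Return the spatial structure for one field entry, or None if unknown."""
    entry = entry.strip()
    if entry.startswith("https:"):
        # New way: the entry starts with a URI; use the first token as the key.
        uri = entry.split()[0]
        return TSV_MAP_BY_URI.get(uri)
    # Old way: the entry is a literal; look it up by its label.
    return GEO_NWBIB_BY_LABEL.get(entry)
```

Both branches return the same kind of structure, so the rest of the transformation would not need to know which cataloguing practice a record follows.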

@acka47 What to do with the coverage field, valued e.g. "Köln | 99"? See e.g. http://test.lobid.org/resources/HT015854197.json . I think this will be omitted when going with 2) (at least the snippet above doesn't reflect these entries, and I assume nobody will catalogue them anymore, right?), but it will be necessary when going with 1), which we must use in parallel for at least some time.

acka47 commented 5 years ago

> What to do with the coverage field, valued e.g. "Köln | 99"? See e.g. http://test.lobid.org/resources/HT015854197.json . I think this will be omitted when going with 2) (at least the snippet above doesn't reflect these entries, and I assume nobody will catalogue them anymore, right?), but it will be necessary when going with 1), which we must use in parallel for at least some time.

Exactly, if entries start with https, the coverage field won't be necessary anymore. (That will also be a good way to check for resources already updated in hbz01, which will then be those with _exists_:spatial AND NOT _exists_:coverage.)
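
As a rough illustration of that check, here is a small Python predicate that mirrors the query on a single JSON record. The helper names and sample records are made up for this sketch, and treating an empty list as "field does not exist" is an assumption about how the index handles such values.

```python
# The query string from the comment above, kept verbatim.
UPDATED_QUERY = "_exists_:spatial AND NOT _exists_:coverage"


def field_exists(record: dict, field: str) -> bool:
    """Approximate `_exists_`: the field is present with a non-empty value."""
    value = record.get(field)
    return value is not None and value != []


def is_updated(record: dict) -> bool:
    """True for records that have `spatial` but no `coverage`."""
    return field_exists(record, "spatial") and not field_exists(record, "coverage")
```

A record with both fields would still count as catalogued the old way, which matches the reasoning that coverage disappears once entries start with https.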

acka47 commented 5 years ago

Thus, we have to build a comprehensive tsv file covering all Wikidata entities needed so that we get data (coordinates & type) for the focus object. There are three different sources to gather the entries from:

  1. The SPARQL query for the geo index (/src/main/resources/getNwbibSubjectLocationsAsWikidataEntities.sparql)
  2. The hand-written string2qid list at /src/main/resources/string2wikidata.tsv that can now easily be created and updated with the SPARQL query from https://github.com/hbz/nwbib/issues/468#issuecomment-493189871
  3. 33 entries that were already part of the NWBib spatial classification and for which a focus statement has been added to the SKOS file, see https://github.com/hbz/lobid-vocabs/commit/2dbcd44af7968092a158d33c1392bac517e96802#diff-a7e9c11f735c098bc8b7bb474747fc93.

Next steps: add the 33 QIDs from 3.) to the SPARQL query for 2.), and adjust the SPARQL query for 1.) so that its results can be concatenated with the rest.

acka47 commented 5 years ago

I added the 34 (!) IDs from 3.) to the SPARQL query with eeade06. However, this will not be enough for adding the focus object: the data will contain something like https://nwbib.de/spatial#n03, while the corresponding QID to derive the focus information from is Q152243. We will have to think about a solution for this.

acka47 commented 5 years ago

I also adjusted /src/main/resources/getNwbibSubjectLocationsAsWikidataEntities.sparql with 5492362. Now the results of both SPARQL queries can easily be concatenated for one TSV file to be used in the transformation process.
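
The concatenation step could look roughly like this in Python. It is a sketch under assumptions: the function name `concat_tsv` is invented, the column layout of the sample rows is a placeholder, and dropping exact duplicate rows is a choice made here rather than something the thread prescribes.

```python
def concat_tsv(*sources: str) -> str:
    """Concatenate TSV texts, skipping blank lines and exact duplicate rows."""
    seen = set()
    lines = []
    for text in sources:
        for line in text.splitlines():
            if not line.strip() or line in seen:
                continue
            seen.add(line)
            lines.append(line)
    # One row per line, each terminated by a newline; empty input gives "".
    return "".join(line + "\n" for line in lines)


# Placeholder rows standing in for the results of the two SPARQL queries.
geo_index_rows = "Q2742\tMünster\nQ2100\tDuisburg\n"
string2qid_rows = "Q2100\tDuisburg\nQ2066\tEssen\n"
combined = concat_tsv(geo_index_rows, string2qid_rows)
```

Keeping the first occurrence of a row preserves the order of the first source, so the output stays stable when the second query result changes.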

acka47 commented 5 years ago

> something like this will be in the data: https://nwbib.de/spatial#n03 while the corresponding QID to derive the focus information from is Q152243. We will have to think about a solution for this.

I will add a map from URI to QID so that @dr0i can look it up.

acka47 commented 5 years ago

Here is the map. Note that notation 70 has two WD entities as focus.

https://nwbib.de/spatial#N1    Q1198
https://nwbib.de/spatial#N3    Q152243
https://nwbib.de/spatial#N5    Q8614
https://nwbib.de/spatial#N10    Q462011
https://nwbib.de/spatial#N12    Q72931
https://nwbib.de/spatial#N13    Q2036208
https://nwbib.de/spatial#N14    Q4194
https://nwbib.de/spatial#N16    Q580471
https://nwbib.de/spatial#N18    Q881875
https://nwbib.de/spatial#N20    Q151993
https://nwbib.de/spatial#N22    Q153464
https://nwbib.de/spatial#N24    Q445609
https://nwbib.de/spatial#N28    Q152356
https://nwbib.de/spatial#N32    Q1380992
https://nwbib.de/spatial#N33    Q1381014
https://nwbib.de/spatial#N34    Q1413205
https://nwbib.de/spatial#N42    Q7904317
https://nwbib.de/spatial#N44    Q836937
https://nwbib.de/spatial#N45    Q641138
https://nwbib.de/spatial#N46    Q249428
https://nwbib.de/spatial#N47    Q152420
https://nwbib.de/spatial#N48    Q708742
https://nwbib.de/spatial#N57    Q698162
https://nwbib.de/spatial#N62    Q657241
https://nwbib.de/spatial#N63    Q649192
https://nwbib.de/spatial#N64    Q650645
https://nwbib.de/spatial#N65    Q697254
https://nwbib.de/spatial#N66    Q514557
https://nwbib.de/spatial#N68    Q700198
https://nwbib.de/spatial#N69    Q573290
https://nwbib.de/spatial#N70    Q14551680,Q835382
https://nwbib.de/spatial#N76    Q153943
https://nwbib.de/spatial#N77    Q829718
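
Because notation 70 maps to two entities, a lookup built from this map needs a list of QIDs per URI. A minimal Python sketch of reading the map above, assuming whitespace-separated columns and comma-separated QIDs (the function name `parse_uri_to_qids` is hypothetical):

```python
def parse_uri_to_qids(text: str) -> dict:
    """Parse 'URI <whitespace> QID[,QID...]' lines into {uri: [qid, ...]}."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        uri, qids = parts
        mapping[uri] = qids.split(",")
    return mapping


sample = (
    "https://nwbib.de/spatial#N3    Q152243\n"
    "https://nwbib.de/spatial#N70    Q14551680,Q835382\n"
)
uri_to_qids = parse_uri_to_qids(sample)
```

Returning a list for every URI, even when there is only one QID, keeps the calling code free of special cases for notation 70.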

dr0i commented 5 years ago

@acka47 I took your example MABXml into the test. This is the latest result: https://gist.github.com/dr0i/c649f05af0cc32aa5baec4b3c04871d2. Re: subjects - please have a look, I didn't do anything in the morph for this, but it looks good, doesn't it?

acka47 commented 5 years ago

+1 Everything looks good. As discussed offline, we will have to think about the broader regions (e.g. Eifel, Weserbergland, or Nordrhein-Westfalen itself) that have one geo point attached. It does not make sense to use those geo coordinates for the "result map". So we have to think about a way to handle this (ignoring these coordinates or not storing them to begin with). For now, we will leave it as is.

dr0i commented 5 years ago

Deployed to production, closed.

acka47 commented 4 years ago

Reopening. From https://github.com/hbz/nwbib/issues/470#issuecomment-541045997:

The way URI plus string are stored in 700n will now work differently from what is described in https://github.com/hbz/nwbib/issues/470#issuecomment-483588151. The final version, with string and URI stored in separate subfields, is documented in the wiki; here is the example:

<datafield tag="700" ind1="n" ind2="1">
    <subfield code="a">Ruhrgebiet</subfield>
    <subfield code="0">https://nwbib.de/spatial#N20</subfield>
</datafield>
<datafield tag="700" ind1="n" ind2="1">
    <subfield code="a">Duisburg</subfield>
    <subfield code="0">https://nwbib.de/spatial#Q2100</subfield>
</datafield>
<datafield tag="700" ind1="n" ind2="1">
    <subfield code="a">Essen</subfield>
    <subfield code="0">https://nwbib.de/spatial#Q2066</subfield>
</datafield>
<datafield tag="700" ind1="n" ind2="1">
    <subfield code="a">Einzelne Autoren (Primärliteratur)</subfield>
    <subfield code="0">https://nwbib.de/subjects#N768010</subfield>
</datafield>

We will have to update the respective file in hbz01XmlClobs.tar.bz2 and adjust the morph accordingly. The new cataloging practice will begin by the end of November.
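
To make the new layout concrete, here is a hedged Python sketch that reads label and URI from the separate subfields. The actual transformation is done in the morph, not in Python; the function name, the wrapping `<record>` element, and the absence of XML namespaces are assumptions for this illustration. Since the same field also carries subject entries, the sketch keeps only URIs in the spatial namespace.

```python
import xml.etree.ElementTree as ET

SPATIAL_PREFIX = "https://nwbib.de/spatial#"


def spatial_entries(xml_text: str) -> list:
    """Collect {id, label} pairs from 700 n 1 datafields with a spatial URI."""
    root = ET.fromstring(xml_text)
    entries = []
    for field in root.iter("datafield"):
        key = (field.get("tag"), field.get("ind1"), field.get("ind2"))
        if key != ("700", "n", "1"):
            continue
        label = field.findtext("subfield[@code='a']")  # the string
        uri = field.findtext("subfield[@code='0']")    # the URI
        if uri and uri.startswith(SPATIAL_PREFIX):
            entries.append({"id": uri, "label": label})
    return entries


# Hypothetical wrapper element around two of the datafields shown above.
record = """<record>
<datafield tag="700" ind1="n" ind2="1">
    <subfield code="a">Ruhrgebiet</subfield>
    <subfield code="0">https://nwbib.de/spatial#N20</subfield>
</datafield>
<datafield tag="700" ind1="n" ind2="1">
    <subfield code="a">Einzelne Autoren (Primärliteratur)</subfield>
    <subfield code="0">https://nwbib.de/subjects#N768010</subfield>
</datafield>
</record>"""
result = spatial_entries(record)
```

With string and URI in separate subfields, the label no longer has to be split off the URI, which is the main simplification over the earlier single-value layout.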

dr0i commented 4 years ago

This should be resolved with #1036. There is no way to test this properly, since we only have a small constructed test case, but I am going to deploy it anyway to avoid conflicts in the morph when other work is done there. The issue should be reopened if real data comes in and behaves badly.