hbz / nwbib

Die Nordrhein-Westfälische Bibliographie
http://nwbib.de

Adjust geo-based queries #494

Open acka47 opened 5 years ago

acka47 commented 5 years ago

With https://github.com/hbz/lobid-resources/pull/1031, all bigger regions (NRW itself, Rheinland, Westfalen etc.) will also have geo coordinates. We have to at least adjust the geo-based queries on the home page.

acka47 commented 5 years ago

On the home page there are two ways of diving into the data:

  1. by clicking a "Kreis" or "kreisfreie Stadt" at https://nwbib.de/
  2. by clicking a "Gemeinde" or "kreisfreie Stadt" at https://nwbib.de/?map=gemeinden

The question is: Which types of places should be covered / not covered by those queries?

As a start, I checked which Wikidata types occur in the NWBib data and how often (using lobid-resources-staging to include focus data from https://github.com/hbz/lobid-resources/issues/1029), see https://gist.github.com/acka47/097b976da0c9eaab7679d9ad80f3e75e.

At this point, seeing 106 different types, I scrapped this whole approach (leaving it here for documentation, though) and just looked at the spatial classification to see which regions are too big and should be excluded from queries based on geo coordinates. I think a good rule of thumb is to exclude all regions on the second level of the concept scheme. They can easily be listed with this SPARQL query, run here from Python with rdflib:

import rdflib

# Load the NWBib spatial classification (SKOS, Turtle serialization).
g = rdflib.Graph()
g.parse("nwbib-spatial.ttl", format="turtle")

# Select every concept whose broader concept has no broader concept itself,
# i.e. the concepts directly below a top-level concept.
results = g.query("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?secondLevelConcept
    WHERE {
        ?secondLevelConcept a skos:Concept ;
            skos:broader ?topLevelConcept .
        FILTER NOT EXISTS { ?topLevelConcept skos:broader ?anything }
    }
""")

# Print one concept URI per line.
for row in results:
    print(row.secondLevelConcept)

The result:

https://nwbib.de/spatial#Q7920
https://nwbib.de/spatial#Q1787360
https://nwbib.de/spatial#N69
https://nwbib.de/spatial#N33
https://nwbib.de/spatial#N01
https://nwbib.de/spatial#Q1604680
https://nwbib.de/spatial#N72
https://nwbib.de/spatial#N74
https://nwbib.de/spatial#N57
https://nwbib.de/spatial#Q1787374
https://nwbib.de/spatial#Q251069
https://nwbib.de/spatial#N10
https://nwbib.de/spatial#N54
https://nwbib.de/spatial#Q7927
https://nwbib.de/spatial#N64
https://nwbib.de/spatial#N03
https://nwbib.de/spatial#N46
https://nwbib.de/spatial#N65
https://nwbib.de/spatial#N42
https://nwbib.de/spatial#Q313969
https://nwbib.de/spatial#N62
https://nwbib.de/spatial#Q457468
https://nwbib.de/spatial#N32
https://nwbib.de/spatial#N18
https://nwbib.de/spatial#N76
https://nwbib.de/spatial#N91
https://nwbib.de/spatial#Q1787376
https://nwbib.de/spatial#N28
https://nwbib.de/spatial#N16
https://nwbib.de/spatial#Q7924
https://nwbib.de/spatial#N34
https://nwbib.de/spatial#Q1803148
https://nwbib.de/spatial#N47
https://nwbib.de/spatial#Q7926
https://nwbib.de/spatial#N63
https://nwbib.de/spatial#N52
https://nwbib.de/spatial#Q1689034
https://nwbib.de/spatial#N43
https://nwbib.de/spatial#N45
https://nwbib.de/spatial#N48
https://nwbib.de/spatial#Q896929
https://nwbib.de/spatial#Q1787260
https://nwbib.de/spatial#Q1787322
https://nwbib.de/spatial#N68
https://nwbib.de/spatial#N13
https://nwbib.de/spatial#N36
https://nwbib.de/spatial#N44
https://nwbib.de/spatial#Q1110953
https://nwbib.de/spatial#N20
https://nwbib.de/spatial#N66
https://nwbib.de/spatial#N14
https://nwbib.de/spatial#N12
https://nwbib.de/spatial#N22
https://nwbib.de/spatial#Q1803239
https://nwbib.de/spatial#Q7923
https://nwbib.de/spatial#N05
https://nwbib.de/spatial#N24
https://nwbib.de/spatial#N77
https://nwbib.de/spatial#N70

To exclude these concepts from search, we have to add them to the respective queries like so:

_exists_:spatial AND NOT spatial.id:("https://nwbib.de/spatial#Q7920" OR "https://nwbib.de/spatial#Q1787360")
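
Since the full list has far more than two entries, the query string is probably easier to generate than to write by hand. A minimal sketch, assuming the URIs are already available as a Python list; the function name `build_exclusion_query` is made up for illustration and is not part of NWBib or lobid-resources:

```python
def build_exclusion_query(excluded_uris):
    """Return a query string that keeps records with a spatial entry but
    drops those tagged with any of the given concept URIs."""
    if not excluded_uris:
        # Nothing to exclude: only require that a spatial entry exists.
        return "_exists_:spatial"
    # Quote each URI and join them into one OR group.
    quoted = " OR ".join('"%s"' % uri for uri in excluded_uris)
    return "_exists_:spatial AND NOT spatial.id:(%s)" % quoted


# The first two URIs from the result list above.
second_level = [
    "https://nwbib.de/spatial#Q7920",
    "https://nwbib.de/spatial#Q1787360",
]
query = build_exclusion_query(second_level)
print(query)
```

With the two URIs shown, this produces the same string as the hand-written example.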

fsteeg commented 5 years ago

The problem with the filtering approach (whether based on types or on the actual coordinates) is that a normal query would exclude every hit with, e.g., https://nwbib.de/spatial#N01, even if that hit also had other spatial entries.
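
To make the effect concrete, here is a small simulation in plain Python that does not touch the real index. The three records and the second concept URI are illustrative placeholders, and the two functions only mimic the behaviour described above: a record-level `NOT` filter versus a check per spatial entry.

```python
EXCLUDED = {"https://nwbib.de/spatial#N01"}

# Hypothetical records: B has an excluded entry plus another place.
records = [
    {"id": "A", "spatial": ["https://nwbib.de/spatial#N01"]},
    {"id": "B", "spatial": ["https://nwbib.de/spatial#N01",
                            "https://nwbib.de/spatial#Q365"]},
    {"id": "C", "spatial": ["https://nwbib.de/spatial#Q365"]},
]


def record_level_filter(records, excluded):
    """Mimic `_exists_:spatial AND NOT spatial.id:(...)`: a record is
    dropped as soon as any one of its spatial entries is excluded."""
    return [r["id"] for r in records
            if r["spatial"] and not excluded.intersection(r["spatial"])]


def entry_level_filter(records, excluded):
    """Keep a record if at least one spatial entry is not excluded."""
    return [r["id"] for r in records if set(r["spatial"]) - excluded]


flat = record_level_filter(records, EXCLUDED)
per_entry = entry_level_filter(records, EXCLUDED)
print(flat)
print(per_entry)
```

Record B is the problematic case: the record-level filter drops it although it still has a second, more specific place.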

We could in theory set that up as a nested query in lobid-resources, but that would be quite complex and would restrict location queries in general. Or we'd have to add an option, further increasing complexity.

I think the most straightforward approach would be to leave out the geo field for entities that describe an area rather than a location. We could retain the other focus information, and thus keep the Wikidata links.
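
A rough sketch of what that could look like at transformation time. The record layout (`spatial` entries with a `focus` object holding `id` and `geo`), the placeholder Wikidata URIs, the coordinates, and the function name `strip_geo_for_areas` are all assumptions for illustration, not the actual lobid-resources data model or code:

```python
import copy

# Concepts treated as areas rather than locations (illustrative subset).
AREA_CONCEPTS = {"https://nwbib.de/spatial#N01"}


def strip_geo_for_areas(record, area_concepts):
    """Return a copy of the record in which area-type spatial entries
    lose their `geo` field but keep the rest of `focus`."""
    result = copy.deepcopy(record)
    for entry in result.get("spatial", []):
        if entry.get("id") in area_concepts:
            # Drop only the coordinates; the Wikidata link stays.
            entry.get("focus", {}).pop("geo", None)
    return result


# Hypothetical record with one area entry and one location entry.
record = {
    "spatial": [
        {"id": "https://nwbib.de/spatial#N01",
         "focus": {"id": "http://www.wikidata.org/entity/QXXXX",
                   "geo": {"lat": 51.0, "lon": 7.0}}},
        {"id": "https://nwbib.de/spatial#Q365",
         "focus": {"id": "http://www.wikidata.org/entity/QYYYY",
                   "geo": {"lat": 50.9, "lon": 6.9}}},
    ]
}
cleaned = strip_geo_for_areas(record, AREA_CONCEPTS)
print(cleaned["spatial"][0]["focus"])
```

Geo-based queries would then no longer match the area entry, while queries on `spatial.id` and the Wikidata link keep working.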