hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

Count GND IDs in hbz01 for UndifferentiatedPersons with variantName #222

Closed acka47 closed 4 years ago

acka47 commented 4 years ago

Requested by colleague I.G. in the context of the effort to remove undifferentiated persons from GND (see https://wiki.dnb.de/x/aJDOC for background).

  1. Get a URI list of all GND entries that are of type undifferentiated person and have at least one variant name: http://lobid.org/gnd/search?q=type%3AUndifferentiatedPerson+AND+_exists_%3AvariantName (~1 million entries)
  2. Write a script that runs each of these GND IDs against lobid-resources to find out whether it is linked somewhere, e.g. contribution.agent.id:"http://d-nb.info/gnd/103804528"
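
As a minimal sketch of how the query in step 2 could be turned into a request URL (Python 3, standard library only): the helper names `build_query` and `build_url` are illustrative, and the sketch assumes the ID list holds full GND URIs such as the one in the example above.

```python
from urllib.parse import urlencode

# Assumption: the public lobid-resources search endpoint is the target.
SEARCH_URL = 'https://lobid.org/resources/search'

def build_query(gnd_uri):
    """Return the query that matches resources linked to one GND URI."""
    # strip() drops the trailing newline left over from reading a file line
    return 'contribution.agent.id:"%s"' % gnd_uri.strip()

def build_url(gnd_uri):
    """Return the full search URL with the query percent-encoded."""
    return SEARCH_URL + '?' + urlencode({'q': build_query(gnd_uri)})

print(build_url('http://d-nb.info/gnd/103804528'))
```

Letting `urlencode` do the escaping also encodes the `:` and `/` inside the GND URI itself, so the quoted ID cannot be misread as part of the query syntax.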
acka47 commented 4 years ago
  1. $ curl --header "Accept: application/x-jsonlines" "http://lobid.org/gnd/search?q=type%3AUndifferentiatedPerson+AND+_exists_%3AvariantName&size=10" | jq -r .id > undifferentiated-with-variantName.txt
  2. 
    import requests
    import json

    filepath = 'undifferentiated-with-variantName.txt'

    fp = open(filepath)
    count = 0

    def build_url(id):
        return 'https://lobid.org/resources/search?q=contribution.agent.id%3A%22' + id.rstrip() + '%22'

    for id in fp.readlines():
        if requests.get(build_url(id)).json()['totalItems'] > 0:
            print(id)
            count += 1
    print(count)



The script is currently running. Will provide the number of IDs when finished.
acka47 commented 4 years ago

After ~18 hours the script stopped running and threw an error:

Traceback (most recent call last):
  File "gnd-ids-in-hbz01.py", line 13, in <module>
    if requests.get(build_url(id)).json()['totalItems'] > 0:
  File "/home/acka47/.local/lib/python2.7/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded 

I will modify it a bit and try again.
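
The traceback shows that one response body was not valid JSON (possibly an error page), which ended the whole run. One way to guard against that, sketched here in Python 3 with the standard library only, is to retry a few times and then skip the ID instead of crashing. The names `fetch_text` and `total_items` and the retry settings are assumptions for illustration, not part of the original script.

```python
import json
import time
from urllib.request import urlopen

def fetch_text(url):
    """Default fetcher: return the response body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode('utf-8')

def total_items(url, fetch=fetch_text, retries=3, pause=1.0):
    """Return the 'totalItems' count for url, or None if every attempt fails."""
    for attempt in range(retries):
        try:
            return json.loads(fetch(url))['totalItems']
        except (OSError, ValueError, KeyError, TypeError):
            # OSError covers network and HTTP errors; ValueError covers
            # bodies that are not JSON; KeyError/TypeError cover JSON
            # without the expected 'totalItems' field.
            if attempt + 1 < retries:
                time.sleep(pause)
    return None
```

In the loop over the ID file, a `None` result could be written to a separate list of IDs to re-check later, so a single bad response after many hours does not lose the counts collected so far.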

acka47 commented 4 years ago

I tweaked it a bit and am now running it directly against the ES index. It is much faster (~100 requests per second).

acka47 commented 4 years ago

Here is the updated script for step 2 (with obfuscated index URL):

import requests
import json

filepath = 'undifferentiated-with-variantName.txt'

fp = open(filepath)
total = 0
count = 0

def build_url(id):
    return 'index/_search?q=contribution.agent.id%3A%22' + id.rstrip() + '%22'

for id in fp.readlines():
    total += 1
    if requests.get(build_url(id)).json()['hits']['total'] > 0:
        count += 1
        print("%s IDs of %s are in hbz01." % (count, total))

Result: 502466 IDs of 975198 are in hbz01.
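
For reference, the share of linked IDs follows directly from the two numbers in the result line. A small sketch of the arithmetic (the variable names are illustrative):

```python
linked = 502466   # IDs found in hbz01, from the result above
total = 975198    # IDs checked, from the result above

unlinked = total - linked
share = 100.0 * linked / total

print("%d of %d IDs are linked (%.1f%%); %d are not." % (linked, total, share, unlinked))
```

So roughly half of the undifferentiated persons with a variant name are linked from hbz01, and 472732 are not.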

acka47 commented 4 years ago

Closing.