inspirehep / inspire

Official repo of the legacy INSPIRE-HEP overlay
http://projecthepinspire.net

bibcheck: new plugin to convert GRID IDs to ROR #458

Closed michamos closed 4 years ago

michamos commented 5 years ago

Signed-off-by: Micha Moskovic michamos@gmail.com

michamos commented 5 years ago

Tested on inspiretest, seems to be working correctly. The only thing to note is that in a handful of cases, the GRID we have on the institution record has been redirected to a new GRID following a merge of their records (e.g. https://grid.ac/institutes/grid.4461.7), which ROR doesn't handle yet.

michamos commented 4 years ago

The file is generated from the ROR dumps at https://github.com/ror-community/ror-api/tree/master/rorapi/data (I used 2019-05-06). After unzipping a dump, you get a ror.json file, which is a list of institutions. This can be transformed into the grid_to_ror.json mapping with the following small script:

import json

# Load the ROR dump: a list of institution records.
with open('Downloads/ror.json') as f:
    ror = json.load(f)

# Map each institution's preferred GRID ID to its ROR ID.
grid_to_ror = {}
for inst in ror:
    grid = inst.get('external_ids', {}).get('GRID', {}).get('preferred')
    if not grid:
        continue
    grid_to_ror[grid] = inst['id']

with open('grid_to_ror.json', 'w') as f:
    json.dump(grid_to_ror, f)

I generated this file locally as it was taking ages on the legacy machines (probably due to allocation inefficiencies in the antique Python version used there).

Should I add the file to the PR? Where should I put it, and how do I make sure it gets copied to the right place? Are the Makefiles used during deployment?

tsgit commented 4 years ago

good questions.

There currently isn't a good place to add that file. If there are no other anticipated uses for it, it logically belongs to the bibcheck plugin, and I think the way to go is to create a new subdirectory data (or similar) in bibcheck/, add the subdir to https://github.com/inspirehep/inspire/blob/master/bibcheck/Makefile#L7, and create a Makefile in the new subdir with the proper install rule.

It may also need an update of the fabfile for deployment -- but there are many instances where fabfile makes the wrong guess and I simply edit the deploy recipe as necessary during deployment.

Alternatively, skip all that. Assume the file is optional and might also be updated outside of the code base. Adjust the plugin to handle a missing or empty file gracefully instead of bringing bibcheck down, i.e. issue a warning instead of raising an IOError. Then simply put the file where you want it on the worker nodes.
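For illustration, the graceful handling suggested here could look roughly like the sketch below. It is not the plugin's actual code: the function name `load_grid_to_ror` and the `path` argument are hypothetical, and it assumes the mapping is the JSON object produced by the script above.

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def load_grid_to_ror(path):
    """Return the GRID -> ROR mapping, or an empty dict if it cannot be loaded.

    Hypothetical helper: a missing, unreadable, or malformed file only
    triggers a warning, so the caller can carry on without conversions.
    """
    if not os.path.isfile(path):
        logger.warning("GRID to ROR mapping not found at %s", path)
        return {}
    try:
        with open(path) as f:
            mapping = json.load(f)
    except (IOError, ValueError) as exc:
        # IOError: the file cannot be read; ValueError: it is not valid JSON
        # (this includes an empty file).
        logger.warning("Could not load GRID to ROR mapping from %s: %s", path, exc)
        return {}
    if not isinstance(mapping, dict):
        logger.warning("GRID to ROR mapping in %s is not a JSON object", path)
        return {}
    return mapping
```

With an empty mapping, every lookup simply misses, so the plugin would leave records untouched rather than abort the whole bibcheck run.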

michamos commented 4 years ago

@tsgit I updated it to gracefully handle the case when there is no file or the file is wrong.