inspirehep / rest-api-doc

Documentation of the INSPIRE REST API
https://inspirehep.net
Creative Commons Attribution Share Alike 4.0 International

Get affiliation directly #4

Closed restrepo closed 4 years ago

restrepo commented 4 years ago

In the old API, the author info included the affiliation information. For example, it was possible to obtain the authors belonging to a specific institution directly from the first API request.

Expected behavior: for a CMS paper with 2270 authors

>>> import pandas as pd
>>> df=pd.read_json('http://old.inspirehep.net/search?p=doi:10.1103/PhysRevLett.122.132001&of=recjson')
>>> df.authors.apply(lambda l: [d['full_name'] for d in l
...                             if d['affiliation'] == 'Antioquia U.'])

[Out]: [Mejia Guisao, Jhovanny, Ruiz Alvarez, José David]

Problem: Now I would need to make 2270 additional API requests to get the same answer.

It is worth noting that both the Crossref and Lens.org APIs have the 'affiliation' key directly in the author dictionary.

michamos commented 4 years ago

You definitely don't need to make 2270 additional API requests. All the data is there, but it is more nested than that: there can be several affiliations for every author, and each affiliation can contain more information than a string (see the schema docs for more info).

I'm not well versed in pandas so I don't know how it represents this kind of structure, but in plain Python you can do the following:

Python 3.8.3 (default, May 14 2020, 11:03:12) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import requests

In [2]: response = requests.get('https://inspirehep.net/api/doi/10.1103/PhysRevLett.122.132001')

In [3]: authors = response.json()['metadata']['authors']

In [4]: names = [author['full_name'] for author in authors if any(aff['value'] == 'Antioquia U.' for aff in author['affiliations'])]

In [5]: names
Out[5]: ['Mejia Guisao, Jhovanny', 'Ruiz Alvarez, José David']

restrepo commented 4 years ago

Sorry, I figured out that I was checking only recent entries that didn't have the affiliation info yet.
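
For entries like those, where some authors carry no affiliation info, the list comprehension above can be made tolerant of the missing data. This is only a minimal sketch: the helper name `authors_with_affiliation` and the sample records are hypothetical (hand-written to mimic the `full_name` / `affiliations` / `value` shape used above, not real API output), and it assumes that an author without affiliation info simply lacks the `affiliations` key.

```python
def authors_with_affiliation(authors, institution):
    """Return the full names of authors having at least one affiliation
    whose 'value' equals `institution`.

    Authors with no 'affiliations' key are skipped instead of raising
    a KeyError (assumed shape for entries without affiliation info).
    """
    return [
        author["full_name"]
        for author in authors
        if any(
            aff.get("value") == institution
            for aff in author.get("affiliations", [])
        )
    ]


# Hypothetical sample data, shaped like response.json()['metadata']['authors'].
sample_authors = [
    {"full_name": "Doe, Jane", "affiliations": [{"value": "Antioquia U."}]},
    {"full_name": "Roe, Richard",
     "affiliations": [{"value": "CERN"}, {"value": "Antioquia U."}]},
    {"full_name": "Bloggs, Joe", "affiliations": [{"value": "CERN"}]},
    {"full_name": "Public, John"},  # no affiliation info at all
]

print(authors_with_affiliation(sample_authors, "Antioquia U."))
# → ['Doe, Jane', 'Roe, Richard']
```

With a real record, the same helper would be called on `response.json()['metadata']['authors']` from the session above.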