Closed restrepo closed 4 years ago
You definitely don't need to make 2270 additional API requests. All the data is there but is more nested than that because there can be several affiliations for every author and each affiliation can contain more information than a string (see the schema docs for more info).
I'm not well versed in pandas so I don't know how it represents this kind of structure, but in plain Python you can do the following:
Python 3.8.3 (default, May 14 2020, 11:03:12)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import requests
In [2]: response = requests.get('https://inspirehep.net/api/doi/10.1103/PhysRevLett.122.132001')
In [3]: authors = response.json()['metadata']['authors']
In [4]: names = [author['full_name'] for author in authors if any(aff['value'] == 'Antioquia U.' for aff in author['affiliations'])]
In [5]: names
Out[5]: ['Mejia Guisao, Jhovanny', 'Ruiz Alvarez, José David']
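The one-liner above assumes every author entry has an `affiliations` key, which (as noted below) is not always the case for recent records. A minimal sketch of a more defensive version, factored into a reusable function (the helper name is mine, not part of any API):

```python
def authors_with_affiliation(authors, affiliation):
    """Return full names of authors with the given affiliation value.

    `authors` is the list under metadata.authors in an INSPIRE record.
    Author entries may lack an 'affiliations' key entirely, hence .get()
    with an empty-list default.
    """
    return [
        author['full_name']
        for author in authors
        if any(aff.get('value') == affiliation
               for aff in author.get('affiliations', []))
    ]

# Usage with a single API request (no per-author lookups):
# import requests
# record = requests.get(
#     'https://inspirehep.net/api/doi/10.1103/PhysRevLett.122.132001'
# ).json()
# authors_with_affiliation(record['metadata']['authors'], 'Antioquia U.')
```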
Sorry, I figured out that I was checking only recent entries that didn't have the affiliation info yet.
In the old API, the author info included the affiliation information, so it was possible to obtain the authors belonging to a specific institution directly in the first API request.
Expected behavior: for a CMS paper with 2270 authors
[Out]: ['Mejia Guisao, Jhovanny', 'Ruiz Alvarez, José David']
Problem: now I would need to make 2270 additional API requests to get the same answer.
It is worth noting that both the Crossref and Lens.org APIs include the 'affiliation' key directly in each author dictionary.
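For comparison, a rough sketch of how this lookup reads against Crossref, where (to my understanding of its REST API) each entry in the `author` list carries an `affiliation` key holding a list of `{'name': ...}` dicts. The helper name and the substring match are my own choices, not part of Crossref:

```python
def crossref_authors_at(authors, institution):
    """Filter a Crossref 'author' list by an affiliation-name substring.

    Crossref author entries carry the 'affiliation' key directly:
    a list of {'name': '...'} dicts, possibly empty.
    """
    return [
        f"{a.get('family', '')}, {a.get('given', '')}"
        for a in authors
        if any(institution in aff.get('name', '')
               for aff in a.get('affiliation', []))
    ]

# Hypothetical usage, again a single request:
# import requests
# msg = requests.get(
#     'https://api.crossref.org/works/10.1103/PhysRevLett.122.132001'
# ).json()['message']
# crossref_authors_at(msg['author'], 'Antioquia')
```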