Closed astrochun closed 3 weeks ago
While working on the issue, a new (related) bug appeared. The use of pop() changes the list items in place and, consequently, the class attribute self._head:
items = listify(self._head.get('author-group', []))
items = items.copy()
index_path = ['preferred-name', 'ce:indexed-name']
# Check for collaboration
keys = [k for x in items for k in list(x.keys())]
if "collaboration" in keys:
    collaboration = items.pop(-1)['collaboration']
This causes inconsistent behaviour of the authorgroup property when it is called more than once:
ab9 = AbstractRetrieval("2-s2.0-85097473741", view="FULL", refresh=30)
print(ab9.authorgroup[1].collaboration)
J-PARC-HI Collaboration
print(ab9.authorgroup[1].collaboration)
None
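The effect can be reproduced without network access. The following is a minimal sketch using a toy class, not pybliometrics itself: `ToyRecord`, `make_head`, both property names, and the author "Doe J." are hypothetical placeholders. It only assumes that the parsed JSON is cached on the instance and that the property works on the cached list directly.

```python
from typing import Optional


class ToyRecord:
    """Hypothetical stand-in for a class that caches parsed JSON in self._head."""

    def __init__(self, head: dict):
        self._head = head

    @property
    def collaboration_mutating(self) -> Optional[str]:
        # `items` is the very list stored in self._head, so pop() edits the cache.
        items = self._head.get("author-group", [])
        keys = [k for x in items for k in x]
        if "collaboration" in keys:
            return items.pop(-1)["collaboration"]["ce:indexed-name"]
        return None

    @property
    def collaboration_readonly(self) -> Optional[str]:
        # Read the last item by index instead of removing it.
        items = self._head.get("author-group", [])
        if items and "collaboration" in items[-1]:
            return items[-1]["collaboration"]["ce:indexed-name"]
        return None


def make_head() -> dict:
    # Placeholder data shaped like the structure discussed above.
    return {"author-group": [
        {"author": [{"ce:indexed-name": "Doe J."}]},
        {"collaboration": {"ce:indexed-name": "J-PARC-HI Collaboration"}},
    ]}


buggy = ToyRecord(make_head())
first, second = buggy.collaboration_mutating, buggy.collaboration_mutating
fixed = ToyRecord(make_head())
third, fourth = fixed.collaboration_readonly, fixed.collaboration_readonly

print(first, second)   # J-PARC-HI Collaboration None
print(third, fourth)   # J-PARC-HI Collaboration J-PARC-HI Collaboration
```

Reading by index (or slicing, which creates a new list) keeps the cached data untouched, so repeated calls return the same value.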
Thanks for reporting this @astrochun! We definitely need to fix this. But we need to make a call.
Using .pop() was definitely a bad choice; it's good you found this, Nils!
The assumption behind the current code is that the last item (and only the last) may be a collaboration, and if that is the case, all authors are part of the collaboration. That's the case with our current test case, which is https://www.sciencedirect.com/science/article/pii/S0375947420304176.
With https://iopscience.iop.org/article/10.1088/1741-4326/ad3fcd, the collaborations are clearly additional authors, and the other authors are not part of this collaboration. The json thus looks like this:
>>> items[-1]["collaboration"]
[{'@seq': '35', '@collaboration-instance-id': 'OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1', 'ce:text': 'MAST-U Team', 'ce:indexed-name': 'MAST-U Team'},
{'@seq': '36', '@collaboration-instance-id': 'OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1', 'ce:text': 'the ASDEX Upgrade Team', 'ce:indexed-name': 'the ASDEX Upgrade Team'},
{'@seq': '37', '@collaboration-instance-id': 'OB2BibRecID-956319644-2581e82369347a65d834e87163c1cb39-1', 'ce:text': 'the EUROfusion Tokamak Exploitation Team', 'ce:indexed-name': 'the EUROfusion Tokamak Exploitation Team'},
{'@seq': '38', '@collaboration-instance-id': 'OB2BibRecID-956319644-bee8a1ede8b251294e44cc11ba45756b-1', 'ce:text': 'JET Contributors', 'ce:indexed-name': 'JET Contributors'}]
Now the question is, which case is more common: collaboration as additional author, or all authors as part of the collaboration? From the Scopus API, both cases look the same.
If the former is more common, we need to change the code such that all collaborations enter as individual authors in out. This would mean moving the "Check for collaboration" step inside the loop starting on L80.
I think it is best not to assume that authors are part of collaborations. It bothers me to lump collaborations and author groups together the way Scopus does. (1) We could create a new property collaborations that returns a list of Collaboration named tuples:
[Collaboration(collaboration_id='OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1', indexed_name='MAST-U Team'),
Collaboration(collaboration_id='OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1', indexed_name='the ASDEX Upgrade Team'),
Collaboration(collaboration_id='OB2BibRecID-956319644-2581e82369347a65d834e87163c1cb39-1', indexed_name='the EUROfusion Tokamak Exploitation Team'),
Collaboration(collaboration_id='OB2BibRecID-956319644-bee8a1ede8b251294e44cc11ba45756b-1', indexed_name='JET Contributors')]
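A rough sketch of how option (1) might build such a list from the JSON shown earlier. The helper name `parse_collaborations` is hypothetical, and the handling of a single dict (as opposed to a list of dicts) is an assumption about how the API responds when there is only one collaboration.

```python
from collections import namedtuple

Collaboration = namedtuple("Collaboration", "collaboration_id indexed_name")


def parse_collaborations(raw):
    """Hypothetical helper: turn a dict or a list of dicts into named tuples."""
    if raw is None:
        return []
    if isinstance(raw, dict):
        # Assumption: a single collaboration may arrive as one dict.
        raw = [raw]
    return [
        Collaboration(
            collaboration_id=item.get("@collaboration-instance-id"),
            indexed_name=item.get("ce:indexed-name"),
        )
        for item in raw
    ]


# Two of the four entries from the response above.
raw = [
    {"@seq": "35",
     "@collaboration-instance-id": "OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1",
     "ce:text": "MAST-U Team", "ce:indexed-name": "MAST-U Team"},
    {"@seq": "38",
     "@collaboration-instance-id": "OB2BibRecID-956319644-bee8a1ede8b251294e44cc11ba45756b-1",
     "ce:text": "JET Contributors", "ce:indexed-name": "JET Contributors"},
]
collabs = parse_collaborations(raw)
for c in collabs:
    print(c)
```

The function builds new tuples and never modifies `raw`, so it can be called repeatedly on cached data.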
Or (2) we can just add the Collaboration named tuples to the authorgroup list:
[Author(affiliation_id=60000481, dptid=113851322, organization='Department of Physics and Astronomy, University of Padova', city='Padova', postalcode=None, addresspart=None, country='Italy', collaboration=['MAST-U Team', 'the ASDEX Upgrade Team', 'the EUROfusion Tokamak Exploitation Team', 'JET Contributors'], auid=36617535300, orcid='0000-0002-7928-4661', indexed_name='Piron L.', surname='Piron', given_name='L.'),
Author(affiliation_id=60000481, dptid=113851322, organization='Department of Physics and Astronomy, University of Padova', city='Padova', postalcode=None, addresspart=None, country='Italy', collaboration=['MAST-U Team', 'the ASDEX Upgrade Team', 'the EUROfusion Tokamak Exploitation Team', 'JET Contributors'], auid=35498717900, orcid=None, indexed_name='Martin P.', surname='Martin', given_name='P.'),
(...),
Collaboration(collaboration_id='OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1', indexed_name='MAST-U Team'),
Collaboration(collaboration_id='OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1', indexed_name='the ASDEX Upgrade Team'),
(...)]
In my experience, these collaborations are generally listed as an additional author. In fact, my understanding is that Scopus uses a list of authors in a mentioned paper (usually a footnote) to pull in the full author list.
The case I encountered had multiple teams/collaborations.
> In fact, my understanding is that Scopus uses a list of authors in a mentioned paper (usually a footnote) to pull in the full author list.
But this then means that Scopus pairs authors and collaborations, no?
I received an answer from Scopus (like many things at Scopus, the support team has deteriorated greatly in quality):
For the author's collaborators name under the appendix or the supplement is attached separately or re-directing to the other link or the pdf of the article, then the authors name or the group name cannot be captured in Scopus as per the coverage policy.
I guess this means that collaborations are meant to be separated. Often Scopus messes up the collaborations anyways and files them as independent authors.
Thus I prefer this solution:
Collaborations are added to the .authorgroup property. Technically, they are authors, so nothing wrong here. If Scopus provided collaborations in a separate field in their API, we'd use a different property as well. But it is a foundational principle of pybliometrics to only return data as is.
A new field should contain the @collaboration-instance-id, and field "indexed_name" should contain the ce:indexed-name. Thus we drop column "collaboration".
Could you do that please, @nils-herrmann?
Another training example (where AbstractRetrieval().authorgroup fails) is 2-s2.0-85044008512.
Sounds good! Just to clarify: are collaborations then represented with the Author named tuple, where collaboration_instance_id is None for authors?
[Author(affiliation_id=60000481, dptid=113851322, organization='Department of Physics and Astronomy, University of Padova', city='Padova', postalcode=None, addresspart=None, country='Italy', auid=36617535300, orcid='0000-0002-7928-4661', indexed_name='Piron L.', surname='Piron', given_name='L.', collaboration_id=None),
(...),
Collaboration(affiliation_id=None, dptid=None, organization=None, city= None, postalcode=None, addresspart=None, country=None, collaboration=None, auid=None, orcid=None, indexed_name='MAST-U Team', surname=None, collaboration_id='OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1'),
(...)]
Exactly!
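For illustration, one way this representation could be sketched is with a single named tuple whose fields all default to None. The field order follows the example above; `collaboration_as_author` is a hypothetical helper, and none of this is taken from the actual patch.

```python
from collections import namedtuple

fields = ("affiliation_id dptid organization city postalcode addresspart "
          "country auid orcid indexed_name surname given_name collaboration_id")
# Every field defaults to None, so entries only need to set what they have.
Author = namedtuple("Author", fields, defaults=(None,) * len(fields.split()))


def collaboration_as_author(item: dict) -> Author:
    """Hypothetical helper: map one collaboration dict to an Author entry."""
    return Author(
        indexed_name=item.get("ce:indexed-name"),
        collaboration_id=item.get("@collaboration-instance-id"),
    )


person = Author(affiliation_id=60000481, dptid=113851322, country="Italy",
                auid=36617535300, indexed_name="Piron L.",
                surname="Piron", given_name="L.")
team = collaboration_as_author({
    "@collaboration-instance-id": "OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1",
    "ce:indexed-name": "MAST-U Team",
})

group = [person, team]
# Collaborations can be separated again by checking collaboration_id.
collaborations_only = [a for a in group if a.collaboration_id is not None]
print(collaborations_only)
```

With this layout, callers who only want individual authors can filter on `collaboration_id is None`, while the list itself stays in the order Scopus returns.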
After it is released, I'll go ahead and test it out.
The patch is live, @astrochun - can you confirm that it solves your problem?
pybliometrics version: 3.5.1
Code to reproduce the bug:
In some cases there is more than one collaboration, and thus it can be a list of dicts
Error message: