Refactor AbstractRetrieval().collaboration to deal with multiple collaborations

astrochun commented 1 month ago

pybliometrics version: 3.5.1

Code to reproduce the bug:

from pybliometrics.scopus import AbstractRetrieval
result = AbstractRetrieval('10.1088/1741-4326/ad3fcd', id_type='doi', refresh=True, view='FULL')
result.authorgroup

In some cases there is more than one collaboration, so the collaboration entry can be a list of dicts rather than a single dict.

Error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 result.authorgroup

File ~/codes/PPPL/pppl-elink2411-metadata/venv/lib/python3.10/site-packages/pybliometrics/scopus/abstract_retrieval.py:102, in AbstractRetrieval.authorgroup(self)
     93         except KeyError:  # Collaboration
     94             given = au.get('ce:text')
     95         new = auth(affiliation_id=aff_id,
     96                    organization=org,
     97                    city=aff.get('city'),
     98                    dptid=dep_id,
     99                    postalcode=aff.get('postal-code'),
    100                    addresspart=aff.get('address-part'),
    101                    country=aff.get('country'),
--> 102                    collaboration=collaboration.get('ce:indexed-name'),
    103                    auid=int(au['@auid']),
    104                    orcid=au.get('@orcid'),
    105                    surname=au.get('ce:surname'),
    106                    given_name=given,
    107                    indexed_name=chained_get(au, index_path))
    108         out.append(new)
    109 return out or None

AttributeError: 'list' object has no attribute 'get'
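The failure does not need the API to reproduce: with several collaborations the value under 'collaboration' is a list of dicts, so calling .get() on it raises. A minimal sketch of handling both shapes, assuming a hypothetical helper as_list (not part of pybliometrics) and illustrative data modelled on the entries in this issue:

```python
def as_list(value):
    """Hypothetical helper: wrap a single dict in a list, pass lists through."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def collaboration_names(value):
    """Return the indexed names for one or many collaboration entries."""
    return [c.get('ce:indexed-name') for c in as_list(value)]


# One collaboration: Scopus returns a plain dict (illustrative data)
single = {'ce:text': 'J-PARC-HI Collaboration',
          'ce:indexed-name': 'J-PARC-HI Collaboration'}

# Several collaborations: Scopus returns a list of dicts (illustrative data)
multiple = [
    {'ce:text': 'MAST-U Team', 'ce:indexed-name': 'MAST-U Team'},
    {'ce:text': 'JET Contributors', 'ce:indexed-name': 'JET Contributors'},
]

names_single = collaboration_names(single)
names_multiple = collaboration_names(multiple)
```

Whether the names should end up on each author or as separate entries is a design question this sketch does not settle.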

nils-herrmann commented 1 month ago

While working on this issue, a new (related) bug appeared. The use of pop() modifies the list in place and, consequently, the instance attribute self._head:

items = listify(self._head.get('author-group', []))
items = items.copy()
index_path = ['preferred-name', 'ce:indexed-name']
# Check for collaboration
keys = [k for x in items for k in list(x.keys())]
if "collaboration" in keys:
    collaboration = items.pop(-1)['collaboration']

This makes the authorgroup property behave inconsistently when it is accessed more than once:

>>> ab9 = AbstractRetrieval("2-s2.0-85097473741", view="FULL", refresh=30)
>>> print(ab9.authorgroup[1].collaboration)
J-PARC-HI Collaboration
>>> print(ab9.authorgroup[1].collaboration)
None
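The effect can be shown in isolation. The sketch below uses a stand-in listify (assumed to return lists unchanged and wrap anything else) and made-up data; the function names are hypothetical. Popping from the stored list changes the result of the second call, while popping from a shallow copy does not:

```python
def listify(element):
    """Stand-in helper (assumption): return lists as-is, wrap anything else."""
    return element if isinstance(element, list) else [element]


def collaboration_mutating(head):
    items = listify(head.get('author-group', []))
    if any('collaboration' in x for x in items):
        # pop() removes the entry from head['author-group'] as well
        return items.pop(-1)['collaboration']
    return None


def collaboration_copying(head):
    # A shallow copy protects the stored list from pop()
    items = list(listify(head.get('author-group', [])))
    if any('collaboration' in x for x in items):
        return items.pop(-1)['collaboration']
    return None


def make_head():
    """Made-up data shaped like an author-group with one collaboration."""
    return {'author-group': [
        {'author': [{'@auid': '1'}]},
        {'collaboration': {'ce:indexed-name': 'J-PARC-HI Collaboration'}},
    ]}


head = make_head()
first = collaboration_mutating(head)
second = collaboration_mutating(head)   # the entry is already gone

head_copy = make_head()
third = collaboration_copying(head_copy)
fourth = collaboration_copying(head_copy)  # same result as the first call
```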

Michael-E-Rose commented 1 month ago

Thanks for reporting this @astrochun! We definitely need to fix this. But we need to make a call.

Using .pop() was definitely a bad choice; it's good you found this, Nils!

The assumption behind the current code is that the last item (and only the last) may be a collaboration, and if that is the case, all authors are part of the collaboration. That's the case with our current test case, which is https://www.sciencedirect.com/science/article/pii/S0375947420304176.

With https://iopscience.iop.org/article/10.1088/1741-4326/ad3fcd, the collaborations are clearly additional authors, and the other authors are not part of these collaborations. The JSON thus looks like this:

>>> items[-1]["collaboration"]
[{'@seq': '35', '@collaboration-instance-id': 'OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1', 'ce:text': 'MAST-U Team', 'ce:indexed-name': 'MAST-U Team'},
 {'@seq': '36', '@collaboration-instance-id': 'OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1', 'ce:text': 'the ASDEX Upgrade Team', 'ce:indexed-name': 'the ASDEX Upgrade Team'},
 {'@seq': '37', '@collaboration-instance-id': 'OB2BibRecID-956319644-2581e82369347a65d834e87163c1cb39-1', 'ce:text': 'the EUROfusion Tokamak Exploitation Team', 'ce:indexed-name': 'the EUROfusion Tokamak Exploitation Team'},
 {'@seq': '38', '@collaboration-instance-id': 'OB2BibRecID-956319644-bee8a1ede8b251294e44cc11ba45756b-1', 'ce:text': 'JET Contributors', 'ce:indexed-name': 'JET Contributors'}]

Now the question is, which case is more common: collaboration as additional author, or all authors as part of the collaboration? From the Scopus API, both cases look the same.

If the former is more common, we need to change the code such that all collaborations enter as individual authors in out. This would mean moving the "Check for collaboration" block inside the loop starting on L80.

nils-herrmann commented 1 month ago

I think it is best not to assume that authors are part of collaborations. I am uneasy about lumping collaborations together with author groups the way Scopus does. (1) We could create a new property collaborations that returns a list of Collaboration named tuples:

[Collaboration(collaboration_id='OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1', indexed_name='MAST-U Team'),
 Collaboration(collaboration_id='OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1', indexed_name='the ASDEX Upgrade Team'),
 Collaboration(collaboration_id='OB2BibRecID-956319644-2581e82369347a65d834e87163c1cb39-1', indexed_name='the EUROfusion Tokamak Exploitation Team'),
 Collaboration(collaboration_id='OB2BibRecID-956319644-bee8a1ede8b251294e44cc11ba45756b-1', indexed_name='JET Contributors')]

Or (2) we can just add the Collaboration named tuples to the authorgroup list:

[Author(affiliation_id=60000481, dptid=113851322, organization='Department of Physics and Astronomy, University of Padova', city='Padova', postalcode=None, addresspart=None, country='Italy', collaboration=['MAST-U Team', 'the ASDEX Upgrade Team', 'the EUROfusion Tokamak Exploitation Team', 'JET Contributors'], auid=36617535300, orcid='0000-0002-7928-4661', indexed_name='Piron L.', surname='Piron', given_name='L.'),
 Author(affiliation_id=60000481, dptid=113851322, organization='Department of Physics and Astronomy, University of Padova', city='Padova', postalcode=None, addresspart=None, country='Italy', collaboration=['MAST-U Team', 'the ASDEX Upgrade Team', 'the EUROfusion Tokamak Exploitation Team', 'JET Contributors'], auid=35498717900, orcid=None, indexed_name='Martin P.', surname='Martin', given_name='P.'),
 (...),
 Collaboration(collaboration_id='OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1', indexed_name='MAST-U Team'),
 Collaboration(collaboration_id='OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1', indexed_name='the ASDEX Upgrade Team'),
 (...)]
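Both options could build on the same parsing step. A rough sketch, assuming the two field names shown above and a hypothetical function parse_collaborations; the input is a shortened copy of the JSON quoted earlier in this thread:

```python
from collections import namedtuple

# Field names follow the proposal above; they are not an existing pybliometrics API
Collaboration = namedtuple('Collaboration', 'collaboration_id indexed_name')


def parse_collaborations(raw):
    """Hypothetical parser: accept a single dict or a list of dicts."""
    entries = raw if isinstance(raw, list) else [raw]
    return [Collaboration(collaboration_id=e.get('@collaboration-instance-id'),
                          indexed_name=e.get('ce:indexed-name'))
            for e in entries]


raw = [
    {'@seq': '35',
     '@collaboration-instance-id': 'OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1',
     'ce:text': 'MAST-U Team', 'ce:indexed-name': 'MAST-U Team'},
    {'@seq': '36',
     '@collaboration-instance-id': 'OB2BibRecID-956319644-a423363f5cba04c966158c7f0d0c9b76-1',
     'ce:text': 'the ASDEX Upgrade Team', 'ce:indexed-name': 'the ASDEX Upgrade Team'},
]

# Option (1): a separate list of collaborations
collaborations = parse_collaborations(raw)

# Option (2): the same tuples appended to an already parsed author list
authors = []  # placeholder for the Author named tuples
authorgroup = authors + collaborations
```

With option (2), consumers have to handle two tuple types in one list, which is the main trade-off against a separate property.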

astrochun commented 1 month ago

In my experience, these collaborations are generally listed as an additional author. In fact, my understanding is that Scopus uses a list of authors in a mentioned paper (usually a footnote) to pull in the full author list.

The case I encountered had multiple teams/collaborations.

Michael-E-Rose commented 1 month ago

> In fact, my understanding is that Scopus uses a list of authors in a mentioned paper (usually a footnote) to pull in the full author list.

But this then means that Scopus pairs authors and collaborations, no?

Michael-E-Rose commented 4 weeks ago

I received an answer from Scopus (like many things at Scopus, the support team has deteriorated greatly in quality):

> For the author's collaborators name under the appendix or the supplement is attached separately or re-directing to the other link or the pdf of the article, then the authors name or the group name cannot be captured in Scopus as per the coverage policy.

I guess this means that collaborations are meant to be separated. Often Scopus messes up the collaborations anyways and files them as independent authors.

Thus I prefer this solution:

Collaborations remain part of the .authorgroup property. Technically, they are authors, so nothing wrong here. If Scopus provided collaborations in a separate field in their API, we'd use a different property as well. But it is a foundational principle of pybliometrics to only return data as is.
For collaborations, insert a new field "collaboration_instance_id" to contain the @collaboration-instance-id, and field "indexed_name" should contain the ce:indexed-name. Thus we drop column "collaboration".

Could you do that please, @nils-herrmann? Another training example (where AbstractRetrieval().authorgroup fails) is 2-s2.0-85044008512.

nils-herrmann commented 4 weeks ago

Sounds good! Just to clarify: collaborations are then represented with the Author named tuple, where collaboration_instance_id is None for authors?

[Author(affiliation_id=60000481, dptid=113851322, organization='Department of Physics and Astronomy, University of Padova', city='Padova', postalcode=None, addresspart=None, country='Italy', auid=36617535300, orcid='0000-0002-7928-4661', indexed_name='Piron L.', surname='Piron', given_name='L.', collaboration_id=None),
 (...),
 Collaboration(affiliation_id=None, dptid=None, organization=None, city=None, postalcode=None, addresspart=None, country=None, collaboration=None, auid=None, orcid=None, indexed_name='MAST-U Team', surname=None, collaboration_id='OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1'),
 (...)]
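One way to get this shape is a single named tuple whose fields all default to None, so authors and collaborations fill in only what applies to them. A sketch under assumptions: the field order follows the Author example above, build_authorgroup is a hypothetical function, and the author input is a simplified flat dict rather than the real Scopus structure:

```python
from collections import namedtuple

fields = ('affiliation_id dptid organization city postalcode addresspart '
          'country auid orcid indexed_name surname given_name collaboration_id')
# Every field defaults to None (namedtuple defaults need Python 3.7+)
Author = namedtuple('Author', fields, defaults=(None,) * len(fields.split()))


def build_authorgroup(authors, collaborations):
    """Hypothetical builder: authors first, then collaborations as entries."""
    out = []
    for au in authors:
        out.append(Author(auid=int(au['@auid']),
                          indexed_name=au.get('ce:indexed-name'),
                          surname=au.get('ce:surname')))
    for co in collaborations:
        out.append(Author(indexed_name=co.get('ce:indexed-name'),
                          collaboration_id=co.get('@collaboration-instance-id')))
    return out or None


# Simplified, illustrative input
authors = [{'@auid': '36617535300', 'ce:indexed-name': 'Piron L.',
            'ce:surname': 'Piron'}]
collaborations = [{
    '@collaboration-instance-id': 'OB2BibRecID-956319644-041e7807f7ccdf10d4b034d796c85612-1',
    'ce:indexed-name': 'MAST-U Team'}]

group = build_authorgroup(authors, collaborations)
```

A collaboration entry can then be told apart from an author by checking whether collaboration_id is set.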

Michael-E-Rose commented 4 weeks ago

Exactly!

astrochun commented 3 weeks ago

Once it is released, I'll go ahead and test it out.

Michael-E-Rose commented 3 weeks ago

The patch is live, @astrochun - can you confirm that it solves your problem?

pybliometrics-dev / pybliometrics

Refactor AbstractRetrieval().collaboration to deal with multiple collaborations #336