pybliometrics-dev / pybliometrics

Python-based API-Wrapper to access Scopus
https://pybliometrics.readthedocs.io/en/stable/

Deal with html mistakes in affilnames (&amp;) #338

Closed raffaem closed 1 week ago

raffaem commented 1 month ago

pybliometrics version:

Code to reproduce the bug:

>>> from pybliometrics.scopus import ScopusSearch
>>> res = ScopusSearch("DOI(10.1038/s41556-022-01034-3)")
>>> len(res.results[0].afid.split(";"))
15
>>> len(res.results[0].affilname.split(";"))
16

Expected behavior:

Michael-E-Rose commented 1 month ago

The number of entries differs after splitting on the semicolon because of the name of the second affiliation: "Chimie &amp; Biologie de la Cellule". The escaped ampersand contains a semicolon of its own. Obviously Scopus introduced an error here; the affilname should simply be "Chimie & Biologie de la Cellule".

That's not directly the fault of pybliometrics, only indirectly, because the semicolon is used to join the list of affilnames into a string. In theory one could use other separators, such as the pipe or the percent sign. But both would come with similar problems, as they might, correctly or erroneously, appear in an affilname as well.
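
To illustrate the mismatch, here is a minimal sketch. It assumes the fields are joined with a semicolon as described above; the IDs and the first and third names are made up, and only the second name mirrors the one from this issue.

```python
# Hypothetical sample data: the IDs and the first/third names are invented.
afids = ["60000001", "60000002", "60000003"]
affilnames = [
    "Example University",
    "Chimie &amp; Biologie de la Cellule",  # escaped ampersand, as in this issue
    "Example Institute",
]

# Joining with ";" mimics how a list is flattened into a single string.
afid_field = ";".join(afids)
affilname_field = ";".join(affilnames)

# The ";" inside "&amp;" is indistinguishable from the separator.
n_ids = len(afid_field.split(";"))
n_names = len(affilname_field.split(";"))
print(n_ids, n_names)  # prints: 3 4
```

Any other single-character separator would fail the same way as soon as that character shows up inside a name.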

In the present case you have two options:

  1. Use AffiliationRetrieval() to get the affilname
  2. Get the affilname directly from ScopusSearch()._json

raffaem commented 1 month ago

After several tests, I believe we should pass author names and affiliation names to html.unescape before returning them to the user.
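
As a rough sketch of the idea, using only the standard library: `html.unescape` turns the escaped ampersand back into a plain `&`, so the stray semicolon disappears before the string is split. The sample string is made up apart from the affiliation name from this issue.

```python
import html

# Hypothetical semicolon-joined field containing an escaped ampersand.
affilname_field = (
    "Example University;Chimie &amp; Biologie de la Cellule;Example Institute"
)

# Unescape first, then split: "&amp;" becomes "&" and no longer adds a ";".
cleaned = html.unescape(affilname_field)
names = cleaned.split(";")

print(len(names))  # prints: 3
print(names[1])    # prints: Chimie & Biologie de la Cellule
```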

Michael-E-Rose commented 1 month ago

Yes, that's actually a nice idea. It would fix something that Scopus broke: it is not intended to transmit &amp;. There will still be problems with author names where Scopus erroneously introduced a semicolon (in a name, imagine!). So it will not solve all problems, but a lot of them.

Ideally we should give the user the option to skip the unescape, i.e., use a new parameter unescape=True. Wanna provide a PR?
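
One possible shape for such an option, purely as a sketch: `split_field` below is a hypothetical helper and not part of pybliometrics, and where an `unescape` flag would actually live in the library is left open here.

```python
import html
from typing import List, Optional


def split_field(value: Optional[str], unescape: bool = True,
                sep: str = ";") -> List[str]:
    """Hypothetical helper: split a joined field, optionally unescaping first."""
    if not value:
        return []
    if unescape:
        # Resolve HTML entities such as "&amp;" before looking for separators.
        value = html.unescape(value)
    return value.split(sep)


raw = "Example University;Chimie &amp; Biologie de la Cellule;Example Institute"
with_unescape = split_field(raw)                     # three names
without_unescape = split_field(raw, unescape=False)  # four pieces, as before
print(len(with_unescape), len(without_unescape))     # prints: 3 4
```

A name that contains a literal semicolon would still be split incorrectly with either setting.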