Closed raffaem closed 1 week ago
The number of entities differs after splitting on the semicolon because of the name of the second affiliation: `Chimie &amp; Biologie de la Cellule`. Obviously they introduced an error here; the affilname should simply be "Chimie & Biologie de la Cellule".
That's not directly the fault of pybliometrics, only indirectly, because the semicolon is used to join the list of affilnames into a string. In theory one could use another separator, such as the pipe or the percent sign, but both come with similar problems, as they might correctly or erroneously appear in an affilname as well.
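To make the miscount concrete, here is a minimal sketch in plain Python. It assumes the raw name carries the HTML entity `&amp;`; the first affiliation name is made up, and no pybliometrics call is involved.

```python
# Hypothetical list of affiliation names; the first one is invented,
# the second is assumed to arrive with an escaped ampersand.
names = ["Example University", "Chimie &amp; Biologie de la Cellule"]

# Join with ";" the way a list is flattened into one string.
joined = ";".join(names)

# Splitting again yields one piece too many, because the entity
# itself ends in a semicolon.
parts = joined.split(";")
print(len(names), len(parts))
print(parts)
```

The same thing would happen with any separator that can occur inside a name, which is why switching to the pipe or the percent sign only moves the problem.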
In the present case you have two options:
- `AffiliationRetrieval()` to get the affilname
- `ScopusSearch()._json`
After several tests, I believe we should pass author names and affiliation names to `html.unescape` before returning them to the user.
Yes, that's actually a nice idea. It would fix something that Scopus broke - it is not intended to transmit `&amp;`.
There will still be problems with author names where Scopus erroneously introduced a semicolon (in a name, imagine!). So it will not solve all problems, but a lot.
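A small sketch of both cases, using only the standard library. The affiliation string assumes the escaped ampersand discussed in this thread; the author name is made up for illustration.

```python
import html

# Assumed raw value with an escaped ampersand.
affil = html.unescape("Chimie &amp; Biologie de la Cellule")
# The entity is gone, so the name no longer contains a semicolon.
print(affil.split(";"))

# Invented author name with a literal semicolon: unescaping does not
# help here, and a split on ";" still breaks the name in two.
author = html.unescape("Doe; J.")
print(author.split(";"))
```

So unescaping removes the entity-related semicolons, while literal semicolons in the data would still need separate handling.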
Ideally we should give the user the option to skip the unescape, i.e., use a new parameter `unescape=True`. Wanna provide a PR?
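One way such a switch could look, as a rough sketch only: `join_names` is a hypothetical helper and not an existing pybliometrics function, and the `sep` argument is likewise an assumption added for illustration.

```python
import html
from typing import Iterable


def join_names(names: Iterable[str], unescape: bool = True, sep: str = ";") -> str:
    """Join names into one string, optionally unescaping HTML entities first.

    Hypothetical helper; not part of pybliometrics.
    """
    if unescape:
        # Turn entities such as "&amp;" back into plain characters
        # before joining, so they cannot collide with the separator.
        names = (html.unescape(n) for n in names)
    return sep.join(names)


# The first name is invented; the second assumes the escaped ampersand.
raw = ["Example University", "Chimie &amp; Biologie de la Cellule"]
cleaned = join_names(raw)
untouched = join_names(raw, unescape=False)
print(cleaned)
print(untouched)
```

Defaulting to `unescape=True` matches the suggestion above, while `unescape=False` would keep the strings exactly as received for anyone who relies on them.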
pybliometrics version:
Code to reproduce the bug:
Expected behavior: