scholarly-python-package / scholarly

Retrieve author and publication information from Google Scholar in a friendly, Pythonic way without having to worry about CAPTCHAs!
https://scholarly.readthedocs.io/
The Unlicense
1.42k stars 304 forks source link

pub_year is used as a bib and bibtex field, rather than year, and the bibtex function has incomplete conversion of it as well #431

Open stanleyrhodes opened 2 years ago

stanleyrhodes commented 2 years ago

Describe the bug The bibtex format supports the field year. However, within publication['bib'], pub_year is used as the field, and further, the scholarly.bibtex() function outputs a .bib file with pub_year that then won't import properly (e.g., Zotero import). This behavior can be seen in the quickstart example: https://scholarly.readthedocs.io/en/latest/quickstart.html#bibtex

There is evidence that someone in the past meant to change pub_year to year in the bibtex() function. It's in _BIB_MAPPING but NOT in _BIB_REVERSE_MAPPING for some reason. bibtex() takes _BIB_REVERSE_MAPPING as an argument.

To Reproduce Run bibtex() on any publication where the GS entry has a pub date. The bug can be seen and replicated using the example in the quickstart: https://scholarly.readthedocs.io/en/latest/quickstart.html#bibtex

Expected behavior The bibtex() function should output a .bib file with the proper year field.

However, after some digging around, I would make a stronger claim: I see no reason pub_year should be used in the publication object at all, not in publication[bib] or anywhere else. Consider where it occurs: https://github.com/scholarly-python-package/scholarly/search?q=pub_year

get_publication() > _scholar_pub() > 3 blocks of code that create publication['bib']['pub_year'] fill() also has one place where ['pub_year'] is created. Together that's 4 places in publication_parser.py The BibEntry object has it.

This leaves me with the questions, all of which I currently think are answered with "no": Is there any reason not to just change it at the start, in _scholar_pub() and fill()? Is there any reason why it's in _BIB_MAPPINGbut NOT in _BIB_REVERSE_MAPPING? Is there any reason not to change it in data_types.py for the BibEntryobject?

It would also need to be changed in test_module.py.

Do you plan on contributing? Not immediately, as I am not yet wise to the ways of tests and contributing established for this project. However, I will be revising this code in Scholarly fork in my repo, including the test_module.py. Also, since the test itself needs to be changed, I'm not sure if that necessitates any additional consideration or overhead.

Additional context There's a larger, more general discussion to be had about bibtex() and .bib output, and citation manager outputs in general. I'm no expert on the bibtex format. But from what I can tell, Scholarly outputs a .bib file that is quite non-standard*: utf-8 rather than latin-1 (iso-8859-1). In my case, this is very much a GOOD thing! Since so many fields (e.g., author) can have characters outside those 256, it's better that it does utf-8. Zotero, for example, allows the import of utf-8 .bib files it even though it's non-standard, to my great relief. My view is that Scholarly should be able to pass all the information possible from the GS entry to Zotero and other citation managers. My needs, for example, include the publication's url as well, and that's not bibtex-standard. However, we could easily have a strict .bib setting and a default max-info-lax-standards setting. More complicated formats (e.g., Zotero RDF) could wait until someone is very motivated.

(*) We also have venue in the .bib output which doesn't seem to add much value at all. Around half the time they have a truncated journal title and the ellipses character in them. journal is the useful field, not venue.

arunkannawadi commented 2 years ago

Please open a PR with your changes nevertheless, and we can worry about the tests later on. We don't want diverging forks, since eventually you might not be able to easily upgrade the package as we make new releases.