Switch to an active citations parser dependency to replace `pybtex`

pybamm-team / PyBaMM

Fast and flexible physics-based battery models in Python

https://www.pybamm.org/

BSD 3-Clause "New" or "Revised" License

1.13k stars 549 forks source link

Switch to an active citations parser dependency to replace `pybtex` #3648

Open agriyakhetarpal opened 11 months ago

agriyakhetarpal commented 11 months ago

If pybtex is dead, should we switch to something else instead?

I found this active project for parsing bibtex: https://github.com/sciunto-org/python-bibtexparser

Originally posted by @kratman in https://github.com/pybamm-team/PyBaMM/pull/3645#issuecomment-1864414897

agriyakhetarpal commented 11 months ago

Making changes in the pybtex source code is unquestionably non-trivial because their functionality and packaging both rely on pkg_resources due to the sheer number of console scripts and entry points they provide which make use of this functionality. There are a lot of differences between the features provided by pkg_resources and that by the newer importlib.resources and importlib.metadata modules, however, some migration guides are available. I will still complete my attempt and hope that I can get all the tests passing, but TBH it's not worth pursuing if we can switch to another dependency altogether.

We use sphinxcontrib-bibtex in the documentation for automated citations which relies on pybtex internally and can continue to do so. In programmatic use for PyBaMM we should be fine to keep the functionality of the Citations class and print_citations() as close to the current one as possible. Therefore, both roads work but one will be easier to traverse.

prady0t commented 11 months ago

Can you suggest some useful resources that might help. Also if no one is already working on this I would love to !

agriyakhetarpal commented 11 months ago

Hi @prady0t, we currently use pybtex programmatically to convert BibTeX-style citations listed in pybamm/CITATIONS.bib to a compliant text-based citation that can be used for certain functions such as pybamm.print_citations() – please refer to the code in pybamm/citations.py.

We also use sphinxcontrib-bibtex that uses pybtex internally as mentioned above, which can be left unchanged. What we need here is a new BibTeX compliant parser dependency (such as https://github.com/sciunto-org/python-bibtexparser, which would be the best one to go with provided it can give us what we need, but feel free to suggest any better, maintained ones that you can find) and we need to replace the code in citations.py to use this dependency. The workflow and functionality for the Citations class can remain the same.

For example, the Chen2020 parameter set / chemistry:

@article{Chen2020,
  author = {Chen, Chang-Hui and Brosa Planella, Ferran and O'Regan, Kieran and Gastol, Dominika and Widanage, W. Dhammika and Kendrick, Emma},
  title = {{Development of Experimental Techniques for Parameterization of Multi-scale Lithium-ion Battery Models}},
  journal = {Journal of The Electrochemical Society},
  volume = {167},
  number = {8},
  pages = {080534},
  year = {2020},
  publisher = {The Electrochemical Society},
  doi = {10.1149/1945-7111/ab9050},
}

gets converted to UTF-8 plaintext, using pybtex, to:

Chang-Hui Chen, Ferran Brosa Planella, Kieran O’Regan, Dominika Gastol, W. Dhammika Widanage, and Emma Kendrick. "Development of Experimental Techniques for Parameterization of Multi-scale Lithium-ion Battery Models." Journal of the Electrochemical Society 167 (2020): 080534.

which we use further to print citations or create citation tags. The functions in the Citations class are thin wrappers around pybtex's functionality, we want to use python-bibtexparser's API to do the same thing. If it's not possible to entirely replicate our existing features, slight modifications can be suggested as needed. The unit tests for this class and its methods are in tests/unit/test_citations.py, you will need to modify those as well to remove instances where pybtex-related functions are being used.

I will suggest starting with a design for a method that takes in some BibTeX entries first (i.e., modify Citations.read_citations() and Citations._add_citation()), and then saves them to a private attribute inside the __init__ dunder method of the class – before proceeding to the rest of the functionality (adding your own citations, overwriting an existing citation, registering a citation, tagging it, and printing).

Note that pybtex is an optional dependency for us (it is listed as a [cite] extra in pyproject.toml), and so will python-bibtexparser be as such. These guidelines must be followed for working with optional dependencies.

P.S. I will label this as a medium-difficulty issue based on the extent of the tasks required. Maybe we can take (some) ideas from the citations workflow for the Diffrax Python library.

prady0t commented 11 months ago

Thanks a lot for the detailed info! This would really be helpful

prady0t commented 10 months ago

Can you please explain why do we use citations and how does it help? For example :here. Also how does it register a citation above?

agriyakhetarpal commented 10 months ago

Right, a little bit of background won't hurt – we use this citations workflow for letting researchers using PyBaMM to know what papers they are supposed to cite for their publications as described in the Citing PyBaMM section.

A citation is registered through the pybamm.citations.register() method in the Citations class. It receives an input to a citation key present in CITATIONS.bib, say Chen2020 – pybamm.citations.register("Chen2020") will add it to the _papers_to_cite set, which will be processed further using the _parse_citation method - in the end, the print_citations() method is supposed to collate all the citations converted from BibTeX format to plaintext format and showcase them to the user.

An example of a citation being registered can be found by looking into the __init__ dunder method for any of the classes for the battery models and the solvers (for most of them, AFAIR). You can use the ⌘ + Shift + F key combination to look for those throughout the source code.

This way, the citations class keeps parsing, collecting, and preparing plaintext citations throughout the course of a scientific experiment conducted via PyBaMM's functionality (i.e., please refer to the example Jupyter notebooks).

P.S. I feel in hindsight that the functionality for citation tags (see https://github.com/pybamm-team/PyBaMM/pull/2961#issuecomment-1550349625) made things a bit complicated (citations are now not parsed at runtime, but at the time of printing them) – maybe we can make this data flow more streamlined as we go forward with this dependency replacement and go back to parsing them for correctness at an appropriate position.

prady0t commented 10 months ago

I was successfully able to parse using bibtexparser but it comes with some limitations. Referring : article we cannot convert parsed data to text as done by pybtex. https://github.com/pybamm-team/PyBaMM/blob/c3def31a04df163769851733657d9b49b3700bca/pybamm/citations.py#L233

But we can make a custom function for converting to desired type of string output. Logic can look something like this :

import bibtexparser
bibtex_str = """
@article{Harris2020,
  title = {{Array programming with NumPy}},
  author = {Harris, Charles R. and Millman, K. Jarrod and van der Walt, St{\'{e}}fan J. and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J. and others},
  journal = {Nature},
  volume = {585},
  number = {7825},
  pages = {357--362},
  year = {2020},
  publisher = {Nature Publishing Group},
  doi = {10.1038/s41586-020-2649-2},
}
@article{Sulzer2021,
  title = {{Python Battery Mathematical Modelling (PyBaMM)}},
  author = {Sulzer, Valentin and Marquis, Scott G. and Timms, Robert and Robinson, Martin and Chapman, S. Jon},
  doi = {10.5334/jors.309},
  journal = {Journal of Open Research Software},
  publisher = {Software Sustainability Institute},
  volume = {9},
  number = {1},
  pages = {14},
  year = {2021}
}
"""
library = bibtexparser.parse_string(bibtex_str)
entries = library.entries
author = str
title = str
journal = str
volume = str
year = str
pages = str
txt_format = str
count = 0
for entry in entries:
    #key = entry.key
    count += 1
    for key, value in entry.items():
        if key == "author":
            author = value
        if key == "title":
            title = value
        if key == "journal":
            journal = value
        if key == "volume":
            volume = value
        if key == "year":
            year = value
        if key == "pages":
            pages = value
    txt_format = f'''[{count}] {author},"{title}" {journal} {volume} ({year}): {pages}'''
    print(txt_format)

Output:

[1] Harris, Charles R. and Millman, K. Jarrod and van der Walt, St{'{e}}fan J. and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J. and others,"{Array programming with NumPy}" Nature 585 (2020): 357--362
[2] Sulzer, Valentin and Marquis, Scott G. and Timms, Robert and Robinson, Martin and Chapman, S. Jon,"{Python Battery Mathematical Modelling (PyBaMM)}" Journal of Open Research Software 9 (2021): 14

Whereas output of pybamm.print_citations() looks like this:

[1] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, and others. Array programming with NumPy. Nature, 585(7825):357–362, 2020. doi:10.1038/s41586-020-2649-2.
[2] Valentin Sulzer, Scott G. Marquis, Robert Timms, Martin Robinson, and S. Jon Chapman. Python Battery Mathematical Modelling (PyBaMM). Journal of Open Research Software, 9(1):14, 2021. doi:10.5334/jors.309.

Obviously we can change output format as per needed.

P.S. There's another parser mentioned in the article mentioned above, Citeproc-py. We can also maybe look into that (if printing like this is not satisfactory)

agriyakhetarpal commented 10 months ago

Thanks for the article @prady0t, I had a brief look at citeproc-py and while it does look to be full of features, using it doesn't look very promising to me, for two reasons:

It is unmaintained – the last CI run was 8 months ago, and
it claims to support Python 3.7+, but the developers haven't started testing against Python 3.12 yet – which arrived in October, which, in-turn, was less than eight months ago.

Switching to such a dependency would be a step in the wrong direction as we start supporting newer Python versions and dropping older ones every year.

OTOH, python-bibtexparser may be limited in functionality, but the maintainers for it are responsive. We don't need to use all of their features, i.e., the Citations class is just a data processing system of sorts written to work with the .bib data storage format (and a few other quality-of-life features of course). We can ensure that our .bib file is correctly parsed for each paper at the time of writing unit tests. Your attempt with a custom function to convert the BibTeX string looks like a good start and I don't mind the slight variation (having surnames of authors first should be okay) in the output of pybamm.print_citations – but is it possible to extract the DOI or other URLs, since they are currently missing in it? Once you have an initial, working prototype for most of the current functionality in functions like these, we can start constructing a class that provides these functions to the users.

P.S. you should avoid hardcoding the types of fields in a BibTeX entry as much as possible and look at a dynamic parser (the python-bibtexparser documentation mentions that it can perform the validation for you) instead. It will not be scalable, and we are bound to miss a field in some citation or the other (such as the DOIs for NumPy and PyBaMM in this case) which can lead to data loss between the phases of reading and writing (the current Citations class supports the addition of external citations too, not just the ones we reference!).

prady0t commented 10 months ago

If we do not use any kind of hardcoding and just print the parsed values as string something like this :

for entry in entries:
    key = entry.key
    count += 1
    for key, value in entry.items():
        txt_format = txt_format + " " + value
    print(f"[{count}] {txt_format}")
    txt_format = ' '

We gat this output :

[1]   article Harris2020 {Array programming with NumPy} Harris, Charles R. and Millman, K. Jarrod and van der Walt, St{'{e}}fan J. and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J. and others Nature 585 7825 357--362 2020 Nature Publishing Group 10.1038/s41586-020-2649-2
[2]   article Sulzer2021 {Python Battery Mathematical Modelling (PyBaMM)} Sulzer, Valentin and Marquis, Scott G. and Timms, Robert and Robinson, Martin and Chapman, S. Jon 10.5334/jors.309 Journal of Open Research Software Software Sustainability Institute 9 1 14 2021

It does print all values without data loss but their sequence depends upon bibtex entry.(Notice how title comes before names) We can somewhat fine tune the output by applying conditions like :

        if key != "ID" and key != "ENTRYTYPE":
            txt_format = txt_format + " " + value

will result in output :

[1]   {Array programming with NumPy} Harris, Charles R. and Millman, K. Jarrod and van der Walt, St{'{e}}fan J. and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J. and others Nature 585 7825 357--362 2020 Nature Publishing Group 10.1038/s41586-020-2649-2
[2]   {Python Battery Mathematical Modelling (PyBaMM)} Sulzer, Valentin and Marquis, Scott G. and Timms, Robert and Robinson, Martin and Chapman, S. Jon 10.5334/jors.309 Journal of Open Research Software Software Sustainability Institute 9 1 14 2021

Is this approach good enough? I'm also able to extract all links and DOI with this approach. I am looking into middlewares now. It might provide better output.

Rest I was able to modify all functions inside citations.py to use bibtexparser dependency now and everything's working as before! I just have to modify test cases and the output format mentioned above.

agriyakhetarpal commented 10 months ago

Looks great to me :) I do wonder if we can use the middlewares – it would be great to incorporate them if they can provide better outputs. We are not in a rush at this time about this (already patched up on our end temporarily and we won't add or drop support for another Python version soon), so I advise taking up as much time as you need for this for the new design.

This approach might just work – though the output format is a bit clunky as you mentioned; could we try and aim to match the current Pybtex output in this fashion (the names of the authors first, followed by the paper, where it was published, etc. and the rest of the fields) as close as it is possible? We will have to test this approach across all the papers currently listed in the .bib file in the unit tests. I would recommend testing against the BibTeX for a few external battery modelling papers that we don't implement (and other general, popular papers in science) as well to enhance test coverage.

Are you currently able to parse a single @article entry from CITATIONS.bib (i.e., for the citations.register functionality)? We might need to use a bit of regex or some complex parsing for that (see the linecache standard module for Python), but as long as it doesn't cause regressions and gets tested properly, we should be fine with that.

prady0t commented 10 months ago

Yes I am. All models are able to register respective citations and they are later printed when print_citations() is called. The only change is in the format of the parse data (stored in value of _all_citations dict). Example :

model1 = pybamm.lithium_ion.SPM()

nc.print_citations()

Gives this output:

[[Field(key=`title`, value=`{An asymptotic derivation of a single particle model with electrolyte}`, start_line=268), Field(key=`author`, value=`Marquis, Scott G. and Sulzer, Valentin and Timms, Robert and Please, Colin P. and Chapman, S. Jon`, start_line=269), Field(key=`journal`, value=`Journal of The Electrochemical Society`, start_line=270), Field(key=`volume`, value=`166`, start_line=271), Field(key=`number`, value=`15`, start_line=272), Field(key=`pages`, value=`A3693--A3706`, start_line=273), Field(key=`year`, value=`2019`, start_line=274), Field(key=`publisher`, value=`The Electrochemical Society`, start_line=275), Field(key=`doi`, value=`10.1149/2.0341915jes`, start_line=276)], [Field(key=`title`, value=`{Array programming with NumPy}`, start_line=166), Field(key=`author`, value=`Harris, Charles R. and Millman, K. Jarrod and van der Walt, St{\'{e}}fan J. and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J. and others`, start_line=167), Field(key=`journal`, value=`Nature`, start_line=168), Field(key=`volume`, value=`585`, start_line=169), Field(key=`number`, value=`7825`, start_line=170), Field(key=`pages`, value=`357--362`, start_line=171), Field(key=`year`, value=`2020`, start_line=172), Field(key=`publisher`, value=`Nature Publishing Group`, start_line=173), Field(key=`doi`, value=`10.1038/s41586-020-2649-2`, start_line=174)], [Field(key=`title`, value=`{Python Battery Mathematical Modelling (PyBaMM)}`, start_line=481), Field(key=`author`, value=`Sulzer, Valentin and Marquis, Scott G. and Timms, Robert and Robinson, Martin and Chapman, S. Jon`, start_line=482), Field(key=`doi`, value=`10.5334/jors.309`, start_line=483), Field(key=`journal`, value=`Journal of Open Research Software`, start_line=484), Field(key=`publisher`, value=`Software Sustainability Institute`, start_line=485), Field(key=`volume`, value=`9`, start_line=486), Field(key=`number`, value=`1`, start_line=487), Field(key=`pages`, value=`14`, start_line=488), Field(key=`year`, value=`2021`, start_line=489)]]

Citations registered: 

Sulzer2021 was cited due to the use of Citations
Harris2020 was cited due to the use of Citations
Marquis2019 was cited due to the use of SPM

I've not added function of string formatting in citations.py yet hence this output. But that's just a matter of copy paste. Is this a good progress?

agriyakhetarpal commented 10 months ago

Yes, this looks good – thanks for sharing. Once you are ready with the string formatting, feel free to raise a PR! We can go forward with further discussions there, or here itself if you don't wish to raise a PR just yet.

prady0t commented 10 months ago

@agriyakhetarpal I've drafted a PR so that we can see progress.