sciunto-org / python-bibtexparser

Bibtex parser for Python 3
https://bibtexparser.readthedocs.io
MIT License
468 stars 130 forks source link

:bug: Add encoding to file writing function #405

Open claell opened 1 year ago

claell commented 1 year ago

Fixes #403

claell commented 1 year ago

Right now, this is just adapted from #395. Not tested, just quickly done in GitHub editor.

claell commented 11 months ago

Sorry, am rather busy right now, so not sure when (or if) I'll have time for that.

MiWeiss commented 11 months ago

Sorry, am rather busy right now, so not sure when (or if) I'll have time for that.

Thanks for the comment. I'll mark the PR to be ready for anyone to continue working on this.

If anyone is willing to do so, or if @claell resumes his work: Please comment here to avoid having two people working on the same thing at the same time.

p.s. Continuing to work on the PR can be done as follows: Add the fork of @claell as additional git remote, pull the branch, and then push it to your own fork and open a new PR, stating that that one is supposed to replace this one.

agriyakhetarpal commented 8 months ago

Hi, we're looking to add python-bibtexparser as a dependency for the citations workflow of our Python package soon – I thought it would be nice to contribute as well, perhaps by unblocking a few PRs; such as this one and suchlike.

I am happy to take over from @claell and write a few tests, though the issue is that I fail to reproduce the original issue (#403) at this time:

Using the core functionality, i.e., without any customizations to the formatting options for writing, I am able to write an entry:

import bibtexparser
from bibtexparser import *
bib_library = bibtexparser.Library()
fields = []
fields.append(bibtexparser.model.Field("author", "ö"))
entry = bibtexparser.model.Entry("ARTICLE", "test", fields)
bib_library.add(entry)
bibtexparser.write_file("my_new_file.bib", bib_library)

i.e., the same MWE as previously reported in https://github.com/sciunto-org/python-bibtexparser/issues/403#issue-1905224824, and parsing my_new_file.bib seems to work without issues.

my_new_file.bib

@ARTICLE{test,
author = {ö}
}

and reading this programmatically:

import bibtexparser
library = bibtexparser.parse_file("my_new_file.bib")

Therefore, library.entries_dict returns

{'test': Entry(entry_type=`article`, key=`test`, fields=`[Field(key=`author`, value=`ö`, start_line=1)]`, start_line=0)}

which I am able to write to a new .bib file via the bibtexparser.write_file() method without any loss of data or missing umlauts symbols. Has this been fixed, or cannot be reproduced, or there is a different method of looking at the error that I may have missed in oversight?

MiWeiss commented 8 months ago

Hi @agriyakhetarpal

Great, thanks a lot for taking over!

Strange to hear that the problem cannot be reproduced, I am not aware of any changes we merged recently (although I have not checked ;-) ). As we're talking about encoding, I would not be surprised if the behavior was somewhat system-dependent, which could explain the issue at hand.

In either case, I think we should still merge this PR - to keep the interface consistent and generally applicable. While it's not ideal to do so without reproducing the problem above, I'd suggest implementing tests similar to #395 - this should, at the very least, protect us from some regressions. Do you agree?

I thought it would be nice to contribute as well

That would be amazing. I'll be happy to help you along the way wherever I can (feedback, reviews, ...)

agriyakhetarpal commented 8 months ago

As we're talking about encoding, I would not be surprised if the behavior was somewhat system-dependent, which could explain the issue at hand.

In either case, I think we should still merge this PR - to keep the interface consistent and generally applicable. While it's not ideal to do so without reproducing the problem above, I'd suggest implementing tests similar to #395 - this should, at the very least, protect us from some regressions. Do you agree?

I'll be happy to help you along the way wherever I can (feedback, reviews, ...)

I could not agree more. I am aware systems like Windows choose CP1252 by default for writing to files if UTF-8 isn't specified – maybe the reason I'm not seeing this error is because I'm on macOS?

Is it fine if we can discuss things here itself before I proceed to write a PR? I couldn't wrap my head around the source code for the middlewares and customisers for now—very custom engineering TBF—but I did manage to write a simple test for test_writer.py, as follows.

def test_write_article_with_umlauts():
    entry_block = Entry(
        entry_type="article",
        key="myKey",
        fields=[
            Field(key="title", value='"myTitle"'),
            Field(key="author", value='"Müller, Max"'),
        ],
    )
    library = Library(blocks=[entry_block])
    string = writer.write(library)
    assert string == '@article{myKey,\n\ttitle = "myTitle",\n\tauthor = "Müller, Max"\n}\n'

I would appreciate feedback on this, afterwards we should be able to pytest-parameterize this with a few other characters (ö, ä, ë, and other diacritics you can think of!)

Edit: removed some redundant print statements, was just debugging something to stdout

MiWeiss commented 8 months ago

Hi @agriyakhetarpal

Your suggested test, as I read it, would not actually test the method targeted in this PR.

Instead, you would have to test the method bibtexparser.write_file and included the newly added encoding parameter. This will write a new bibtex file. You could then check if this file contains the expected content in the expected encoding.

agriyakhetarpal commented 8 months ago

Instead, you would have to test the method bibtexparser.write_file and included the newly added encoding parameter. This will write a new bibtex file. You could then check if this file contains the expected content in the expected encoding.

Makes sense. I improved the test such that it parses a given BibTeX string with umlauts symbols, writes to a temporary file, and reads from it; as follows:

def test_write_file_with_umlauts():
    bibtex_str = """@article{umlauts,
    author = {Müller, Hans},
    title = {A title},
    year = {2014},
    journal = {A Journal}
    }"""
    library = parse_string(bibtex_str)
    with tempfile.NamedTemporaryFile(mode="w", encoding="utf-8") as f:
        write_file(f, library, encoding="utf-8")
        f.seek(0)
        library = parse_file(f.name, encoding="utf-8")
    assert library.entries[0]["author"] == "Müller, Hans"
    assert library.entries[0]["title"] == "A title"
    assert library.entries[0]["year"] == "2014"
    assert library.entries[0]["journal"] == "A Journal"

I opted to use temporary files so as to not create any clutter. Does this look fair enough? I can then write a few more tests or parameterize this as needed, perhaps with a few more characters.