Normalize field keys (to lowercase)

csware commented 9 months ago

Describe the bug

I have several .bib files that contain (mixed) field keys that are either in lowercase or start with a capital letter, such as "Author" and "Title". No other tooling complains about this.

SeparateCoAuthors does not work, and I cannot uniformly access the fields using e.g. entry['title'].

In v1, field keys were normalized to lowercase.

Maybe this can be fixed using a middleware? I would be really grateful!

Reproducing

Version: e3757c13abf2784bda612464843ab30256317e6c

Code:


#!/usr/bin/python

import bibtexparser
import bibtexparser.middlewares as m

layers = [
    m.LatexDecodingMiddleware(),
    m.MonthIntMiddleware(True), # Months should be represented as int (1-12)
    m.SeparateCoAuthors(True), # Co-authors should be separated as list of strings
    m.SplitNameParts(True), # Individual Names should be split into first, von, last, jr parts
    m.MergeNameParts("last", True) # Individual Names should be merged into "Last, First"...
]

bib_database = bibtexparser.parse_file('data/Survey.bib', append_middleware=layers)
for entry in bib_database.entries:
    print(entry['title'])

Bibtex:

@InCollection{Name2006,
  Title                    = {A Title},
  Author                   = {Name, First and Name, Second},
  Booktitle                = {ITS},
  Publisher                = {Some publisher},
  Year                     = {2006},
  Pages                    = {61--70}
}

Remaining Questions (Optional)

Please tick all that apply:

[ ] I would be willing to contribute a PR to fix this issue.
[ ] This issue is a blocker, I'd be grateful for an early fix.

tdegeus commented 9 months ago

Thanks!

[ ] We should add a middleware that normalizes field names.
[ ] We could consider a default lower-case mapping.

Technologicat commented 9 months ago

Maybe something like this (Works For Me™)?

from typing import Collection, Union

import bibtexparser
from bibtexparser.library import Library
from bibtexparser.model import Block, Entry

class NormalizeFieldNames(bibtexparser.middlewares.middleware.BlockMiddleware):
    """Lowercase the key of every field in every entry."""

    def __init__(self,
                 allow_inplace_modification: bool = True):
        super().__init__(allow_inplace_modification=allow_inplace_modification,
                         allow_parallel_execution=True)

    def transform_entry(self, entry: Entry, library: "Library") -> Union[Block, Collection[Block], None]:
        # Note: keys that differ only in capitalization will collide after this step.
        for field in entry.fields:
            field.key = field.key.lower()
        return entry

Usage example:

library = bibtexparser.parse_file(
    filename,
    append_middleware=[
        NormalizeFieldNames(),
        bibtexparser.middlewares.SeparateCoAuthors(),
        bibtexparser.middlewares.SplitNameParts(),
    ],
)

tdegeus commented 9 months ago

That's probably alright. Would you be willing to convert it to a PR (adding a test)? I think this is a quite common use-case that we should support.

MiWeiss commented 9 months ago

Fully agree with @tdegeus, and would appreciate a PR by @Technologicat

Just one remark: We'd have to be able to handle "new" duplicates somehow (i.e., if two field keys exist in the original block which only differ in their capitalization). That's particularly important now that we're pushing the use of entries as dicts. In principle, we have an entry type DuplicateFieldKeyBlock that should be used here, but I am also happy to support additional suggestions. These would probably have to be enabled with a corresponding constructor parameter (e.g. raising an exception). Does this make sense?
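To make the "new duplicates" concern concrete, here is a minimal sketch of how such collisions could be detected before any keys are rewritten. It works on plain strings rather than bibtexparser objects, and `find_case_collisions` is a hypothetical helper name, not part of the library.

```python
from collections import defaultdict


def find_case_collisions(keys):
    """Return {lowercased key: [original keys]} for keys that collide once lowercased.

    Hypothetical helper for illustration; it is not part of bibtexparser.
    """
    groups = defaultdict(list)
    for key in keys:
        groups[key.lower()].append(key)
    # Keep only the lowercased keys that more than one original key maps to.
    return {norm: originals for norm, originals in groups.items() if len(originals) > 1}


collisions = find_case_collisions(["Title", "title", "Author", "Year"])
print(collisions)  # {'title': ['Title', 'title']}
```

A middleware could run a check like this per entry and then decide, based on a constructor parameter, whether to raise, warn, or hand the entry off as a duplicate-key block.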

Technologicat commented 9 months ago

@tdegeus: Sure.

@MiWeiss: Good point about conflicting keys. But I'll need a bit more information about the desired way to tackle it.

For context: yesterday I suddenly needed to extract some data from BibTeX in Python.

Within an hour, I had installed bibtexparser, upgraded it to 2.x, run into this issue (since my data files happened to use capitalized keys), written the simplest possible field key normalizer, and posted a copy here. So it's fair to say I'm kind of new to this project :)

csware commented 9 months ago

A solution would be to issue a warning (similar to library.failed_blocks) and use the last key value.

Technologicat commented 9 months ago

@csware: Thanks. Yes, that's one possible solution, and probably the simplest one that works.

~Considering alternatives, what about the DuplicateFieldKeyBlock mentioned by @MiWeiss?~ EDIT: Never mind, I think I understand now what you all meant.

Technologicat commented 9 months ago

Implemented, using @csware's suggestion of emitting a warning and letting the last value win. Please review.
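For readers skimming the thread, the "warn and let the last value win" behaviour can be sketched roughly as follows. This is not the code from the PR: it uses plain (key, value) tuples as a stand-in for bibtexparser's field objects, and `normalize_fields` is a hypothetical name chosen for illustration.

```python
import warnings


def normalize_fields(fields):
    """Lowercase the keys of (key, value) pairs; on a collision, the last value wins.

    Illustrative stand-in only; plain tuples replace bibtexparser's field objects.
    """
    normalized = {}
    for key, value in fields:
        lowered = key.lower()
        if lowered in normalized:
            # Warn instead of failing, so the remaining fields are still processed.
            warnings.warn(
                f"Duplicate field key after normalization: {lowered!r}; keeping the last value"
            )
        normalized[lowered] = value
    return normalized


fields = [("Title", "A Title"), ("Year", "2006"), ("title", "Another Title")]
print(normalize_fields(fields))  # {'title': 'Another Title', 'year': '2006'}
```

Running this emits one warning for the colliding `title` key and keeps the later value.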

sciunto-org / python-bibtexparser

Normalize field keys (to lowercase) #467