jaimergp / fixbibtex

Fix BibTeX databases with Crossref metadata
MIT License

Enhancement: repairing latin binomials #2

Open jrjhealey opened 6 years ago

jrjhealey commented 6 years ago

Hi Jaime,

Possible enhancement for you!

If pybtex doesn't already correct this, it would be good if fixbibtex could also fix the italicisation of Latin binomials (a fairly simple search-and-replace that switches HTML italics tags to TeX formatting commands). There's an old script online (below) which does essentially this, but it isn't the best Python in the world... Inspired by:

https://twitter.com/MendeleySupport/status/776001527664156672

and

https://itskathylam.wordpress.com/2016/01/12/dealing-with-italics-in-bibtex-files-exported-from-mendeley/

#!/usr/bin/python

# By: Kathy Lam
# Date: January 11, 2016
# Purpose: Replace all instances of "<i>" with "\textit{"
#          and "</i>" with "}" in bibtex file generated by Mendeley

# Context managers close both files even if an error occurs midway.
with open("bibliography.bib", "r") as oldbib, open("new_bibliography.bib", "w") as newbib:
    for line in oldbib:
        # Only title lines that contain an opening italics tag are rewritten.
        if line.startswith("title") and "<i>" in line:
            line = line.replace("<i>", "\\textit{").replace("</i>", "}")
        newbib.write(line)

If there were some logic to catch and handle duplicate entries, that would be really useful too (it's a problem I end up with quite often).
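To illustrate what "catching duplicates" could mean, here is a minimal sketch. It is not part of fixbibtex: the `find_duplicates` and `normalise_title` names, and the assumption that entries are already parsed into a dict of citation key to field dict, are purely hypothetical. Entries are grouped by DOI when they have one, otherwise by a normalised title.

```python
import re


def normalise_title(title):
    """Lower-case a title and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", title.lower())


def find_duplicates(entries):
    """Return lists of citation keys that appear to describe the same work.

    `entries` is a hypothetical layout: {citation_key: {field_name: value}}.
    """
    seen = {}
    for key, fields in entries.items():
        doi = fields.get("doi", "").strip().lower()
        if doi:
            fingerprint = ("doi", doi)
        else:
            title = normalise_title(fields.get("title", ""))
            if not title:
                # Nothing to compare on, so never flag this entry.
                continue
            fingerprint = ("title", title)
        seen.setdefault(fingerprint, []).append(key)
    return [keys for keys in seen.values() if len(keys) > 1]
```

One known limitation of this sketch: an entry with a DOI and a copy of the same work without one will not be matched, since they end up with different kinds of fingerprint.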

Cheers!

Joe

jaimergp commented 6 years ago

Hi Joe! Thanks for the feedback.

I'd say we should regex against some common HTML code in titles (italics, subscript, and superscript, mainly). Do you have any examples at hand?
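Something along these lines, as a rough sketch only. The `html_to_tex` name and the tag-to-command mapping are assumptions for illustration rather than anything fixbibtex currently does, and `\textsubscript` assumes a LaTeX setup that provides it.

```python
import re

# Assumed mapping from HTML tags to LaTeX commands (illustrative only).
TAG_TO_TEX = {
    "i": r"\textit",
    "em": r"\emph",
    "sub": r"\textsubscript",
    "sup": r"\textsuperscript",
}

# Matches <tag>...</tag> for the tags above, non-greedily.
_TAG_RE = re.compile(r"<(i|em|sub|sup)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def html_to_tex(text):
    """Replace simple paired HTML tags in `text` with LaTeX commands."""

    def _replace(match):
        command = TAG_TO_TEX[match.group(1).lower()]
        return "%s{%s}" % (command, match.group(2))

    # Repeat until nothing changes so that nested tags are converted too.
    previous = None
    while previous != text:
        previous = text
        text = _TAG_RE.sub(_replace, text)
    return text
```

For example, `html_to_tex("Genome of <i>Escherichia coli</i>")` gives `Genome of \textit{Escherichia coli}`. Unpaired tags are left untouched, which seems safer than guessing where they should close.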

For the duplicate entries, let's create a separate issue.

jrjhealey commented 6 years ago

Yep ok good idea! I'll open another issue for duplicates.

I'll commit a folder of different examples that I come up with to my fork of the repo, and then make a PR so you can test against them too perhaps?

Currently what I've thought of are an example of:

In my experience it's quite good at converting special characters in names etc., so that's probably enough to cover 90% of the troublesome refs.

Edit:

It looks like sub/superscripts might be difficult, as Mendeley (which I export my bib files from) just coerces them to normal letters/numbers (there is no HTML around them).