Abbreviate first names and including start page only in BibTeX export

retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts

https://retorque.re/zotero-better-bibtex/

MIT License

Abbreviate first names and including start page only in BibTeX export #836

Closed cronox85 closed 5 years ago

cronox85 commented 6 years ago

Hi, first of all, thanks for the great effort you've put into BBT! It seems that Zotero + BBT + ZotFile gives us a citation management solution that works pretty much perfectly for our group.

We're physicists and most of the time we publish in APS journals using revtex4-1. I really like its option "longbibliography", which includes the title of the references. Unfortunately, "longbibliography" also results in full first names being displayed (if available) and pages being shown in the form "start-end". However, full first names and page ranges are typically not available for all references, which leads to a rather inconsistent bibliography. I'm pretty pedantic in that respect ...

My questions (I've looked around but haven't found solutions so far):

1. Is there a way to abbreviate the first names of the authors in a BibTeX export? I'm aware that such output is usually created via the bst style file. However, to ensure that everything carries over properly when submitting papers, I'd strongly prefer to achieve this result via the bib file and plain revtex4-1 + longbibliography. I'm wondering whether there is a solution via either a hidden preference or the Postscript window of BBT?
2. Is there a way to include only the starting page (and not the end page) in a BibTeX export? Presumably, that's a pretty similar issue.

Speaking of the postscript window: Your hack for switching on LaTeX in the title works nicely:

if (Translator.BetterBibTeX && this.has.title) {
  this.add({ name: 'title', value: this.item.title.replace(/(\$.*?\$)/g, '<pre>$1</pre>'), replace: true });
}

In contrast, your sorting algorithm for the BibTeX entry doesn't result in any changes in the output. Am I missing something?

if (Translator.BetterBibTeX) {
  var order = ['author', 'title', 'volume', 'pages', 'year', 'doi'];
  this.fields.sort(function(a, b) {
    var oa = order.indexOf(a.name);
    var ob = order.indexOf(b.name);
    if (oa < 0) { return 1; }
    if (ob < 0) { return -1; }
    return oa - ob;
  });
}

I'm using Zotero 5.0.30 with Better-BibTeX 5.0.50 and ZotFile 5.0.6. Thanks in advance.

retorquere commented 6 years ago

The field sorting no longer works; BBT no longer has the fields property for the Bib(La)TeX translators, and the fields no longer live in an array but in a hash.
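As a rough illustration of what "fields in a hash" means for the ordering idea, here is a minimal standalone sketch. It assumes the fields are held in a plain object keyed by field name, which the thread does not confirm in detail, and `orderFields` is a hypothetical helper, not part of BBT's postscript API. It only computes an ordered list; whether BBT would emit fields in that order is not something this thread establishes.

```javascript
// Hypothetical helper (not a BBT API): return [name, value] pairs from a
// plain object, with names listed in `order` first and all others after.
function orderFields(fields, order) {
  // Unlisted names get a rank past the end of `order`, so they sort last.
  const rank = (name) => {
    const i = order.indexOf(name);
    return i < 0 ? order.length : i;
  };
  return Object.keys(fields)
    // Ties (two unlisted names) fall back to alphabetical order.
    .sort((a, b) => rank(a) - rank(b) || a.localeCompare(b))
    .map((name) => [name, fields[name]]);
}

const ordered = orderFields(
  { year: '2018', doi: 'x', author: 'Doe', note: 'n', abstract: 'a' },
  ['author', 'title', 'volume', 'pages', 'year', 'doi']
);
console.log(ordered.map(([name]) => name).join(','));
```

Unlike the comparator in the original snippet, this one gives a consistent answer when neither field name is in the list.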

The other two are achievable; the pages could be done using

if (Translator.BetterBibTeX && this.item.pages)
  this.add({ name: 'pages', value: this.item.pages.replace(/[-\u2012-\u2015\u2053].*/, '' ), replace: true})

The author abbreviation could be done by modifying this.item.creators and then calling addCreators, but that requires some changes on the BBT side: addCreators isn't exposed to the postscript right now, and changing anything in this.item will likely corrupt the cache until I check in a change that prevents the postscript from affecting it. I have the cache protection lined up, but can't check in from where I am now.
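To see what the pages regex does outside Zotero, here is a small standalone sketch. `firstPage` is a hypothetical name used only for this illustration, and the added `trim()` is an assumption to cope with ranges written with spaces around the dash. The character class is the one from the postscript above: a hyphen, the dashes U+2012 through U+2015, and the swung dash U+2053.

```javascript
// Hypothetical helper: keep only the start page of a page range.
// Everything from the first hyphen or dash onward is dropped.
function firstPage(pages) {
  return pages.replace(/[-\u2012-\u2015\u2053].*/, '').trim();
}

console.log(firstPage('123-130'));       // hyphen-separated range
console.log(firstPage('123\u2013130'));  // en-dash-separated range
console.log(firstPage('045101'));        // article number, left unchanged
```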

retorquere commented 6 years ago

The name-fiddling will have to be done in a postscript BTW. I do not want to add more name manipulation to BBT, and name abbreviation is hairy. The name manipulation I do have just uses the Zotero citation processor, I haven't written it myself.

Just so you know what you may be up against, Zotero's (and most bibliography tools except recent biblatex versions) assumption that names are either one-part or firstname-lastname is a simplification that does not always hold true; name processing can get pretty insane. The name processing of the built-in citation processor does a decent job and I make good use of it, but it's certainly not even close to robust against that last list.

You can add an END OF GUARDED AREA (U+0097) character in a firstname and BBT will export the name in such a way that bib(la)tex will abbreviate names (if the style does abbreviation) such as Ph<U+0097>ilippe as Ph. and not P.. P. isn't even necessarily the best abbreviation for Philip even in English, and "Yuri Gagarin" (Юрий Гагарин) abbreviates to "Yu. Gagarin" since the "Yu" corresponds to a single Russian letter. I wish you well on the rabbit hole descent you're about to embark on 😆 .

This EOGA trick only works for a single initial at the moment, and doesn't work in anything but BBT. So far, citeproc seems unbothered by the zero-width character: it does output it, but it doesn't seem to affect rendering.
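The idea behind the marker can be illustrated with a few lines of plain JavaScript. This is only a sketch of the concept, not how BBT implements it (BBT hands the work to bib(la)tex), and `initialWithMarker` is a hypothetical name: if the first name contains U+0097, everything before it is treated as the initial; otherwise the first character is used.

```javascript
const EOGA = '\u0097'; // END OF GUARDED AREA, used as an invisible cut mark

// Hypothetical helper: derive an initial, honouring an EOGA marker if present.
function initialWithMarker(firstName) {
  if (!firstName) return '';
  const cut = firstName.indexOf(EOGA);
  // A marker after at least one character delimits a multi-letter initial.
  if (cut > 0) return firstName.slice(0, cut) + '.';
  // No marker: fall back to the first code point of the name.
  return Array.from(firstName)[0] + '.';
}

console.log(initialWithMarker('Ph' + EOGA + 'ilippe'));
console.log(initialWithMarker('Philippe'));
```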

cronox85 commented 6 years ago

> The field sorting no longer works; BBT no longer has the fields property for the Bib(La)TeX translators, and the fields no longer live in an array but in a hash.

That explains a lot. I already had doubts regarding my copy, paste & edit skills ... The sorting was actually only one of those things that are nice-to-have-but-not-necessary, as it only affects the bib-file. So that's no major issue.

Thanks for the solution for the pages. It seems to work perfectly.

Regarding the first name abbreviation: I was (partly) aware that the whole story can become quite a mess. For instance, in order to prevent ambiguities, my Spanish colleagues use only one of their family names for publications and many (myself included) omit second names. And I know of quite a few colleagues who "Americanize" their names for similar reasons. Still, I underestimated how complex that issue actually might become ... thanks for that webpage.

Nevertheless, as far as I can tell, the journals (at least those we are publishing in) also seem to use rather "simple" solutions and I would be totally happy with a postscript emulating the result. I guess the mechanism for abbreviations in revtex4-1 is also rather crude. Basically, I imagine something that cuts every first name from the "first name" field in Zotero down to the first (Latin) letter, adds a ".", and retains whether there was a " " or a "-" between the names. So for instance "Doe, John" becomes "Doe, J.", "Doe, John Micheal" becomes "Doe, J. M.", and "Doe, John-Micheal" becomes "Doe, J.-M." (stupid example, I know).

This simple mechanism will create "wrong" abbreviations in certain cases (such as your example of Yuri Gagarin being abbreviated as Y. Gagarin) but, as mentioned above, the result corresponds to the name handling of the journals we commonly publish in.
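The mechanism described in words above can be sketched in a few lines of plain JavaScript. This is only an illustration of the request under its own simplifying assumptions (names are split at single spaces and hyphens, and the first character of each part is taken as the initial); `abbreviateSimple` is a hypothetical name and nothing here touches BBT.

```javascript
// Hypothetical helper: reduce each space- or hyphen-separated part of a
// first name to its first character plus ".", keeping the separators.
function abbreviateSimple(firstName) {
  return firstName
    .split(/([ -])/) // the capture group keeps " " and "-" in the result
    .map((part) =>
      part === ' ' || part === '-' || part === ''
        ? part // separators (and empty pieces) pass through unchanged
        : Array.from(part)[0] + '.')
    .join('');
}

console.log(abbreviateSimple('John'));
console.log(abbreviateSimple('John Micheal'));
console.log(abbreviateSimple('John-Micheal'));
```

As the post itself concedes, this gives "Y." for Yuri; it makes no attempt at language-aware initials.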

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5090 ("addCreators").

retorquere commented 6 years ago

5090 has the required changes.

I'm not promising that I'll always be available to engineer this out because as mentioned, this can get quite complicated, and my life has become rather insanely busy recently, but this is a first stab. Please do not try this on a build you already have installed, only on 5090+, as you are probably changing the cache by doing this on existing pre-5090 builds.

    if (Translator.BetterBibTeX) {
      for (const creator of this.item.creators) {
        if (creator.firstName) {
          creator.firstName = creator.firstName.replace(/([A-Z])[a-z\u00C0-\u017F]*\.?/g, '$1.').replace(/ /g, '')
        }
      }

      this.addCreators()
    }

Testing this in Zotero itself is a bit of a pain, as you would have to trawl the logs to see what isn't working the way you want it to. I would strongly advise installing node; that way you can test it using this script:

var Translator = { BetterBibTeX: true };

const references = require('./test/fixtures/export/(non-)dropping particle handling #313.json')

const ref = {
  addCreators() { console.log(this.item.creators) },

  postscript() {
    // start: this goes in the postscript field
    if (Translator.BetterBibTeX) {
      for (const creator of this.item.creators) {
        if (creator.firstName) {
          creator.firstName = creator.firstName.replace(/([A-Z])[a-z\u00C0-\u017F]*\.?/g, '$1.').replace(/ /g, '')
        }
      }

      this.addCreators()
    }
    // end: this goes in the postscript field
  }
}

for (const item of references.items) {
  ref.item = item
  ref.postscript()
}

You can export some references from your own library using the "BetterBibTeX JSON" exporter and stick that in the line that starts with const references.

Doing this I found another edge case: you will want to account for names that have non-ASCII letters, and names that already are written down as initials 🙄. I've adapted the script to (try to) handle those. This will require a lot of trial and error before you have what you want.

cronox85 commented 6 years ago

Ok, thanks so far.

I've downloaded node and will try around. Presumably, however, that won't happen before next Monday or Tuesday. I'll report back as soon as possible.

retorquere commented 6 years ago

It's not a necessity to use node (the postscript will work in 5090) but you'll get much quicker turnaround, and the zotero logs are pretty chatty with debugging on, so you'd spend a lot of time poring over those.

cronox85 commented 6 years ago

After some tests with my current library and some dedicated "complex" examples, I found three cases for which I wasn't happy with the result:

Kimura, Ren'iti was exported as Kimura, R.'iti
Şaşıoğlu, Ömer-Bob was exported as {\c S}a{\c s}{\i}o{\u g}lu, {\"O}mer-B.
James, LeBron DeMarcus was exported as James, L.B.D.M.

In addition, I prefer to keep the blanks between independent (abbreviated) first names. After trying around a bit with your code, I think I half-understood how it works and modified it accordingly. The ugly list of unicode characters corresponds to all capital letters in the range \u00C0-\u017F (following https://en.wikipedia.org/wiki/List_of_Unicode_characters).

if (Translator.BetterBibTeX) {
  for (const creator of this.item.creators) {
    if (creator.firstName) {
      creator.firstName = creator.firstName.replace(/([A-Z\u00C0-\u00DE\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C\u011E\u0120\u0122\u0124\u0126\u0128\u012A\u012C\u012E\u0130\u0132\u0134\u0136\u0139\u013B\u013D\u013F\u0141\u0143\u0145\u0147\u014A\u014C\u014E\u0150\u0152\u0154\u0156\u0158\u015A\u015C\u015E\u0160\u0162\u0164\u0166\u0168\u016A\u016C\u016E\u0170\u0172\u0174\u0176\u0178\u0179\u017B\u017D])[A-Za-z\u00C0-\u017F']*\.?/g, '$1.').replace(/ /g, ' ')
    }
  }
  this.addCreators()
}

Now the three persons listed above are exported as "Kimura, R.", "{\c S}a{\c s}{\i}o{\u g}lu, {\"O}.-B.", and "James, L. D.". Multiple first names seem to be handled properly. The same holds true for any "-" that are present. I will thoroughly test everything during the next few days (I'm currently migrating my library into Zotero). In case I stumble over names that contain characters not handled yet, I will just add them inside the corresponding square brackets. But so far, I'm pretty happy with the solution/result. Thanks a lot!
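For comparison, the same matching idea can be written more compactly with Unicode property escapes. This is a sketch under a loud assumption: it needs a JavaScript engine with ES2018 `\p{...}` support and the `u` flag, which the Zotero version discussed in this thread may not provide, so it is meant for testing in a recent node rather than as a drop-in postscript. `abbreviate` is a hypothetical name.

```javascript
// An uppercase letter followed by any run of letters or apostrophes, with an
// optional trailing dot. \p{Lu} and \p{L} stand in for the explicit list of
// code points in the postscript above.
const NAME_PART = /(\p{Lu})[\p{L}']*\.?/gu;

// Hypothetical helper: abbreviate every matched part, keep " " and "-" as-is.
function abbreviate(firstName) {
  return firstName.replace(NAME_PART, '$1.');
}

console.log(abbreviate("Ren'iti"));
console.log(abbreviate('\u00D6mer-Bob')); // "Ömer-Bob" with a precomposed Ö
console.log(abbreviate('LeBron DeMarcus'));
```

Unlike the explicit list, `\p{Lu}` also matches uppercase letters outside the Latin ranges, which may or may not be what a given journal style expects.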

retorquere commented 6 years ago

But I think you'll see now why I didn't want to bake this into BBT :laughing:. The .replace(/ /g, ' ') doesn't do anything I think -- if I'm reading that correctly, it says "replace all single spaces with a single space".

I cheated a bit and folded the 5090-changes into v5.0.54.

jotelha commented 5 years ago

A few days ago, when looking for a way to resolve the first name abbreviation issue, I came across this post https://stackoverflow.com/questions/9862761/how-to-check-if-character-is-a-letter-in-javascript/25850689#25850689 for first letter recognition and put it together into a BBT post-processing script before actually finding this BBT issue here. Similar to the regex snippet shown above, the script replaces whitespace-separated components of the first name field conditionally if their first character is identified as a "letter".
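The script itself is not reproduced in this thread. As a rough sketch of the approach described (abbreviate each whitespace-separated component only if it starts with a letter), the following uses a simpler letter test than a Unicode-range regex: a character counts as a letter if its upper- and lowercase forms differ. That test is an assumption swapped in for illustration, not necessarily the technique from the linked answer, and it does not recognise letters from scripts without case. Both function names are hypothetical.

```javascript
// Letter test by case mapping: digits and punctuation map to themselves,
// cased letters do not. Caseless scripts are NOT detected.
function isLetter(ch) {
  return ch.toLowerCase() !== ch.toUpperCase();
}

// Hypothetical helper: abbreviate whitespace-separated components that start
// with a letter and leave every other component untouched.
function abbreviateComponents(firstName) {
  return firstName
    .split(/\s+/)
    .filter(Boolean) // drop empty pieces from leading or trailing whitespace
    .map((part) => {
      const first = Array.from(part)[0];
      return isLetter(first) ? first + '.' : part;
    })
    .join(' ');
}

console.log(abbreviateComponents('John Micheal'));
console.log(abbreviateComponents('3rd John'));
```

Because components are split on whitespace only, a hyphenated name such as "John-Micheal" collapses to a single initial here.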

retorquere commented 5 years ago

I'm not sure what Zotero normalizes to during export, but if it doesn't do any normalization,

{ firstName: 'Étienne'.normalize('NFD'), lastName: 'Gilson' }

(which is in NFD) would get the abbreviation E. rather than É.
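The effect is easy to reproduce in node. The sketch below assumes nothing about what Zotero does during export (the thread leaves that open); it only shows that normalizing to NFC before taking the first character keeps the accent on the initial. `initialOf` is a hypothetical name.

```javascript
// Hypothetical helper: take the initial after composing the name to NFC, so
// that a base letter plus combining accent becomes one code point.
function initialOf(firstName) {
  const first = Array.from(firstName.normalize('NFC'))[0];
  return first + '.';
}

const decomposed = 'E\u0301tienne'; // "Étienne" in NFD: E + combining acute

// Without normalization, the first code point is the bare base letter.
console.log(Array.from(decomposed)[0] === 'E');
// With NFC normalization, the initial is the precomposed É (U+00C9).
console.log(initialOf(decomposed) === '\u00C9.');
```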

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.