retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Capitalization on BibLaTeX export of text marked as italics #969

Closed tolot27 closed 6 years ago

tolot27 commented 6 years ago

The report ID is: 9U4TFWT8

My title contains the species name Brucella melitensis which should be printed in italics without case changes (according to the official taxonomic classification and even for English references). Hence, I surrounded it by <i>...</i>, the "official" way to do it. If I export it using the Zotero BibLaTex export, everything is fine, and I get:

@article{bricker2000,
  title = {Characterization of the three ribosomal {RNA} operons {rrnA}, {rrnB}, and {rrnC}, from \textit{Brucella melitensis}},
}

But if I export it using Better-BibLaTeX to keep it updated, the output changes dramatically and the species name can no longer be rendered correctly, regardless of the CSL style (e.g. Harvard).

@article{Bricker2000,
  title = {Characterization of the Three Ribosomal {{RNA}} Operons {{rrnA}}, {{rrnB}}, and {{rrnC}}, from {{{\emph{Brucella}}}}{\emph{ Melitensis}}},
}

Indeed, if I protect it with <span class="nocase">...</span>, the species name gets protected and rendered correctly. But changing thousands of my references is not the way to go and messes up the Zotero list view, too.

Also, there are too many curly braces around the \emph{}. None of them covers the complete species name formerly flanked by <i>...</i>. Every single word gets a {{{\emph{}}}} (three surrounding curly braces). That makes no sense to me.

If I set the language of the reference to en, nothing changes, except that langid = {english}, is added to the output.

Conclusion: I'm very happy with Zotero's BibLaTeX export except that it does not update the exported file automatically. It also protects uppercase words or letters like RNA and rrnA in the above example. BibLaTeX as well as pandoc-citeproc and even the Zotero Citation Preview work very well with the native Zotero BibLaTeX output format, but all three fail to render the Better-BibLaTeX output correctly.

It would be great if there were simply an extension to the Zotero export formats which updates the exported files automatically.

retorquere commented 6 years ago

OK, so the source title is Characterization of the three ribosomal RNA operons rrnA, rrnB, and rrnC, from <i>Brucella melitensis</i>, and you want the export to not capitalize the output, correct?

The reason BBT capitalizes titles is explained here -- if you don't want that, you can suppress it, but it does mean your BibTeX does not express the same intent as it has in Zotero (which is OK if you only use Zotero for BibTeX management). The correct way to suppress capitalization of proper names is, indeed, the nocase markup -- it's not pretty, but that's what Zotero supports. If you want to suppress it per-reference, you'd have to set the language to something not-English. The reference you posted will (by accident) render properly in bib(la)tex but not in Zotero itself (depending on the style).

The reason for the extra braces is explained here. The stock BibTeX exported by Zotero does indeed look simpler but it doesn't always do what it should -- bracing in BibTeX has tons of edge cases (see that last link) that BBT handles more correctly, at the cost of uglier BibTeX code. The brace handling of BBT cannot be suppressed. \textit{Brucella melitensis} excludes Brucella melitensis from biblatex style case-changes, which in your case is what you want because it's a proper name, but in the general case it's wrong, as italics doesn't automatically mean it's a proper name.

tolot27 commented 6 years ago

Yes, I don't want to capitalize the output on export. The BibLaTeX processor should handle this. I just want to get protection of words with uppercase letters like rrnA. In JabRef I can put {} around it. This is done by Zotero and BBT automatically, which is very nice. Unfortunately, BBT also changes the case of certain words to uppercase, which I do not want and which is sometimes incorrect, especially for species names.

Setting the language of an English reference (as most references are) to something non-English is semantically incorrect, and it disables the nice protection of words with uppercase letters. Setting suppressTitleCase to on not only disables the uppercasing of words in the title, it also disables the nice protection of words with uppercase letters. IMHO these are two different features that are enabled/disabled by a single configuration property. Can you split the uppercase protection feature into a separate configuration property, like protectTitleUpcase?

Ah, <i class='nocase'>...</i> works too, which is much better than the additional span. :-)

BTW: If a title starts with a species name, the species name does not get capitalized. <i>Brucella melitensis</i> chromosomes. works very well.

retorquere commented 6 years ago

Yes, I don't want to capitalize the output on export. The BibLaTeX processor should handle this.

You're missing my point. The BibLaTeX processor does handle this, but it does this assuming it is handed a Title Cased title, which is why BBT transforms the title from the Zotero Sentence case to Title Case on export.

I just want to get protection of words with uppercase letters like rrnA. In JabRef I can put {} around it. This is done by Zotero and BBT automatically, which is very nice. Unfortunately, it changes the case of certain words to uppercase, which I do not want and is sometimes incorrect, especially for species names.

Which is why you need the nocase stuff. BBT must convert to Title Case to offer what the biblatex processor expects.

Setting the language of an English reference (as most references are) is semantically incorrect and it disables the nice protection of words with uppercase letters.

That is correct, I wouldn't advise doing so, but you mentioned playing with the language field, so I was just explaining what effect that field has.

Setting suppressTitleCase to on not only disables the uppercasing of words in the title, it also disables the nice protection of words with uppercase letters.

Right, true.

IMHO these are two different features that are enabled/disabled by a single configuration property. Can you split the uppercase protection feature into a separate configuration property, like protectTitleUpcase?

There were reasons not to. I'll have to dig up the discussion where we came to that conclusion.

Ah, <i class='nocase'>...</i> works too, which is much better than the additional span. :-)

Yep, <span> is really just a way to get the class in there that doesn't affect the output otherwise.

BTW: If a title starts with a species name, the species name gets not capitalized. <i>Brucella melitensis</i> chromosomes. works very well.

For me, <i>Brucella melitensis</i>, A note on Hanf numbers gets turned into {\emph{Brucella Melitensis}, {{A}} Note on {{Hanf}} Numbers},

retorquere commented 6 years ago

https://tex.stackexchange.com/questions/10772/bibtex-loses-capitals-when-creating-bbl-file/140071#140071

retorquere commented 6 years ago

https://github.com/retorquere/zotero-better-bibtex/issues/383

retorquere commented 6 years ago

@njbart, I think I had that discussion where we came to this conclusion (brace-protection and title-casing go hand in hand) with you, do you recall the context?

tolot27 commented 6 years ago

For me, <i>Brucella melitensis</i>, A note on Hanf numbers gets turned into {\emph{Brucella Melitensis}, {{A}} Note on {{Hanf}} Numbers},

For me too, but since it is not protected, the BibLaTeX processor lowercases Melitensis, depending on the CSL. This is in line with your linked StackExchange comment. It also states: "regular words must be capitalized, but not enclosed in braces". Unfortunately, that happens by default, which IMHO is not according to the "spec", so I must use nocase.

In my opinion, every word containing an uppercase letter should be protected first. Afterwards, title casing should be performed without any additional uppercase protection ({...}), but only for non-protected words. That would fulfil the sentence stated in the linked StackExchange comment: "You must write the title in the capitalized form, and your bst style either keeps it this way or converts it to lower case."
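To make the two-step order concrete, here is a minimal Python sketch of "protect first, then title-case the rest". It is only an illustration of the proposal, not BBT's code: the function name, the small-word list, and the rule that the first word is not protected merely for its initial capital are all assumptions, and markup such as <i>...</i> is not handled.

```python
import re

# Illustrative list of small words that stay lowercase (an assumption, not BBT's list).
SMALL_WORDS = {"a", "an", "and", "the", "of", "from", "on", "in", "or", "to"}


def protect_then_title_case(title: str) -> str:
    """Hypothetical helper: brace words containing uppercase letters, capitalize the rest."""
    out = []
    for i, token in enumerate(title.split(" ")):
        # Split off trailing punctuation so the braces wrap only the word itself.
        match = re.match(r"^(.*?)([,.;:]*)$", token)
        word, punct = match.group(1), match.group(2)
        # The first word is capitalized anyway, so ignore its initial letter.
        checked = word[1:] if i == 0 else word
        if any(ch.isupper() for ch in checked):
            out.append("{" + word + "}" + punct)              # step 1: protect
        elif i > 0 and word.lower() in SMALL_WORDS:
            out.append(word + punct)                          # small words stay lowercase
        else:
            out.append(word[:1].upper() + word[1:] + punct)   # step 2: title-case
    return " ".join(out)


print(protect_then_title_case(
    "Characterization of the three ribosomal RNA operons rrnA, rrnB, "
    "and rrnC, from Brucella melitensis"
))
# Characterization of the Three Ribosomal {RNA} Operons {rrnA}, {rrnB}, and {rrnC}, from {Brucella} Melitensis
```

With this order, Brucella ends up braced and Melitensis is capitalized but unbraced, so a style that lowercases can still bring it back to melitensis.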

According to my previous suggestion, <i>Brucella melitensis</i> would be converted to \emph{{Brucella} Melitensis} regardless of the position in the title. This fulfils title case, and Brucella is protected against lowercasing. This works well with many more citation styles out of the box without using nocase, except for citation styles which capitalize again, like the American Anthropological Association style. In that case, neither \emph{Brucella melitensis}, {\emph{Brucella melitensis}}, {{\emph{Brucella melitensis}}}, {{{\emph{Brucella melitensis}}}} nor \emph{{Brucella melitensis}} works/is correct, only \emph{Brucella {melitensis}}. Only in this case would nocase be required, and it should convert <i>Brucella melitensis</i> to \emph{{Brucella} {melitensis}} by protecting individual lowercased and uppercased words with {...}. Then it works for all citation styles, regardless of whether they change capitalization, perform uppercasing or perform lowercasing.

Notably, my observations/conclusions regarding the protection/recapitalization/lowercasing of \emph{} differ from your observations mentioned in Why the double braces?. It looks like {...} works only inside \emph{...}.

retorquere commented 6 years ago

(I think the discussion with @njbart happened here: https://github.com/retorquere/zotero-better-bibtex/issues/541)

For me too, but since it is not protected, the BibLaTeX processor lowercases Melitensis - depending on the ~CSL~BST. This is according to your linked stackexchange comment. There is also stated: "regular words must be capitalized, but not enclosed in braces". Unfortunately, that happens per default

Why unfortunately? It seems desirable that BBT would produce references in the way that biblatex expects them.

and is IMHO not according to the "spec" and I must use nocase.

What spec though?

In my opinion, every word containing an uppercase letter should be protected first.

and BBT does this.

Afterwards, title casing should be performed without any additional uppercase protection ({...}) but only for non-protected words.

and BBT does this, too.

That would fulfil the sentence stated in the linked stackexchange comment: "You must write the title in the capitalized form, and your bst style either keeps it this way or converts it to lower case."

This is what BBT does as far as I can tell.

According to my previous suggestion, Brucella melitensis would be converted to \emph{{Brucella} Melitensis} regardless of the position in the title.

But \emph{{Brucella} Melitensis} would prevent biblatex from downcasing Melitensis -- the braces around Brucella don't do anything here. In the case of a proper name this may be a desired side effect, but it is not in general. This is why I'm generating (the admittedly ugly-looking) {{{\emph{Brucella}}}}{\emph{ Melitensis}}:

This fulfils title case and Brucella is protected against lower casing.

but also "protects" Melitensis, which is not desired in the general case. I find that behavior from biblatex strange myself, but that behavior is fixed.

This works well with many more citation styles out of the box without using nocase, except for citation styles which capitalize again like the American Anthropological Association style. In that case, neither \emph{Brucella melitensis}, {\emph{Brucella melitensis}}, {{\emph{Brucella melitensis}}}, {{{\emph{Brucella melitensis}}}} or \emph{{Brucella melitensis}} works/is correct, only \emph{Brucella {melitensis}}. Only in this case, nocase would be required and should convert Brucella melitensis to \emph{{Brucella} {melitensis}} by protecting individual lowercased and uppercased words with {...}. Then it works for all citation styles independently, regardless if they change capitalization, perform uppercasing or perform lowercasing.

This is not my understanding of how this works. Granted, I'm not a biblatex expert, but @njbart is, and from what he's told me, only one of the versions you list here is sort of equivalent to <i>Brucella melitensis</i>. They'd be, roughly (I think):

Notably, my observations/conclusions regarding the protection/recapitalization/lowercasing of \emph{} differ from your observations mentioned in Why the double braces?. It looks like {...} works only inside \emph{...}.

This is not what I've been led to believe.

tolot27 commented 6 years ago

This is not what I've been led to believe.

But that is what I observe. I'll try to create an MWE and test whether it depends on pdflatex, xelatex, biblatex with or without backend=biber, and pandoc-citeproc.

retorquere commented 6 years ago

@njbart, I've overwritten the MWE we had for this. Do you still have it/can you reconstruct it?

retorquere commented 6 years ago

@tolot27 pandoc-citeproc is interesting information, but if pandoc-citeproc does it one way and an actual LaTeX processor another, the LaTeX behavior is going to be counted as authoritative.

Right now I can't find any way to get sentence case at all in an MWE.

tolot27 commented 6 years ago

I think I figured it out. The main problem seems to be the conversion of BibLaTeX using CSL files, i.e. with pandoc-citeproc, and even Zotero+Microsoft Word without any BibLaTeX. Many CSL files like MLA or APA contain text-case attributes which change the capitalization. The current attempts of BBT to enforce title case thwart this. In plain LaTeX environments, BibLaTeX style packages like MLA or Chicago do not enforce title case. They expect it, as stated in the linked StackExchange comment: "You must write the title in the capitalized form, and your bst style either keeps it this way or converts it to lower case."

Here is a minimal working example titleCase.bib file (produced by BBT based on an entry containing Brucella melitensis in the title and no language set) with different combinations of casing and bracing added. The first four combinations are the most important ones.

@article{titleCase,
  title = {Title Casing of species names:\newline
  unprotected emph: \emph{Brucella melitensis}, \newline
  {BBT} default: {{{\emph{Brucella}}}}{\emph{ Melitensis}}\newline
  triple-protected emph (={BBT}+nocase): {{{\emph{Brucella melitensis}}}},\newline
  unprotected emph+protected {Brucella} and {melitensis}: \emph{{Brucella} {melitensis}},\newline\newline
  protected emph: {\emph{Brucella melitensis}},\newline
  double-protected emph: {{\emph{Brucella melitensis}}},\newline
  titleCase emph: \emph{Brucella Melitensis},\newline
  protected titleCase emph: {\emph{Brucella Melitensis}},\newline
  double-protected titleCase emph: {{\emph{Brucella Melitensis}}},\newline
  triple-protected titleCase emph: {{{\emph{Brucella Melitensis}}}},\newline
  unprotected emph+protected Brucella + titleCase: \emph{{Brucella} Melitensis},\newline
  {unprotected emph+protected melitensis}: \emph{Brucella {melitensis}},\newline
  },
  journaltitle = {Minimal Working Examples},
  date = {2018-05-08},
  author = {Walter, Mathias C.}
}

An MWE titleCase.tex:

\documentclass{article}
%\usepackage[style=apa,backend=biber]{biblatex}
\usepackage[style=mla,backend=biber]{biblatex}
\addbibresource{titleCase.bib}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

An MWE titleCase.md:

---
bibliography: titleCase.bib
#csl: apa.csl
csl: modern-language-association.csl
nocite: |
  @titleCase
...

In the examples, MLA is set. Creating a PDF with pdflatex titleCase && biber titleCase && pdflatex titleCase && pdflatex titleCase produces:

image

Creating a PDF using pandoc -s -t latex -f markdown -F pandoc-citeproc --pdf-engine=xelatex titleCase.md -o titleCase_md_mla.pdf produces:

image

Unfortunately, CSL does not seem to support \newline. Nevertheless, it can be seen that only my proposed \emph{{Brucella} {melitensis}} is rendered correctly, if a CSL is applied.

After switching over to the APA style by (un)commenting the appropriate lines in the tex and md files, pdflatex produces:

image

And pandoc produces:

image

Now BBT+nocase (<i class='nocase'>Brucella melitensis</i>) and my proposed conversion without nocase both get rendered correctly using pdflatex and pandoc. A second advantage of my proposal is that it is unnecessary to add nocase to lots of entries.

It would be interesting to know whether there are cases which really require nocase. Maybe only if a letter/word should be kept in lower case and CSL in combination with MLA-like styles is used. But currently, I don't know if BBT would convert a <span class='nocase'>foo</span> into {foo}.

tolot27 commented 6 years ago

BTW: The Zotero MS Word AddIn neither supports <i class='nocase'>...</i> nor <span class='nocase'>...</span>, but this is off-topic here.

retorquere commented 6 years ago

I think we're getting our terminology crossed here. When I say "protect", I mean, "makes it so that biblatex will not change the case of this snippet". Given that, I think the first sample already shows the problem in apa; given that you say

unprotected emph: \emph{Brucella melitensis}

the expected outcome would have been

unprotected emph: brucella melitensis

because unprotected means that whatever is in there should have gotten sentence-cased. Since it doesn't (as you show, it renders to "unprotected emph: Brucella melitensis"), the \emph actually does "protect" its content. Here's my interpretation:

\documentclass{article}

\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}

  @article{c0,
    title = {A title with Brucella melitensis (expected: lowercase)},
    journaltitle = {MWE},
    date = {2000-05-08},
    author = {Walter00, M.}
  }

  @article{c1,
    title = {A title with Brucella Melitensis  (expected: lowercase)},
    journaltitle = {MWE},
    date = {2001-05-08},
    author = {Walter01, M.}
  }

  @article{c2,
    title = {bare emph: \emph{Brucella melitensis} (note: brucella not lowercased, as a bare emph 'protects' its contents )},
    journaltitle = {MWE},
    date = {2002-05-08},
    author = {Walter02, M.}
  }

  @article{c3,
    title = {{BBT} default: {{{\emph{Brucella}}}}{\emph{ Melitensis}} (note: Brucella remains uppercased, but Melitensis is lowercased, as it has become 'unprotected')},
    journaltitle = {MWE},
    date = {2003-05-08},
    author = {Walter03, M.}
  }

  @article{c4,
    title = {triple-protected emph (={BBT}+nocase): {{{\emph{Brucella melitensis}}}} (this just 'protects' everything inside, so brucella is not lowercased)},
    journaltitle = {MWE},
    date = {2004-05-08},
    author = {Walter04, M.}
  }

  @article{c5,
    title = {unprotected emph+protected {Brucella} and {melitensis}: \emph{{Brucella} {melitensis}} (the inner braces don't do anything as the bare emph 'protects' everything inside)},
    journaltitle = {MWE},
    date = {2005-05-08},
    author = {Walter05, M.}
  }

  @article{c6,
    title = {\emph{un}protected emph: {\emph{Brucella melitensis}} (the outer braces \emph{un}protect from case meddling, so everything is now lowercase)},
    journaltitle = {MWE},
    date = {2006-05-08},
    author = {Walter06, M.}
  }

\end{filecontents}

\usepackage[style=apa,backend=biber]{biblatex}
%\usepackage[style=mla,backend=biber]{biblatex}
\addbibresource{\jobname.bib}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

retorquere commented 6 years ago

There is a different problem though, and I think there you're right; if I have Zotero generate an APA title for There's something about <i>Brucella melitensis</i>, A note on Hanf numbers, I get

There’s something about Brucella melitensis, A note on Hanf numbers.

In order for that to render to the same through latex, I would have to produce

There’s Something about \emph{Brucella melitensis}, A Note on {{Hanf}} numbers.

(or the equivalent There’s Something about {{\emph{Brucella melitensis}}}, A Note on {{Hanf}} numbers.)

This is where the different baseline assumptions in biblatex and citeproc come to the fore. Citeproc expects its source format to be sentence-cased, so for a sentence-case style, it simply doesn't do anything with the text; it doesn't actively sentence-case it. If a title-case style is chosen, it (incorrectly from your POV) uppercases Melitensis.

The proper solution to this is still to explicitly case-protect the proper name Brucella melitensis. That way, it will work as required both in Zotero and in biblatex. Because of the different paths (sentence casing is do-nothing in Zotero, and it's active-from-titlecased in biblatex), I don't really see how I can fix this without formatting hints. Unfortunate, but unless either biblatex or citeproc changes their default assumption (unlikely), this looks like it's here to stay.

WRT 'nocase', Zotero only supports <span class="nocase"> (double-quotes) AFAICT. That BBT accepts more is an intended quirk of its markup parser; I find single-quotes more pleasing to look at, and <i class=''> is valid HTML that I think should be treated as equivalent to <i><span class="">, but that's not up to me for Zotero/citeproc.

WRT differences between pandoc and biblatex: I am taking what biblatex renders as the correct output, so if there are differences in rendering, I'd say that's a pandoc bug. But for pandoc, I'd recommend my CSL or YAML exports, which bypass all these case-meddling problems.

tolot27 commented 6 years ago

the \emph actually does "protect" its content

It protects its content only in native BibLaTeX/Biber but not if the produced BibLaTeX file is further used with CSL.

Also, your assumption of lowercasing is not correct, as you can see if you render your interpretation using the apa vs. mla style. Ah, I see, maybe your misinterpretation comes from ShareLatex.com. It does not fully compile because ShareLatex (and also Overleaf) uses an old version of BibLaTeX (3.4 instead of 3.11) and Biber (2.5 instead of 2.11). I compile it locally using MiKTeX but can try it with TeX Live under Ubuntu too, if you like.

Anyway, both solutions {BBT}+nocase: {{{\emph{Brucella melitensis}}}} and \emph{{Brucella} {melitensis}} work with both BibLaTeX styles (apa, mla), but only \emph{{Brucella} {melitensis}} works with CSL (because of the text-case attribute). This has nothing to do with pandoc-citeproc. Any CSL processor will apply text-case if provided.

Regarding your last comment, There’s Something about \emph{Brucella melitensis}, A Note on {{Hanf}} numbers. or its equivalent There’s Something about {{\emph{Brucella melitensis}}}, A Note on {{Hanf}} numbers.: LaTeX will render this correctly, but a CSL processor like pandoc-citeproc will not.

I assume that the text-case processor is word-based and changes the capitalization of any word which is not protected. In \emph{{Brucella} {melitensis}} every word is protected inside \emph{}, but with \emph{Brucella melitensis} no word is protected inside \emph{}.
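This assumption can be written down as a toy model in Python. It is only a sketch of the word-based behaviour I am guessing at, not the code of any real CSL processor; the function name and the two modes are made up for illustration, and the \emph{} wrapper is left out so that only the per-word braces matter.

```python
def apply_text_case(title: str, mode: str) -> str:
    """Hypothetical word-based case changer: words starting with '{' are left untouched."""
    out = []
    for i, word in enumerate(title.split(" ")):
        if word.startswith("{"):
            out.append(word)                          # protected: keep as typed
        elif mode == "title":
            out.append(word[:1].upper() + word[1:])   # capitalize every unprotected word
        elif mode == "sentence" and i > 0:
            out.append(word.lower())                  # lowercase all but the first word
        else:
            out.append(word)
    return " ".join(out)


# Unprotected words are recapitalized by a title-casing style ...
print(apply_text_case("Brucella melitensis chromosomes", "title"))
# Brucella Melitensis Chromosomes

# ... while individually braced words survive unchanged.
print(apply_text_case("{Brucella} {melitensis} chromosomes", "title"))
# {Brucella} {melitensis} Chromosomes
```

Under this model, only per-word braces keep melitensis lowercase in a title-casing style, which matches what I see with the MLA CSL file.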

What prevents you from converting <i>Brucella melitensis</i> into \emph{{Brucella} {melitensis}}? There are no double braces and no nocase is required (great usability improvement). And it has a greater compatibility with both LaTeX and CSL.

tolot27 commented 6 years ago

Do you think it is a good idea to raise a citeproc issue for text-case? It should respect that \emph{...} protects its content for capitalization changes.

retorquere commented 6 years ago

It protects its content only in native BibLaTeX/Biber but not if the produced BibLaTeX file is further used with CSL.

But biblatex/biber is the primary use-case for BBT. I don't actually know what pandoc-citeproc does, but I would assume it changes the bib file to CSL internally first and then passes that through citeproc, in which case:

  1. I would recommend not doing that and using BBT-CSL or BBT-YAML instead, as it's the native format it speaks, and it has the same base assumptions about casing as Zotero does so a lot of these problems just vanish, and
  2. If pandoc-citeproc renders something different than biblatex/biber, I would say that's an error in pandoc-citeproc (or any other CSL-based toolchain).

Also, your assumption of lower casing is not correct as you can see if you render your interpretation using apa vs mla style.

I did render both, but the problems showed when I picked apa

Ah, I see, maybe your misinterpretation comes from ShareLatex.com. It does not fully compile because ShareLatex (and also Overleaf) uses an old version of BibLaTeX (3.4 instead of 3.11) and Biber (2.5 instead of 2.11).

That is unfortunate. I do use these services. You're saying in newer versions, a bare \emph no longer case-protects? @plk, could you shed some light on how \emph{...} interplays with capitalization by styles in the newer biblatex/biber?

Anyway, both solutions {BBT}+nocase: {{{\emph{Brucella melitensis}}}} and \emph{{Brucella} {melitensis}} work with both BibLaTeX styles (apa, mla)

Well sure, and it could be done even simpler with \emph{Brucella melitensis}, because they all fully protect what's inside, but it's not actually what <i>Brucella melitensis</i> means; there is no expressed intent for case-protection for melitensis in that (see also below on the meaning Zotero attaches to <i>).

but only \emph{{Brucella} {melitensis}} works with CSL (because of the text-case attribute). This has nothing to do with pandoc-citeproc. Any CSL processor will apply text-case if provided.

Be that as it may, no CSL processor I know of applies CSL styles to biblatex references; they are applied to CSL objects. CSL styles reference fields that biblatex simply doesn't offer. How the CSL toolchain translates biblatex into CSL objects has a major influence here. This translation is complex, and I can fully understand why shortcuts and heuristics are used over completeness; it can never be complete in the face of any and all valid biblatex input. CSL is a simple macro expansion system, whereas biblatex inherits from TeX a Turing-complete language that can (re!)define commands mid-flight.

Anyhow, if I can export something both can use, then great, I'll do that, but if I have to choose (and so far, it looks to me like I must choose to preserve intent, see below), I will always choose biblatex/biber.

Regarding to your last comment of There’s Something about \emph{Brucella melitensis}, A Note on {{Hanf}} numbers. or their equivalent There’s Something about {{\emph{Brucella melitensis}}}, A Note on {{Hanf}} numbers., LaTeX will render this correctly but not a CSL processor like pandoc-citeproc.

But CSL processors don't render biblatex. They render CSL objects, usually expressed as CSL-JSON or CSL-YAML on disk. AFAICT, CSL processors that "use" biblatex first convert to CSL objects, and I have no idea what happens in that translation.

I assume that the text-case processor is word-based and changes the capitalization of any word which is not protected.

Loosely, yes. @fbennett would know more about this. I use the title-caser from citeproc-js, so BBT capitalizes as would Zotero.

In \emph{{Brucella} {melitensis}} every word is protected inside \emph{} but with \emph{Brucella melitensis} no word is protected inside \emph{}.

For the CSL toolchains perhaps. In the biblatex toolchains they are protected, and I consider their behavior authoritative. Any deviation I would consider a bug, even if I do find the biblatex behavior strange.

What prevents you from converting <i>Brucella melitensis</i> into \emph{{Brucella} {melitensis}}?

That the former does not express any intent on case handling, and the latter does. I'd have no basis to decide to brace melitensis (and that's aside the fact that the inner braces don't do anything, as the bare emph already does the work).

There are no double braces and no nocase is required (great usability improvement). And it has a greater compatibility with both LaTeX and CSL.

But it changes the meaning of the markup while it does so. Rendering <i>Brucella melitensis</i> into \emph{{Brucella} {melitensis}} would mean that melitensis would never be uppercased, where <i>Brucella melitensis</i> clearly does get uppercased when you generate a Chicago bibliography in Zotero (right-click the reference and select "create bibliography from item", and paste that into a word processor). That indicates that in Zotero, <i> does not automatically express a nocase intent, so I don't automatically want to generate one of my own. If I had wanted to do that I could have much easier just wrapped the whole title with an extra brace and be done with it. But then I'd also have to assume that you have your titles in Zotero in Title Case, and Zotero(/citeproc) expects them to be Sentence case.

Do you think it is a good idea to raise a citeproc issue for text-case? It should respect that \emph{...} protects its content for capitalization changes.

I think it's a bug, yes. But if they're going there, I would also argue that they'd have to respect that {\emph{...}} should unprotect its content for caps changes. This gets complicated pretty fast.

tolot27 commented 6 years ago

Maybe we should look at it from the other side, the LaTeX side, and keep BBT and CSL out of it for a moment. If we accept that \emph{Brucella melitensis} protects its content from case changes, why change it into {{{\emph{Brucella melitensis}}}} (1st case) or even {{{\emph{Brucella}}}}{\emph{ Melitensis}} (2nd case)? No bib editor or processor does this, and no human editor will do so. Also, BibLaTeX makes no assumptions about title or sentence case inside \emph{}. It just treats the content as it is. The second case has even more disadvantages, because {\emph{ Melitensis}} now stays capitalized in MLA styles, which is not intended. Furthermore, splitting the two words inside one \emph{} into two different \emph{} affects word spacing (microtypography) and line breaking. Think of a protected space between the two words, the typical case for the abbreviated form B. melitensis, which should not break at line ends. {{{\emph{B.}}}}{\emph{ Melitensis}} can always break, regardless of a non-breaking space before Melitensis. Considering these problems with the 2nd case, BBT should never produce it, and certainly not by default, IMHO.

Semantically, \emph means emphasizing something, whereas \textit just means formatting some words in italics. The default visual output of \emph is also italics, but in nested environments, like inside \textit, \emph changes its output to non-italics and therefore keeps the text emphasized. For bibliography processing, \emph is preferred over \textit because it keeps the emphasis also for book styles where the complete title is typically in italics. Nevertheless, \textit{} also protects its content from case changes and also makes no assumption of title case.

If we now consider both worlds, LaTeX and CSL, or even the subset of markup commands which seem to be borrowed from HTML, we have <i> for text formatting and <em> for emphasizing. Neither makes any case assumptions, and neither changes case itself nor protects from case changes. See Usage Notes for details in HTML. A 1:1 translation from HTML to LaTeX or vice versa would be <i> <-> \textit and <em> <-> \emph. Unfortunately, Zotero and CSL do not support the <em> tag. Hence, \emph gets translated (during import) into <i> and vice versa. Therefore, <i> gets the semantics of \emph with all of its properties, most importantly the case protection and emphasis. That's my interpretation so far. If I emphasize something in a text, my intent is that it should be rendered/printed as I typed it, regardless of whether I use plain LaTeX/BibLaTeX or Zotero to manage references. If something changes the basic characteristics of \emph in the middle, it gets confusing.

You said:

But biblatex/biber is the primary use-case for BBT. I am taking what biblatex renders as the correct output, so if there's differences in rendering, I'd say that's a pandoc bug. In the biblatex toolchains they are protected, and I consider their behavior authoritative.

Hence, I suggest following the biblatex/biber assumptions/constraints for \emph and just keeping the content of <i>...</i> as it is, as biblatex/biber also do with the content of \emph{...}. That will not introduce any new problems, as the current behavior does, and does not require any case changes or protection. Hence, do not apply title casing to <i>...</i>; instead simply convert it into \emph{...} as you have already mentioned. That is the simplest use case, requires no changes in Zotero and works for any use case except for pandoc-citeproc+mla.csl (which is probably a bug in pandoc-citeproc).

I use the title-caser from citeproc-js, so BBT capitalizes as would Zotero.

No! The major difference between BBT and Zotero is that Zotero only capitalizes for uppercase styles like MLA, regardless of class='nocase' or class="nocase" (single or double quotation marks), otherwise not. BBT always capitalizes.

BTW2: I contacted Overleaf and ShareLaTeX and requested an update for BibLaTeX and Biber.

tolot27 commented 6 years ago

I got a response from Overleaf:

Overleaf v2 runs the version of TeX Live 2017 as distributed in Ubuntu stable, and has biber 2.7 + biblatex 3.7. The \DeclareLanguageMapping also works in Overleaf v2: https://v2.overleaf.com/read/qqvrnsnypdzp

Hence, a MWE would look like:

\documentclass[british]{article}
\usepackage{babel}
\usepackage[style=apa,backend=biber,date=short]{biblatex}
%\usepackage[style=mla,backend=biber]{biblatex}
\DeclareLanguageMapping{british}{british-apa}
\addbibresource{\jobname.bib}
\begin{filecontents}{\jobname.bib}
  @article{c6,
    title = {\emph{un}protected emph: {\emph{Brucella melitensis}} (the outer braces \emph{un}protect from case meddling, so everything is now lowercase)},
    journaltitle = {MWE},
    date = {2006-05-08},
    author = {Walter06, M.}
  }
\end{filecontents}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

Unfortunately, c6 still renders differently with the most recent versions, BibLaTeX 3.11 and Biber 2.11:

[image: rendering of entry c6]

But that does not matter if you consider my previous comment.

retorquere commented 6 years ago

Maybe we should look from the other side, the LaTeX side and keep BBT and CSL out for a moment.

Much better, yes.

If we accept that \emph{Brucella melitensis} protects its content from case changes, why change it into {{{\emph{Brucella melitensis}}}} (1st case) or even {{{\emph{Brucella}}}}{\emph{ Melitensis}} (2nd case)?

BBT would translate <i class="nocase">Brucella Melitensis</i> to {{{\emph{Brucella Melitensis}}}}. This is indeed equivalent to the simpler \emph{Brucella Melitensis}, but BBT's transformation uses only a very simple context manager to manage protection-unprotection, and the extra braces are an artifact of that. It means "Emphasize and case-protect Brucella Melitensis".

{{{\emph{Brucella}}}}{\emph{ Melitensis}} however means something entirely different from {{{\emph{Brucella Melitensis}}}}. It means "Emphasize and case-protect Brucella, and then emphasize but don't case-protect Melitensis". The Zotero equivalent would be <i><span class="nocase">Brucella</span> melitensis</i>, but the inner span is inferred from the capital letter in Brucella.

If I were to have the title This is <i>Bart's paper</i>!, the output should not be This is \emph{Bart's paper}! or This is \emph{Bart's Paper}! but This is {\emph{{Bart's} Paper}}! (with Paper title-cased, but not case-protected, because there is no indication that paper needs to be case-protected). In fact even Bart's is a little iffy here, but the assumption is that if you have a capital letter in a sentence-cased word, you want it there regardless of the style at play, so that word gets case protection.
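
The rule described above can be sketched roughly as follows. This is a simplified, hypothetical illustration and not BBT's actual code: `exportTitle` is an invented name, the real export relies on the citeproc-js title-caser (which is considerably more careful about which words to capitalize), and markup handling is left out entirely. The sketch only shows the idea that a word carrying a capital letter inside a sentence-cased title gets brace protection, while every other word is title-cased and stays open to case changes by the style.

```javascript
// Hypothetical sketch of the heuristic, NOT BBT's implementation.
function exportTitle(sentenceCasedTitle) {
  return sentenceCasedTitle
    .split(' ')
    .map((word, index) => {
      // A capital letter after the first word is assumed to be intentional,
      // so the word is wrapped in braces to protect it from case changes.
      if (index > 0 && /[A-Z]/.test(word)) return `{${word}}`;
      // Everything else is title-cased and left unprotected.
      return word.charAt(0).toUpperCase() + word.slice(1);
    })
    .join(' ');
}

console.log(exportTitle("This is Bart's paper"));
```

The first word is skipped by the capital-letter check because sentence case capitalizes it anyway, so its capital carries no information about intent.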

No bib editor or processor does this and no human editor will do so.

BBT is of course not human, and needs to decide algorithmically how to transform the Zotero fields into BibLaTeX. The output is not always optimally pretty.

Also, BibLaTeX makes no assumptions of text or sentence case inside \emph{}. It just treats it as it is.

But that is the whole point. \emph{whatever} is not the equivalent of <i>whatever</i> but of <i class="nocase">whatever</i>. nocase means "treat it as it is". The lack of it means "don't treat it as it is". biblatex and Zotero have different semantics here.
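
A minimal sketch of that mapping, under the assumption (taken from the c6 entry in the MWE earlier in this thread) that a bare \emph{...} is case-protected in biblatex while an extra pair of braces around it removes the protection. `italicsToBiblatex` is a made-up name, title-casing is deliberately omitted, and BBT's real output carries more braces than this:

```javascript
// Hypothetical illustration of the two meanings, not BBT's translator.
function italicsToBiblatex(title) {
  return title
    // <i class="nocase"> means "treat it as it is": a bare \emph{} does that.
    .replace(/<i class="nocase">(.*?)<\/i>/g, '\\emph{$1}')
    // A plain <i> carries no case protection, so the \emph is wrapped in
    // an extra pair of braces, leaving the style free to change its case.
    .replace(/<i>(.*?)<\/i>/g, '{\\emph{$1}}');
}

console.log(italicsToBiblatex('from <i class="nocase">Brucella melitensis</i>'));
console.log(italicsToBiblatex("This is <i>Bart's paper</i>!"));
```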

The second case has even more disadvantages because {\emph{ Melitensis}} now stays capitalized in MLA styles, which is not intended.

Wait what? MLA would case-change Melitensis but does not case-change {\emph{ Melitensis}}?

Furthermore, the splitting of the two words inside one \emph{} into two different \emph{} affects word spacing (microtyping) and line breaking.

Yes, this is true, and unfortunate, but I see no other way that I can do this given the current translator. {\emph{{Bart's} Paper}} would be better but is simply very hard to do; This is \emph{Bart's paper}! or This is \emph{Bart's Paper}! are just wrong translations.

Semantically, \emph means emphasize something whereas \textit means just formatting some words in italics. The default visual output of \emph is also italics, but in nested environments like inside \textit, \emph switches its output to non-italics and therefore keeps the text emphasized. For bibliography processing \emph is preferred over \textit because it preserves the emphasis even in book styles where the complete title is typically in italics. Nevertheless, \textit{} also protects its content from case changes and also makes no assumption of title case.

OK so at this point you've lost me. What are you trying to explain here? I know about emph and textit, but given that they exhibit the same behavior, what new information does this disclose that helps the conversation?

If we now consider both worlds, LaTeX and CSL

The markup that Zotero uses is not fully fledged HTML; it is restricted, and in practice citeproc-js (which Zotero uses) interprets <i> as <em>, and I need to live by its rules. How HTML deals with casing (it doesn't) is not relevant here. How citeproc-js deals with casing is relevant because BBT intends to output biblatex that encapsulates intent as Zotero/citeproc interprets it (and then look back at the Chicago sample, where Melitensis is not case-protected).

I also have very limited interest in CSL. If you want to use CSL, use BBT-CSL-JSON or BBT-CSL-YAML. BBT-biblatex means to output biblatex that represents to the best of its abilities the intent as it was put into Zotero. If that doesn't work in CSL pipelines (which must parse the biblatex back into CSL objects by a by-necessity lossy procedure), that's a problem for those CSL pipelines, not BBT.

If I emphasize something in a text, my intent is that it should be rendered/printed as I typed it, regardless of whether I use plain LaTeX/BibLaTeX or Zotero to manage references.

But you are using Zotero. And Zotero simply does not do what you want here. Zotero does not render titles as you typed them merely because you emphasized it. The behavior is different from biblatex, but then, Zotero/citeproc-js makes a lot of different assumptions about how references should be rendered. Should the source be sentence or title-cased, does emph do case-protection yes or no, Zotero does this differently than biblatex. Zotero is not primarily a biblatex management system, it gets to make its own decisions. If you can convince Zotero/citeproc-js to do this differently I'll be happy to follow suit.

Hence, I suggest following the biblatex/biber assumptions/constraints for \emph and just keeping the content of <i>...</i> as it is, as biblatex/biber also do with the content of \emph{...}.

There is a semantic difference here. Yes, <i> means something different than \emph{}, which is exactly why I am spitting out that ugly biblatex. I don't like it any better than you do, but until you can convince Zotero/citeproc-js to change the semantics of their use of <i>, this is just the way things are.

That will not introduce any new problems

It does, see the paper example. The Melitensis example is misleading because we would both want it to be treated as a case-protected proper name, and the part that has an uppercase letter does get protection. Anywhere else, \emph{...} yields the wrong output. And even in Zotero you get the wrong output in Chicago if you don't apply explicit nocase protection.

No! The major difference between BBT and Zotero is that Zotero only capitalizes for uppercase styles like MLA, regardless of class='nocase' or class="nocase" (single or double quotation marks), otherwise not. BBT always capitalizes.

Yes, of course, because as I explained before, I need to transform to Title Case for biblatex to compensate for the sentence-case assumption in Zotero. BBT is not in the business of rendering references, but in the business of outputting biblatex code that will render like it does in Zotero itself, or at least as close to it as is possible. That also means that I must make it possible to case-meddle paper. \emph{Bart's paper} prevents that.

I've just run my MWE through Overleaf v2 on both mla and apa, and given how <i> works in Zotero, the output is consistent with what I'd expect. My MWE is here, it's public, feel free to play, but please consider when setting your expectations that <i> does not imply case protection in Zotero as emph does in biblatex.

Perhaps we should take this to a chat environment like gitter. I feel like we're talking past each other here, and it can only escalate this way. I've also separately asked Nick Bart to join, he will be better able to translate your concerns to me -- he has experience with both how biblatex works/ought to work, and how Zotero works.

njbart commented 6 years ago

I fully agree with you, @retorquere. As to pandoc-citeproc, there is indeed an open ticket on what I think is the root cause of what @tolot27 is seeing when using pandoc in combination with a biblatex database.

retorquere commented 6 years ago

What about the potential kerning artifacts?

retorquere commented 6 years ago

Also I'm really trying to keep pandoc-citeproc out of this until we've established that BBT outputs the right things for biblatex/biber. The issue under contest is what <i> should output (right?)

tolot27 commented 6 years ago

Also I'm really trying to keep pandoc-citeproc out of this until we've established that BBT outputs the right things for BibLaTeX/biber. The issue under contest is what <i> should output (right?)

Mostly yes, but more precisely its general meaning. If a bib file containing \emph{} gets imported, <i>...</i> gets produced and not <i class="nocase">...</i>. Hence, a roundtrip or a simple import of your MWE bib into Zotero produces for instance <i>Bart's</i><i> Paper</i> out of {{{\emph{Bart's}}}}{\emph{ Paper}}. IMHO, as long as Zotero does not fully support nocase during input, preview and output with its Add-Ons, I would prefer to interpret <i>...</i> as \emph{}, at least with a configuration option. At that point, no pandoc-citeproc interplays.

I also thought of an option which keeps <i>...</i> out of title casing and interprets it semantically identical to \emph{}.

BTW: I'm finally using Zotero for my reference management and export BibLaTeX to use it with pandoc to create docx output for reviewers (citeproc is necessary) and pandoc with BibLaTeX/Biber to get the final output for printing. Your suggestion to export to YAML works if I need only docx and a draft latex pdf.

BTW2: I signed up to gitter; I really like BBT and just want to support it further.

retorquere commented 6 years ago

This is a separate issue though. Importing bibtex is not the primary way to get references into Zotero, so if we see <i> that's not at all a given it got there by emph. And the issue we're discussing pertains to export what's in Zotero, not round-tripping.

BibTeX doesn't round-trip even close to cleanly in Zotero, and even BBT's parser (which is better than the Zotero parser) gets this wrong. I also translate to plain italics because the un-protection rules require a parser much more complicated than mine.

retorquere commented 6 years ago

(I'm not regularly on gitter, I just go there when I know there's a live debate going)

retorquere commented 6 years ago

The (from my pov) undesired semantic difference aside, interpreting <i> as emph means that non-protected emph becomes structurally impossible since zotero doesn't have a "docase" facility to override it.

retorquere commented 6 years ago

You could make the case to zotero to change the semantics of <i> and then I would follow suit.

But for your docx/latex needs, why not export the same references to yaml for docx, and bib for latex? That's one of the benefits of having the refs in Zotero. Otherwise why not use a bibtex-native ref mgr? Not that jabref is without issues.

tolot27 commented 6 years ago

That also means that I must make it possible to case-meddle paper in <i>Bart's paper</i>! I partially understand this requirement, but it is only necessary for styles like MLA.

Another point: As far as I know, every native bib reference manager imports italic text (written as <i>...</i>) from sources like PubMed as \emph{..} without any additional case protection/meddling. If someone needs to enforce a different case, they have to change it manually, but mostly it fits all needs. Only BBT behaves differently at that point. I really like BBT's sophisticated "title casing" except for the <i> case, because I have to change every entry manually, which I don't have to do with other reference managers like JabRef. I would love to see a text transform feature which changes <i> to <i class="nocase"> for all selected entries in Zotero.

BTW: The current title case text transformation feature of Zotero (?) changes <i>Bart's paper</i> to <i>bart's Paper</i>. Applying it repeatedly does not change anything.

tolot27 commented 6 years ago

But for your docx/latex needs, why not export the same references to yaml for docx, and bib for latex? That's one of the benefits of having the refs in Zotero. Otherwise why not use a bibtex-native ref mgr? Not that jabref is without issues.

That's what I currently do, exporting both YAML and BibLaTeX and thanks to the autoupdate, it works really well. I tried so many reference managers and also contributed to JabRef with source code but finally decided to choose Zotero in combination with ZotFile and BBT. :smiley:

retorquere commented 6 years ago

I really want to separate import from export here. Absolutely nothing round-trips cleanly through zotero (my debug translator comes close but there are limitations in Zotero that make 100% round-trip impossible), and if other ref managers also get it wrong that's probably because the biber behavior is pretty strange. What's important is how what is in Zotero is treated on reference production. If you get zotero to change its stance on how <i> is interpreted I will follow suit.

Jabref is not afflicted by this because it really doesn't do anything at all with the references it manages. It's happy to export really broken bibtex - stick a bare % in a title to see what I mean - BBT has workarounds in place so it can make the most of the broken bibtex it is routinely offered by native managers. On the whole, native bib managers have an advantage when it comes to "producing" bibtex because they have no semantic translation to do. They just manage chunks of text tied to a key, and they're happy to ignore what's inside that chunk. I don't have that luxury. But then I tried I think pretty much all of them and wasn't happy with the experience. One thing I really like about z+bbt is that I can manage my entire library as a coherent whole, but can export targeted chunks for each paper. No 6000+ entry citation autocomplete for something I know will only stretch across 100. That, plus while it may not be pretty bibtex, it's never broken bibtex, so no rendering surprises caused by a stray bare %.

If you always want <i> to mean nocase, a simple BBT postscript will do this by just replacing all occurrences of <i>. I'm away from my computer right now but I can post one tonight.

The title casing feature in Zotero is pretty simple and doesn't handle a lot of cases that I do. I use citeprocs title caser, which is already better than Zotero's, but then I also help the citeproc title caser by hiding all markup from it, as the markup confuses it sometimes.

retorquere commented 6 years ago

this postscript should do it (untested though):

if (Translator.BetterBibLaTeX) {
  if (item.title) this.add({ name: 'title', value: item.title.replace(/<i>/g, '<i class="nocase">' });
}
retorquere commented 6 years ago

Did the postscript do what you wanted?

bensprung commented 5 years ago

Hi, I stumbled on this thread. I was having exactly the same problem as tolot27. For example, a journal article title that I have stored as

Competition between high and low mutating strains of <i>Escherichia coli</i>

Is supposed to come out on the other end (I am using Rstudio and Rmarkdown and going to PDF) as

Competition between high and low mutating strains of Escherichia coli

Nothing I tried (hidden settings etc) was helping, but the postscript works! Thank you for posting that. BTW there is a missing ) towards the end of the second line, should be )});

mikoontz commented 5 years ago

I also found this issue because of the same use case. I have lots of species names in my citations and I want to be able to italicize them properly. I tried hard to understand this thread, but came up short. So I'm still not clear on why <i>Dendroctonus brevicomis</i> should ever be exported to the .bib file with a capital 'b' in brevicomis. I'm suspending disbelief that this is desirable behavior! Thanks so much for all your work on BBT, and thanks for the workarounds.

For those like me that were novices to the postscripting capabilities of BBT, here's a relevant link: https://retorque.re/zotero-better-bibtex/exporting/scripting/

For easy copypasting, this code snippet incorporates the correction by @bensprung to @retorquere's postscript:

if (Translator.BetterBibLaTeX) {
  if (item.title) this.add({ name: 'title', value: item.title.replace(/<i>/g, '<i class="nocase">' )});
}
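
To try the replacement outside Zotero first, the same substitution can be wrapped in a plain function. `markItalicsNocase` is a hypothetical helper name used only for this illustration; inside BBT the postscript above is what actually runs.

```javascript
// Stand-alone version of the substitution used in the postscript.
// The helper name is invented for this example.
function markItalicsNocase(title) {
  // Only bare <i> tags match, so tags that already carry a class are untouched.
  return title.replace(/<i>/g, '<i class="nocase">');
}

console.log(markItalicsNocase(
  'Competition between high and low mutating strains of <i>Escherichia coli</i>'
));
```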
retorquere commented 5 years ago

So I'm still not clear on why Dendroctonus brevicomis should ever be exported to the .bib file with a capital 'b' in brevicomis.

Neither Zotero nor BBT does any kind of natural language processing -- if you want stuff to be marked as proper name and therefore excluded from case meddling, you need to mark it as such. <i>Dendroctonus brevicomis</i> is exported capitalized because brevicomis is not one of the words excluded by the Zotero capitalizer (BBT uses the Zotero capitalizer), and in general, BBT must capitalize titles.

You can turn this off if you are willing to accept the collateral damage that comes with it (see that FAQ), but for this specific case, Zotero would also capitalize brevicomis when generating the bibliography. The proper, Zotero-supported solution is to enter your title in Zotero as

Competition between high and low mutating strains of <i><span class="nocase">Escherichia coli</span></i>

-- that the postscript above works is a quirk from BBT, which accepts nocase on any element.

If Zotero changes its behavior so that <i> implies <i><span class="nocase">, it would be trivial for me to follow suit, but in Zotero, <i> does not mean <i><span class="nocase">, so I don't treat it as if it were. If you convince Zotero to change this, or to add a way to mark proper names, I'd be happy to apply that to my exports.
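
For anyone who prefers the Zotero-supported form over the `<i class="nocase">` quirk, the substitution could instead wrap the contents of each plain <i>...</i> in a nocase span. A minimal sketch, assuming titles contain no nested <i> tags; `wrapItalicsInNocase` is a made-up name and this has not been run inside BBT:

```javascript
// Hypothetical helper: rewrite <i>X</i> as <i><span class="nocase">X</span></i>.
function wrapItalicsInNocase(title) {
  return title.replace(/<i>(.*?)<\/i>/g, (match, inner) =>
    // Leave italics alone if they already contain a nocase span,
    // so running the helper twice gives the same result.
    inner.includes('class="nocase"')
      ? match
      : `<i><span class="nocase">${inner}</span></i>`
  );
}

console.log(wrapItalicsInNocase('strains of <i>Escherichia coli</i>'));
```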

bensprung commented 5 years ago

if you want stuff to be marked as proper name and therefore excluded from case meddling

Would be great if Zotero would maintain a user-editable list of proper names that are to be so protected. Any chance BBT could do that instead? Most of us who have Zotero libraries full of papers with species names probably don't really have that many names to deal with.

retorquere commented 5 years ago

Would be great if Zotero would maintain a user-editable list of proper names that are to be so protected.

You'd have to talk to the devs who work on the Zotero client (Dan Stillman & co) and those who work on the citation processor (Frank Bennett & co) -- the former if you want an UI for it (less important) and the latter if you want support for it in the processor (this is key). If it gets implemented in citeproc, it'd either be automatically supported in BBT, or it'd be trivial to add.

Any chance BBT could do that instead? Most of us who have Zotero libraries full of papers with species names probably don't really have that many names to deal with.

No, sorry. You can do this in a postscript, but BBT follows the Zotero behavior here, and specifically for the case meddling I'm very happy to re-use the citeproc titlecaser -- word detection uses some fairly gnarly regex-based scanning (because natural language processing here would be a nightmare) and if I'm going to deviate from what Zotero itself does it's going to get messy fast. Messy means harder for me to support.

This postscript may work (untested); you can add as many terms to the regex as you want.

if (Translator.BetterTeX) {
  // \b anchors each term at word boundaries so only whole words are wrapped
  if (item.title) this.add({ name: 'title', value: item.title.replace(/\b(coli|melitensis)\b/g, '<span class="nocase">$1</span>' )});
}
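
If the list of terms grows, building the regex from an array may be easier to maintain than editing the pattern by hand. The following is only a sketch: `protectNames` and `escapeRegExp` are hypothetical helpers, not part of BBT, and the result would still have to be passed to `this.add` in a postscript as above.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every whole-word occurrence of the given
// names in a nocase span.
function protectNames(title, names) {
  if (names.length === 0) return title;
  const pattern = new RegExp(`\\b(${names.map(escapeRegExp).join('|')})\\b`, 'g');
  return title.replace(pattern, '<span class="nocase">$1</span>');
}

console.log(protectNames('strains of <i>Escherichia coli</i>', ['coli', 'melitensis']));
```

Matching on word boundaries keeps a term such as coli from being wrapped when it is only part of a longer word like coliform.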
mikoontz commented 5 years ago

Huge thanks for all of this! And I think it is making some sense to me why it works the way it works. Adding the additional 'nocase' for each <i></i> or <b></b> works for me! I was able to do this manually upon importing a new reference and with the postscript code snippet.

retorquere commented 5 years ago

You're very welcome -- I don't mean to be flippant about this, but proper name detection is just way too hard to do and there's bound to be a bazillion edge cases.

github-actions[bot] commented 3 years ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.