retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Capitalization: Capitalize all title-fields for language "en" #383

Closed retorquere closed 8 years ago

retorquere commented 9 years ago

@nickbart1980 says:

BBT should convert all titles to title-case if the ‘Language’ field is empty or starts with ‘en’, excluding, however, skip words, and strings enclosed in <span class="nocase">…</span>. ‘All titles’ means title, volume-title, container-title, collection-title, including their ‘short’ forms. Titles in entries with a non-empty ‘Language’ field that does not start with ‘en’ should be left alone (see the notes on \MakeSentenceCase, biblatex manual 4.6.4, and compare the man page of pandoc-citeproc, which has to do the inverse conversion when using a biblatex database – as would, BTW, any import of bib(la)tex into Zotero). For bibtex, which does not have a langid field and thus cannot distinguish languages, I would guess that the complete title fields of non-English titles should be wrapped in braces to prevent bibtex from messing with capitalisation.
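The language rule proposed above can be sketched as a small predicate. This is only an illustration in Python, not BBT's code: the function name `should_titlecase` is hypothetical, and treating the field case-insensitively and ignoring surrounding whitespace are assumptions the request does not spell out.

```python
from typing import Optional


def should_titlecase(language: Optional[str]) -> bool:
    """Return True when an item's titles should be converted to title case.

    Rule from the request: convert if the Language field is empty or
    starts with 'en'; leave every other language alone.
    """
    # Assumption: compare case-insensitively and ignore stray whitespace.
    lang = (language or "").strip().lower()
    return lang == "" or lang.startswith("en")
```

With this rule, `should_titlecase("en-US")` and `should_titlecase("")` hold, while `should_titlecase("de")` does not.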

njbart commented 9 years ago

Wait, you are saying DeclareCaseLangs is not fixed. Should I stick those languages in a preference so the translator can match?

I think that’s unnecessary. English is the only language that has case conversion. I’d be very surprised if anyone actually ever redefined \DeclareCaseLangs.

njbart commented 9 years ago

So, on the \emph, what is the preference?

\emph{Homo sapiens} or, if this is not feasible: {{\emph{Homo sapiens}}} – NOT {\emph{Homo sapiens}}

More complex cases such as

<span class="nocase"><i>Sambucus nigra</i> subsp. <i>canadensis</i></span>

would have to be mapped to either \emph{Sambucus nigra} {subsp.} \emph{canadensis}

or {{\emph{Sambucus nigra} subsp. \emph{canadensis}}}

Or do you see any other options?

njbart commented 9 years ago

title = {From {Bell}’S {Theorem} to {Secure Quantum Key Distribution}}, … how to prevent this?

Not sure. Treat as part of the word? Also, how do the citeproc-js routines for converting to title-case work (they must have some solution for this), and could BBT possibly borrow these?

retorquere commented 9 years ago

I've looked into that, but they look to assume a substantial amount of internal state. I'm building a word parser from this right now.

WRT the \emph{} case, the reverse case is indeed more complex. It would require lookahead and that's not a level of complexity I'm looking forward to. Adding a double brace to protect would be the easiest way by far.

retorquere commented 9 years ago

Argh, and then there is o'neal, where you do want to capitalize both.

Crazy.

retorquere commented 9 years ago

I managed to tap into the CSL titlecaser (I think!), so we'll see how that goes. Not super enthusiastic about changing people's capitalization, so the title casing will definitely be behind an off-by-default preference.

retorquere commented 9 years ago

It's starting to look fairly decent. Only titles are cased right now, I'd appreciate a list of fields that need this treatment (and help getting the test files updated for the new behavior). The title casing is a little fragile as the CSL titlecaser (sensibly) expects to be handed whole sentences, and I'm handing it fragments as I deal with the embedded HTML.

njbart commented 9 years ago

… I'm handing it fragments …

But since Zotero fields may contain embedded <span class="nocase">…</span>, <i>…</i>, <b>…</b>, etc. anyway, I would have expected the CSL titlecaser to be able to handle this.

… a list of fields that need this treatment …

title, container-title (except in Journal Article, Magazine Article, Newspaper Article), volume-title; not collection-title.

… getting the test files updated …

I’m afraid right now I’m still too busy with testing edge cases in biblatex (see e.g. https://tex.stackexchange.com/questions/276943/biblatex-how-to-to-emphasize-but-not-caps-protect; documentation is a complete mess).

retorquere commented 9 years ago

The titlecaser doesn't uppercase anything inside HTML tags. If that's OK, I'm fine with that (it would simplify things), but it doesn't seem right.

You phrase applicability in CSL-JSON terms -- that follows the mapping behavior already present? volume-title doesn't always map to the same bibtex field, and the fields it maps to can be generated by other means. Can you specify this behavior further?

retorquere commented 9 years ago

I’m afraid right now I’m still too busy with testing edge cases in biblatex (see e.g. https://tex.stackexchange.com/questions/276943/biblatex-how-to-to-emphasize-but-not-caps-protect; documentation is a complete mess).

which is a much more valuable way to spend your time. Forget about the test cases.

njbart commented 9 years ago

The titlecaser doesn't uppercase anything inside HTML tags.

Well, it seems not to uppercase the first word; see bug report at https://bitbucket.org/fbennett/citeproc-js/issues/187/title-case-formatter-does-not-title-case

retorquere commented 9 years ago

I bet that's more breakage from it starting a fresh state after an HTML tag. I work around that with some success.

njbart commented 9 years ago

Can you specify this behavior further?

In biblatex terms: title, shorttitle, origtitle, booktitle, maintitle; not journaltitle, series, and eventtitle.

Since BBT currently does not output any subtitles, titleaddons, reprinttitle, issuetitle or indextitle, none of these are relevant.

retorquere commented 9 years ago

Ah sweet, that makes things a lot clearer.

retorquere commented 9 years ago

Done, tests are running.

retorquere commented 9 years ago

When we're done, Zotero is going to have the best damn BibTeX support short of JabRef. And that includes the commercial offerings Zotero is usually compared against.

njbart commented 9 years ago

You bet!

retorquere commented 9 years ago

Should journaltitle be caps-preserved? Or do caps preservation and titlecasing always and exclusively co-occur on the same set of fields?

njbart commented 9 years ago

Should journaltitle be caps-preserved?

No. Traditionally, journal titles are in title case and never change – and, more important for BBT, styles don’t try to change them.

Or do caps preservation and titlecasing always and exclusively co-occur on the same set of fields?

Yes.

retorquere commented 9 years ago

OK, so I can just collapse those two behaviors and exclusively apply them to title, shorttitle, origtitle, booktitle, maintitle; no more, no less.
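The agreed field list can be written down as a single lookup. The sketch below is illustrative Python rather than BBT's implementation; `CASED_FIELDS` and `needs_case_handling` are hypothetical names, and the field names are the biblatex ones listed in the thread.

```python
# biblatex fields that get both title casing and caps preservation,
# as agreed in the thread. journaltitle, series and eventtitle are
# deliberately absent.
CASED_FIELDS = frozenset(
    {"title", "shorttitle", "origtitle", "booktitle", "maintitle"}
)


def needs_case_handling(field: str) -> bool:
    """Return True if the biblatex field gets title casing and caps protection."""
    return field.lower() in CASED_FIELDS
```

Keeping both behaviors behind one set reflects the point that they always co-occur on the same fields.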

retorquere commented 9 years ago

(this is important to know for sure as I'm about to dive in and start adjusting test cases)

njbart commented 9 years ago

None of the CSL or biblatex styles I ever came across fiddles with the case of journal names or series – so, yes.

(If you don’t want to just take my word on this – adamsmith, Oct 2015: “I've never seen sentence cased journal titles in any citation …”, here.)

retorquere commented 9 years ago

Note to self: don't break bibvar preservation

retorquere commented 9 years ago

@nickbart1980 So we're just going with always double-brace for nocase?

njbart commented 9 years ago

I’m afraid so – not pretty, but apparently the only format that works regardless of what’s inside the braces. Alternative: use double-braces only if ‘argument’ starts with \.
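Both options mentioned here (always double-brace, or double-brace only when the argument starts with a backslash) fit in one small function. This is a minimal Python sketch under stated assumptions: nocase spans are not nested, any markup inside them has already been converted to LaTeX, and `protect_nocase` and its `only_commands` flag are hypothetical names.

```python
import re

# Assumption: nocase spans are never nested inside each other.
NOCASE = re.compile(r'<span class="nocase">(.*?)</span>', re.DOTALL)


def protect_nocase(text: str, only_commands: bool = False) -> str:
    """Replace nocase spans with brace-protected LaTeX groups.

    By default every span becomes {{...}}. With only_commands=True the
    double brace is used only when the content starts with a backslash,
    and a single brace otherwise (the alternative floated above).
    """
    def repl(match: "re.Match[str]") -> str:
        inner = match.group(1)
        if only_commands and not inner.startswith("\\"):
            return "{" + inner + "}"
        return "{{" + inner + "}}"

    return NOCASE.sub(repl, text)
```

For example, `protect_nocase('A <span class="nocase">Foo</span> b')` yields `A {{Foo}} b`.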

retorquere commented 9 years ago

Working on adjusting the tests.

retorquere commented 9 years ago

Does this mean BTW that {\emph{...}} should be preferred over \emph{...} for <i>...</i>?

njbart commented 9 years ago

Good idea – counterintuitive, but I guess this should work for emphasized but non-caps-protected strings.

See for yourself:

\documentclass[american]{article}
\usepackage[american]{babel}
\usepackage[autostyle]{csquotes}%
\usepackage[backend=biber, style=apa]{biblatex}
\DeclareLanguageMapping{american}{american-apa}
\usepackage{fontspec}
\setmainfont{Linux Libertine}
\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}

@article{a,
  author       = {Doe, John},
  year         = {2015},
  title        = "Any {Foo} that appears uppercase is protected – 
                 \emph{Foo}, {\emph{Foo}}, {{\emph{Foo}}}"
}
\end{filecontents}
%
\addbibresource{\jobname.bib}
\begin{document}
\cite{a}
\printbibliography
\end{document}

Output:

Doe, J. (2015). Any Foo that appears uppercase is protected – Foo, foo, Foo.

retorquere commented 9 years ago

wait, {\emph{Foo}} doesn't exactly do what I had expected here. Why would we want that form?

retorquere commented 9 years ago

Ah, got it -- we want <i>...</i> not to trigger protection, so {\emph{...}} is the right behavior. Does this also go for textbf, textsuperscript, textsubscript, enquote and textsc?

njbart commented 9 years ago

wait, {\emph{Foo}} doesn't exactly do what I had expected here. Why would we want that form?

For emphasized but non-caps-protected strings, i.e., <i>...</i>.

See my enquiry at https://tex.stackexchange.com/a/277170/22851 whether this is indeed a general solution.

Does this also go for textbf, textsuperscript, textsubscript, enquote and textsc?

\textbf: I tested this, and: yes. – Will try the others …

njbart commented 9 years ago

Ok, \textbf, \textsuperscript, \textsubscript, \enquote and \textsc all show the same behaviour. See for yourself: Try any of these commands in my MWE above.
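Given that these commands behave alike, the markup conversion can be sketched as a table-driven rewrite that wraps each command in one extra brace level, so that the markup itself does not trigger caps protection. This is an illustrative Python sketch, not BBT's converter: the tag-to-command table covers only the simple HTML tags, the same tag is assumed not to be nested inside itself, and `markup_to_latex` is a hypothetical name.

```python
import re

# Simple Zotero rich-text tags and the LaTeX commands discussed above.
# Small caps and quotes (\textsc, \enquote) are left out of this sketch.
TAG_TO_COMMAND = {
    "i": "emph",
    "b": "textbf",
    "sup": "textsuperscript",
    "sub": "textsubscript",
}

TAG = re.compile(r"<(i|b|sup|sub)>(.*?)</\1>", re.DOTALL)


def markup_to_latex(text: str) -> str:
    """Convert <i>x</i> etc. to {\\emph{x}}: formatted, not caps-protected."""
    def repl(match: "re.Match[str]") -> str:
        command = TAG_TO_COMMAND[match.group(1)]
        return "{\\" + command + "{" + match.group(2) + "}}"

    previous = None
    # Repeat until stable so that different tags nested in each other
    # are converted too.
    while previous != text:
        previous = text
        text = TAG.sub(repl, text)
    return text
```

For example, `markup_to_latex("The <i>Foo</i> bar")` gives `The {\emph{Foo}} bar`.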

njbart commented 9 years ago

BTW: The citeproc-js issue “Title case formatter does not title-case first word inside markup” (here) has been resolved now, so possibly the workaround you mentioned earlier is no longer required.

retorquere commented 9 years ago

Thanks for the heads-up -- workaround removed.

retorquere commented 9 years ago

(mental note -- un-chunk titlecaser)

njbart commented 9 years ago

Got a response from a biblatex developer: {\emph{Foo}} being parsed as emphasized but non-caps-protected string, and \emph{Foo} and {{\emph{Foo}}} as emphasized and caps-protected strings apparently is the expected behavior:

“This is just one of the small differences between bibtex the program and the btparse library used by biber.” (https://github.com/plk/biblatex/issues/357#issuecomment-154819866)
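The reported behavior is consistent with BibTeX's "special character" convention: a brace group at depth 0 whose first character is a backslash is still case-converted, while any other braced text is protected. The following toy model in Python is only a sketch of that convention, not biber's or btparse's actual code; `sentence_case` is a hypothetical name, and the model ignores details such as escaped braces.

```python
def sentence_case(title: str) -> str:
    """Toy sentence-casing: lowercase all unprotected letters except the first."""
    out = []
    depth = 0            # current brace depth
    special = False      # inside a depth-1 group that starts with a backslash
    seen_letter = False  # the first letter of the title keeps its case
    i = 0
    while i < len(title):
        ch = title[i]
        if ch == "{":
            depth += 1
            if depth == 1 and title[i + 1:i + 2] == "\\":
                special = True
            out.append(ch)
        elif ch == "}":
            if depth == 1:
                special = False
            depth = max(depth - 1, 0)
            out.append(ch)
        elif ch == "\\":
            # Copy the control sequence name unchanged.
            j = i + 1
            while j < len(title) and title[j].isalpha():
                j += 1
            out.append(title[i:j])
            i = j
            continue
        elif ch.isalpha():
            convertible = depth == 0 or special
            out.append(ch.lower() if convertible and seen_letter else ch)
            seen_letter = True
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Applied to the title field of the MWE above, this model keeps `{Foo}`, `\emph{Foo}` and `{{\emph{Foo}}}` capitalized and lowercases only `{\emph{Foo}}`, matching the printed output.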

retorquere commented 9 years ago

alright, then the current approach is the proper one.

njbart commented 9 years ago

FWIW, the btparse library biber/biblatex are using is described in

Ward, Gregory P. 1998. “btOOL: An Object-Oriented Library for Processing BibTeX-Style Text Databases.” Montreal: McGill University, School of Computer Science. https://gerg.ca/software/btOOL/btOOL.ps.gz.

In particular, see p. 58–9 on bt_change_case(): “The right solution (and this applies to any title with a TeX command that becomes actual text) is to bury the control sequence at brace-depth two: A Guide to {{\LaTeXe}}: Document Preparation ...

retorquere commented 9 years ago

Cool. The current converter only does exactly this - at deeper levels it knows they've already been applied and doesn't do it again.

The main hurdle right now is title casing. If I can get that right, I think it's good to go.

retorquere commented 9 years ago

Damn... most of it is working now (barring a final issue in the title caser) but it is slow. Like 250% slower. I'll need to look where that happened, because this really isn't OK.

retorquere commented 9 years ago

OK, all tests green (yay!) but performance is unacceptable. Still looking at possible hotspots, but caps preservation is a major contributor. Haven't yet tried with titleCasing on, as that currently errors out in the performance test.

retorquere commented 9 years ago

(for clarity: the current slowness isn't attributable to the titlecaser, it's my caps preservation)

retorquere commented 9 years ago

OK, managed to whittle performance back to where it was. Still looking at the title casing -- the CSL title caser doesn't yet handle everything gracefully. Does the following sum up titlecasing correctly, assuming the source is in sentence case?

  1. A word is anything that starts with a letter after (^|\s)(<punctuation>*)
  2. If the word appears in a list of "small words", and it is not preceded by :<space>* (each time? first time?), downcase the first letter
  3. If it does not, upcase the first letter

njbart commented 9 years ago

I’m afraid I can’t give a definitive answer, partly because I’m not sure I fully understand your notation. All in all, I’d say citeproc-js’s titlecaser does a good job – if it does something unexpected, it’d be interesting to see the example.

Two things to keep in mind: (1) AFAICT, the titlecaser never actively downcases anything.

(2) The first letter of a subtitle is capitalised in some styles but not in others. citeproc-js has a heuristic to identify the subtitle, which is, AFAIR, to compare title and short title, and, if the initial part of the title matches the short title, assume that the rest of the title, i.e., title minus short title, is the subtitle. citeproc-js then creates “virtual” variables title-main and title-sub (see http://sourceforge.net/p/xbiblio/mailman/message/32056473/). In addition to that, there is a citeproc-js processor plugin (https://juris-m.github.io/downloads/, look for “Propachi Upper”) that controls whether the first letter of a subtitle should be capitalised. BBT, I guess, could again borrow this, and output biblatex title and subtitle fields. BBT might also offer an option for capitalising the first letter of a subtitle (the Zotero/CSL recommendation is to enter subtitles starting with a lowercase letter).
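The subtitle heuristic as recalled here can be sketched as follows. This is a guess at the idea in Python, not citeproc-js's actual logic: `split_subtitle` is a hypothetical name, and both the requirement that a non-alphanumeric separator follow the short title and the set of separator characters stripped are assumptions.

```python
from typing import Optional, Tuple


def split_subtitle(title: str, short_title: str) -> Tuple[str, Optional[str]]:
    """Split a title into (main, sub) using the short title as a prefix.

    Returns (title, None) when the short title is not a proper prefix
    followed by a separator.
    """
    if short_title and title.startswith(short_title):
        remainder = title[len(short_title):]
        # Assumption: a separator (e.g. ':' or a space) must follow the
        # short title, so "Foo" is not treated as the main title of "Foobar".
        if remainder and not remainder[0].isalnum():
            sub = remainder.lstrip(" :;.?!-–")
            if sub:
                return short_title, sub
    return title, None
```

For example, a title of `Better BibTeX: a user guide` with short title `Better BibTeX` splits into `Better BibTeX` and `a user guide`.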

retorquere commented 9 years ago

The notation is sort-of-regex, in normal english:

  1. A word-start is any letter that follows a space, but there may be punctuation between them, so <space>XYZ's, <space>"XYZ's" would consider the Xs to be word-starts, but not the s, since the s doesn't have a preceding <space>.
  2. If the word appears in a list of small words, leave it alone, unless it is preceded by :<space>
  3. Uppercase the word-start
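The three rules above can be sketched with one regular expression. This Python sketch is an illustration only, not the CSL titlecaser: the small-word list is a made-up sample, and two behaviors are added as assumptions that the rules do not state, namely that the first word is always capitalized and that words already containing a capital (such as iPod) are left untouched.

```python
import re

# Hypothetical sample list; a real titlecaser has its own.
SMALL_WORDS = {
    "a", "an", "and", "as", "at", "but", "by", "for", "in",
    "nor", "of", "on", "or", "the", "to", "up", "via",
}

# Rule 1: a word starts at a letter that follows whitespace (or the start
# of the string), optionally with punctuation in between. Apostrophes are
# part of the word, so the s in "XYZ's" is not a word-start.
WORD = re.compile(r"(?:^|(?<=\s))([^\w\s]*)([^\W\d_][\w'’]*)")


def title_case(sentence: str) -> str:
    """Title-case a sentence-cased string following rules 1-3."""
    def repl(match: "re.Match[str]") -> str:
        punct, word = match.group(1), match.group(2)
        if word != word.lower():
            return match.group(0)  # assumption: never touch existing capitals
        first = match.start() == 0
        after_colon = sentence[:match.start()].rstrip().endswith(":")
        # Rule 2: small words stay as they are, except after ":<space>".
        if word in SMALL_WORDS and not (first or after_colon):
            return match.group(0)
        # Rule 3: uppercase the word-start.
        return punct + word[0].upper() + word[1:]

    return WORD.sub(repl, sentence)
```

Treating the apostrophe as part of the word also avoids the `{Bell}’S` problem raised earlier in the thread.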

I've submitted a sample in an issue report for citeproc-js, as it currently errors out if I feed it that.

WRT (2), I can't use title matching currently since I'm not feeding it the whole reference, just the field. And I can't assume any particular style in play; I'm targeting what Bib(La)TeX wants to see, which must be before any decision is made on render style.

I'm more than happy to use the CSL titlecaser if it works (it doesn't seem like an easy problem, and I lack any expertise in the domain), but I'm abusing their API (it doesn't expect to be fed just parts of references), and it does some double work (both citeproc and BBT are traversing the HTML string). If there is an easy way to fold those two together it will likely have a positive performance impact, and BBT has gotten sufficiently complex that I need to tread lightly here. The cache helps enormously, but I know for example users whose library already takes 40-60 minutes to export with a cold cache, and I'd like to at least not worsen that (an output change has traditionally triggered a cache drop -- I'm going to make that optional because of this).

In any case, Frank has been super responsive so I'll just wait this one out.

retorquere commented 8 years ago

Is it desirable to caps-protect words in English titles that are already in Initialcaps? The titlecaser would have changed their sentence case form to the titlecase (Initialcaps) form anyhow.

njbart commented 8 years ago

Yes, of course.

Zotero’s Why is Apple launching a new version of the iPod?

must be converted to

biblatex Why Is {Apple} Launching a New Version of the {iPod}?

(Or else biblatex sentence-case styles would render the unprotected Why Is Apple Launching a New Version of the iPod? as Why is apple launching a new version of the ipod?)
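The protection step in this example can be sketched as: in the sentence-cased Zotero title, any word other than the first that carries a capital letter gets wrapped in braces before title casing is applied. This is a minimal Python illustration, not BBT's code; `protect_caps` and its `depth` parameter are hypothetical, and exempting only the very first word is an assumption.

```python
import re

# A word is a letter followed by word characters or apostrophes.
WORD = re.compile(r"[^\W\d_][\w'’]*")


def protect_caps(sentence: str, depth: int = 1) -> str:
    """Brace-protect capitalized words in a sentence-cased title.

    depth=1 gives {Apple}; depth=2 gives {{Apple}}.
    """
    first = WORD.search(sentence)
    first_start = first.start() if first else -1

    def repl(match: "re.Match[str]") -> str:
        word = match.group(0)
        # Assumption: the first word is capitalized only because it
        # starts the sentence, so it is not protected.
        if match.start() == first_start or word == word.lower():
            return word
        return "{" * depth + word + "}" * depth

    return WORD.sub(repl, sentence)
```

Running this on the Zotero title above protects `Apple` and `iPod`; title-casing the remaining words then gives the biblatex form shown.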

retorquere commented 8 years ago

But would Why Is Apple Launching a New Version of the {{iPod}}? not do the same thing?

retorquere commented 8 years ago

(I thought we had settled on Why Is {{Apple}} Launching a New Version of the {{iPod}}??)

njbart commented 8 years ago

But would Why Is Apple Launching a New Version of the {iPod}? not do the same thing?

That’s how biblatex’s (and bibtex’s!) conversion from title case to sentence case works. Just try my biblatex-apa MWE above.

(I thought we had settled on Why Is {{Apple}} Launching a New Version of the {{iPod}}??)

Right, that works, too, and is needed when the string inside starts with a \, and I don’t see problems if we use this across the board.