retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Capitalization: Capitalize all title-fields for language "en" #383

Closed retorquere closed 8 years ago

retorquere commented 8 years ago

@nickbart1980 says:

BBT should convert all titles to title-case if the ‘Language’ field is empty or starts with ‘en’, excluding, however, skip words, and strings enclosed in <span class="nocase">…</span>. ‘All titles’ means title, volume-title, container-title, collection-title, including their ‘short’ forms. Titles in entries with a non-empty ‘Language’ field that does not start with ‘en’ should be left alone (see the notes on \MakeSentenceCase, biblatex manual 4.6.4, and compare the man page of pandoc-citeproc, which has to do the inverse conversion when using a biblatex database – as would, BTW, any import of bib(la)tex into Zotero). For bibtex, which does not have a langid field and thus cannot distinguish languages, I would guess that the complete title fields of non-English titles should be wrapped in braces to prevent bibtex from messing with capitalisation.

retorquere commented 8 years ago

Do you mean for CSL JSON or for BBT? I'm not entirely certain about capitalizing user input; my idea is that BBT discloses user intent as best as possible given the impedance mismatch between the formats. User intent for capitalization is, I think, best expressed by the user capitalizing titles as desired.

njbart commented 8 years ago

If users enter On the prosodies of the Greek and Latin languages in Zotero, this is rendered as “On the Prosodies of the Greek and Latin Languages” in a title-case style, e.g., Chicago, and as “On the prosodies of the Greek and Latin languages” in a sentence-case style, e.g., APA.

To get the same in bibtex and biblatex, there is no other option than to convert the title to On the Prosodies of the {Greek} and {Latin} Languages; this is the only way to have it rendered as “On the Prosodies of the Greek and Latin Languages” in Chicago, and as “On the prosodies of the Greek and Latin languages” in APA.

This, I would argue, respects user intent as best as possible.

retorquere commented 8 years ago

Interesting. Which processor renders it that way? Not BibTeX then.

I'm still not entirely convinced. Adding braces around {Greek} is the only way to disclose to LaTeX you want to keep capitalisation. The easiest way to disclose that you want to have a certain capitalisation, but only if the style demands it, is to Capitalise the Source Sentence.

If you input On the Prosodies of the Greek and Latin Languages, does the processor you have in mind do the right thing with a style that does not do title-casing?

retorquere commented 8 years ago

BTW, how should this interact with caps preservation? Surely you wouldn't want On the prosodies of the Greek and Latin languages to be translated to On the {Prosodies} of the {Greek} and {Latin} {Languages}?

retorquere commented 8 years ago

Or would you want non-capitalised non-filler words to be capitalised, and capitalised non-filler words to be braced? How about something like iPod? This would be capitalised in this scheme. I'm not too keen on the <span class="nocase">…</span> workaround. It seems easier to provide a "capitalise this title" function to Zotero to just fix the input (assuming such can easily be done).

retorquere commented 8 years ago

Ugh, I can't add things to the reference edit pane without some crazy shady monkey patching. That is going to be too brittle. Acting on the whole reference is not a problem though.

retorquere commented 8 years ago

Why specifically for English though? Doesn't this apply to other languages equally?

retorquere commented 8 years ago

I have some ideas on how to get this to work, but I'll probably put it behind a preference

retorquere commented 8 years ago

Is the list of fields that should be capitalised the same as the list that should get preserve caps?

adunning commented 8 years ago

To clarify your earlier question, this doesn't need to be applied to CSL JSON, since citeproc handles the capitalization already; Zotero recommends that all titles be stored sentence-case.

retorquere commented 8 years ago

That was my earlier point actually. Why not have the user store the titles sentence-cased in the first place?

adunning commented 8 years ago

That is the official recommendation.

retorquere commented 8 years ago

I've taken a few stabs at it but it gets increasingly messy and fragile. I'm sorry, but I'm not going to honor this one.

njbart commented 8 years ago

> Why specifically for English though? Doesn't this apply to other languages equally?

No, only English has both title-case and sentence-case styles.

njbart commented 8 years ago

> Or would you want non-capitalised non-filler words to be capitalised, and capitalised non-filler words to be braced?

Exactly.

> How about something like iPod? This would be capitalised in this scheme.

EDIT: “iPod” shouldn’t be capitalised by BBT, and it should be protected.

njbart commented 8 years ago

> Is the list of fields that should be capitalised the same as the list that should get preserve caps?

Yes. bib(la)tex needs titles in title case, and those words that must not be lowercased again by sentence-case styles such as biblatex-apa need protection.

njbart commented 8 years ago

> I'm sorry, but I'm not going to honor this one.

That’d be a pity. It’s necessary since the conventions of bib(la)tex and CSL are incompatible: bib(la)tex expects titles in title-case, and words that must not be lowercased must be protected, but CSL expects titles in sentence-case, and words that must not be uppercased must be protected. (The latter doesn’t happen so very often, but without protection CSL title-case styles would turn, e.g., “nm” (nanometer) into “Nm” (Newtonmeter), something that should really be avoided.)

<span class="nocase">…</span>, BTW, is officially supported by citeproc-js and pandoc-citeproc.

Over at pandoc, we’ve been through this whole exercise when writing pandoc-citeproc’s biblatex -> CSL converter (the inverse of what I’d like BBT to do), but it’s not that complicated after all, and seems to work great.

retorquere commented 8 years ago

But how would I know that iPod should be excluded from capitalization? And why would it not be better to assume title-case and convert title-case to sentence-case for CSL? That seems to be a lot simpler to me.

retorquere commented 8 years ago

This one is not going to be easy. It will require rethinking of the way I convert the HTML-ish input to LaTeX.

njbart commented 8 years ago

> But how would I know that iPod should be excluded from capitalization?

All words in camel case should be excluded.
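A minimal sketch of that rule, assuming "camel case" means any capital after the first character. The helper name `has_inner_capital` is made up for illustration and is not BBT code. Note that this reading also catches all-caps acronyms such as TCP, which need protection anyway.

```python
def has_inner_capital(word: str) -> bool:
    """Return True if any character after the first is uppercase (iPod, LaTeX, TCP)."""
    return any(ch.isupper() for ch in word[1:])


# Words that should be left alone by the capitaliser and protected on export.
for word in ["iPod", "LaTeX", "TCP", "Greek", "prosodies"]:
    print(word, has_inner_capital(word))
```

A word like "Greek" is not caught here; under the scheme discussed above it would still be protected because it is capitalised in sentence-case input, just by a different test.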

> And why would it not be better to assume title-case and convert title-case to sentence-case for CSL? That seems to be a lot simpler to me.

The CSL folks insist that storing titles in sentence case is simpler. From http://docs.citationstyles.org/en/stable/specification.html#sentence-case-conversion:

> CSL processors don’t recognize proper nouns. As a result, strings in sentence case can be accurately converted to title case, but not vice versa.[*] For this reason, it is generally preferable to store strings such as titles in sentence case, and only use [in CSL style files the] text-case [attribute] if a style desires another case.

See also https://www.zotero.org/support/kb/sentence_casing

EDIT: [*] This statement, BTW, is just as false as if biblatex claimed “strings in title case can be accurately converted to sentence case, but not vice versa” (though biblatex claims no such thing). In truth, both need (and processors for both have) some mechanism for protecting strings from being converted.

retorquere commented 8 years ago

The problem here is with the term "words". So far, BBT has done char-for-char translation to escape such problems:

Caps preservation runs after all these problems have been evaded by just scanning through the resulting LaTeX string, but that won't do, because prosodies would be offered to the caps preserver capitalized, and would thus be protected.

I'm not saying it can't be done (it can) or that it shouldn't (I'm open to it), but it's a complicated matter in which I'd have to spec out the entire transformation process of Zotero input to LaTeX output, as the existing simpler character transformation pipe can't just easily be adjusted to do the work, and it's sufficiently complex that I can't just start tinkering until I get it right.

andersjohansson commented 8 years ago

Ahh, this is a mess! Sometimes using LaTeX, sometimes using the LibreOffice plugin, storing things correctly in Zotero and getting a correct output with all tools and different styles becomes difficult (though I must admit that I have mostly used APA now). CSL will hopefully "soon" (http://xbiblio-devel.2463403.n2.nabble.com/Sentence-case-variants-td7578968.html) support uppercasing subtitles, so the recommendation would "soon" be to store titles as "Writing a good title according to Johansson: the importance of subtitles". Then CSL would keep it as is for Vancouver-styles, capitalize "the" for APA-styles ("Writing a good title according to Johansson: The importance of subtitles"), and title-case it for Chicago-styles ("Writing a Good Title According to Johansson: The Importance of Subtitles").

But before this is implemented, I store my titles as: "Writing a good title according to Johansson: The importance of subtitles".

The desired biblatex-form for this (very simple) example should then be "Writing a Good Title According to {Johansson}: The Importance of Subtitles". The "the" should not be protected (to allow for Vancouver-styles), which it of course is if I store it as uppercase as now.

Perhaps I should just write and cite in Swedish where there is kind of one rule (title-case doesn't exist).

Some of the discussions: https://forums.zotero.org/discussion/35190/1/beta-capitalization-after-colons/ https://forums.zotero.org/discussion/45811/titlesubtitle-gui/

njbart commented 8 years ago

> CSL will hopefully "soon" … support uppercasing subtitles

By “uppercasing subtitles”, you mean “uppercasing the first character of the subtitle”, right? – For Zotero, there is a citeproc-js processor plugin that does this: https://juris-m.github.io/downloads/, look for “Propachi Upper”.

I’m not sure OTTOMH how biblatex handles this, but not protecting the first character of the subtitle is most likely a good idea.

retorquere commented 8 years ago

With subtitle, you mean the stuff after the colon?

Can we start fleshing out an algorithm and test cases? I've thought about this long and hard, and I've come to the conclusion that the previous staged approach (capitalize, convert to LaTeX char-by-char, then protect) won't work, because:

  1. If I do it naively, then by the time I get to step 3 I can't distinguish chars that were uppercase to begin with from those that were capitalized by step 1.
  2. There is no trivial way to relate characters to their original position, so I can't just "look back" to see if a certain char in step 3 was lowercase in step 1; characters expand to multiple characters (when expanding Unicode to LaTeX, for example), stuff like <i>...</i> also messes with it, and the notion of "character" in relation to "position" is problematic to begin with because of the wonderful mess that is JavaScript Unicode handling.

So from the looks of it, I will be writing a backtracking recursive descent parser for this by hand. This is a major undertaking, so having a good idea of the algorithm and test cases will be crucial. Way too easy to get this wrong.

njbart commented 8 years ago

Why not protect first, then capitalize & convert to LaTeX char-by-char?

retorquere commented 8 years ago

Protection consists of putting braces around words; if I do that in step one, they'll be converted to \{ in step 3.

retorquere commented 8 years ago

Plus, if I inject anything before capitalization, I will likely mess with the "wordiness" of the string of characters, which in turn interferes with capitalization.

njbart commented 8 years ago

> Protection consists of putting braces around words …

I’d use something else in this first step. The title may already contain <span class="nocase">…</span>; maybe this could be used to protect all other uppercase and camelcase words, too. The last step, after capitalization & conversion, would be to replace <span class="nocase">…</span> by {…}.
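The ordering proposed here can be sketched roughly as follows. This is an illustration under stated assumptions, not BBT's converter: the function names, the skip-word list and the escape table are all made up for the example, steps 3 and 4 are merged into one function for brevity, and adjacent protected words are not grouped into a single brace pair.

```python
import re

NOCASE_OPEN, NOCASE_CLOSE = '<span class="nocase">', "</span>"
# A token is either a whole nocase span or a run of letters.
TOKEN = re.compile(r'<span class="nocase">.*?</span>|[^\W\d_]+')
SPAN = re.compile(r'<span class="nocase">(.*?)</span>')
# Illustrative skip-word list and escape table; both are assumptions.
SKIP_WORDS = {"a", "an", "and", "as", "at", "by", "for", "from",
              "in", "of", "on", "or", "the", "to", "with"}
ESCAPES = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}"}


def protect(title: str) -> str:
    """Step 1: wrap words that carry capitals in nocase spans."""
    first = True

    def repl(match):
        nonlocal first
        token = match.group()
        is_first, first = first, False
        if token.startswith(NOCASE_OPEN):
            return token  # already protected by the user
        # The title's first word may keep a plain leading capital.
        rest = token[1:] if is_first else token
        if any(ch.isupper() for ch in rest):
            return NOCASE_OPEN + token + NOCASE_CLOSE
        return token

    return TOKEN.sub(repl, title)


def capitalise(title: str) -> str:
    """Step 2: title-case the words left outside nocase spans."""
    first = True

    def repl(match):
        nonlocal first
        token = match.group()
        is_first, first = first, False
        if token.startswith(NOCASE_OPEN):
            return token
        after_colon = title[:match.start()].rstrip().endswith(":")
        if is_first or after_colon or token not in SKIP_WORDS:
            return token[0].upper() + token[1:]
        return token

    return TOKEN.sub(repl, title)


def to_latex(title: str) -> str:
    """Steps 3 and 4: escape char-by-char, then turn spans into {...}."""
    def escape(text):
        return "".join(ESCAPES.get(ch, ch) for ch in text)

    out, pos = [], 0
    for match in SPAN.finditer(title):
        out.append(escape(title[pos:match.start()]))
        out.append("{" + escape(match.group(1)) + "}")
        pos = match.end()
    out.append(escape(title[pos:]))
    return "".join(out)


def export_title(title: str) -> str:
    return to_latex(capitalise(protect(title)))


print(export_title("Writing a good title according to Johansson: the importance of subtitles"))
# Writing a Good Title According to {Johansson}: The Importance of Subtitles
```

Because the braces only appear in the last step, they are never seen by the escaping pass, which sidesteps the \{ problem mentioned above.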

retorquere commented 8 years ago

I'll think about this. It requires a rewrite of the HTML converter, but it's less tricky than a rec-desc parser for sure. Test cases would still be of paramount importance, as I can't rely on the existing test suite -- things will legitimately change, so I'll need a reference set for what is the correct behavior.

andersjohansson commented 8 years ago

> By “uppercasing subtitles”, you mean “uppercasing the first character of the subtitle”, right?

Yes, of course, sloppy writing.

> With subtitle, you mean the stuff after the colon?

Yes!

Thanks for thinking through this. I wonder if someone has put up a list of "tricky cases" in the Zotero/CSL-discussions.

retorquere commented 8 years ago

It's just that I seem to recall a discussion where the comma was considered to be the subtitle separator.

In any case, what exactly ought to be done with a subtitle?

andersjohansson commented 8 years ago

As I understand it, Zotero users should (at least when CSL supports this properly) store everything except the first word and proper nouns as lowercase, i.e: "Writing a good title according to Johansson: the importance of subtitles". Then biblatex (I don't know about bibtex) should get this title-cased, with capitalized words beyond the first word protected: "Writing a Good Title According to {Johansson}: The Importance of Subtitles".

In this case subtitles don't have to be considered explicitly at all as far as I can tell. But in the case of other conventions for user data... Well, it gets messy, perhaps nothing should be done? I guess any capitalization done by Better BibTeX should be switchable via some option.

retorquere commented 8 years ago

You'd better believe it when it comes to the latter.

Could you start putting in a test case for the case you just mentioned? It would at least give me an initial target

njbart commented 8 years ago

A few test cases: CVPEAV8P

Expected biblatex output:

@book{-estimateur,
  title = {Estimateur d'un défaut de fonctionnement d'un modulateur en quadrature et étage de modulation l'utilisant},
  langid = {french},
  note = {in French, so no case conversion, nor protection.}
}

@book{-stochastic,
  title = {A Stochastic Model of {TCP Reno} Congestion Avoidance and Control},
  langid = {american}
}

@book{-remarks,
  title = {Some Remarks on {’t Hooft’s} {S}-Matrix for Black Holes},
  langid = {american},
  note = {Zotero’s “Language” field empty, so “en-US” is assumed.}
}

@book{-alkanethiolate,
  title = {Alkanethiolate Gold Cluster Molecules with Core Diameters from 1.5 to 5.2~{nm}},
  langid = {american}
}

@book{-highspeed,
  title = {High-Speed Digital-to-{RF} Converter},
  langid = {american}
}

@article{-effect,
  title = {Effect of Immobilization on Catalytic Characteristics of Saturated {Pd-N}-Heterocyclic Carbenes in {Mizoroki-Heck} Reactions},
  langid = {american}
}

@article{-carbocyclic,
  title = {A Carbocyclic Carbene as an Efficient Catalyst Ligand for {C}–{C} Coupling Reactions},
  langid = {british}
}

@article{-pleistocene,
  title = {Pleistocene {\emph{Homo sapiens}} from {Middle Awash}, {Ethiopia}},
  langid = {british}
}

retorquere commented 8 years ago

Super, that should give me a start. That last case is going to be

@article{-pleistocene,
  title = {Pleistocene {\emph{Homo sapiens}} from {Middle} {Awash}, {Ethiopia}},
  langid = {british}
}

BTW, grouping together protected words adds a new layer of complexity.

retorquere commented 8 years ago

Do we still need no/inner/all, or can it just be a binary yes/no given the new scheme?

njbart commented 8 years ago

There is only one way of entering titles in bibtex and biblatex databases that ensures correct output with both title-case styles (e.g., Chicago) and sentence-case styles (e.g., APA), and this is entering them in title-case, with proper names and other strings that must not undergo case conversion wrapped in braces for protection.

Thus I would say we do not need an option here at all.

retorquere commented 8 years ago

There is no non-breaking space in the source of 5.2~{nm} -- do you want it in the source (then please add it by editing it here -- sorry, I am clueless about how to enter Unicode), or should I adjust the bibtex?

retorquere commented 8 years ago

In the "remarks" entry there is no language specified -- in which case I don't currently generate one. Is it proper to blithely generate "american" in its absence?

retorquere commented 8 years ago

Would you prefer {\emph{Homo sapiens}} or \emph{Homo sapiens}? From my understanding, both would achieve the same.

retorquere commented 8 years ago

https://travis-ci.org/ZotPlus/zotero-better-bibtex/builds/89329611

retorquere commented 8 years ago

The hint for using <span class="nocase">…</span> in phase 1 seems to have done it. I think it's actually sort of done.

retorquere commented 8 years ago

I could seriously use a hand in adjusting the test cases for the mismatches from the new behavior

retorquere commented 8 years ago

Note that there are genuine errors in there that I need to fix, but the bulk just needs to be updated.

retorquere commented 8 years ago

Oh and for now that only concerns the BibLaTeX tests. Do we have a resolution yet for how to do the whole caps preservation for BibTeX? And what fields should be titlecased? Currently it's only title.

njbart commented 8 years ago

> In the "remarks" entry there is no language specified -- in which case I don't currently generate one. Is it proper to blithely generate "american" in its absence?

I thought so, but no. Biblatex manual: “\MakeSentenceCase{⟨text⟩} … It only converts the ⟨text⟩ to sentence case if the langid field is undefined or if it holds a language declared with \DeclareCaseLangs … By default, american, british, canadian, english, australian, newzealand as well as the aliases USenglish and UKenglish.”
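The rule quoted from the manual can be sketched as a small predicate. The function and constant names are hypothetical; only the language list comes from the quotation above. The list is a parameter rather than a constant baked into the check, since the manual describes it as something that is declared.

```python
from typing import Optional

# Default \DeclareCaseLangs list as quoted from the biblatex manual above.
DEFAULT_CASE_LANGS = frozenset({
    "american", "british", "canadian", "english",
    "australian", "newzealand", "USenglish", "UKenglish",
})


def sentence_case_applies(langid: Optional[str],
                          case_langs=DEFAULT_CASE_LANGS) -> bool:
    """True if a sentence-case style would convert a title with this langid."""
    # An undefined langid is treated like a declared language.
    return not langid or langid in case_langs


print(sentence_case_applies(None))       # True
print(sentence_case_applies("british"))  # True
print(sentence_case_applies("french"))   # False
```

So an entry with an empty Language field does not need a generated langid to have its title case-converted.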

> there is no non-breaking space in the source of 5.2~{nm}

It’s there, in edit mode represented by a small centered dot.

> Would you prefer {\emph{Homo sapiens}} or \emph{Homo sapiens}? From my understanding, both would achieve the same.

\emph{Homo sapiens}. – {\emph{Homo sapiens}} actually does not protect at all (tested with biblatex-apa); it seems a command or even just a backslash after the opening brace removes the protection. To really force protection, {{\emph{Homo sapiens}}} seems to work.

retorquere commented 8 years ago

Ugh, JavaScript and Unicode. The NBS is there, but it's encoded differently than I had expected. I will figure that one out.

So, on the \emph, what is the preference?

retorquere commented 8 years ago

Wait, you are saying DeclareCaseLangs is not fixed. Should I stick those languages in a preference so the translator can match?

retorquere commented 8 years ago

OK, I think I got the NBS.

retorquere commented 8 years ago

Crap:

- title = {From Bell’s Theorem to Secure Quantum Key Distribution},
+  title = {From {Bell}’S {Theorem} to {Secure Quantum Key Distribution}},

How can I prevent this?
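One possible cause of the ’S, offered as a guess rather than a diagnosis of the actual BBT code: a capitaliser that treats every word boundary as the start of a word sees the "s" after a typographic apostrophe as a new word. The sketch below reproduces that effect with a regex and shows one way around it. Both function names are made up, and it does not address the extra braces in the diff.

```python
import re


def capitalise_naive(text: str) -> str:
    # \b also fires between the apostrophe and the following "s".
    return re.sub(r"\b[a-z]", lambda m: m.group().upper(), text)


def capitalise_apostrophe_aware(text: str) -> str:
    # Only capitalise a letter not preceded by a word character or an apostrophe.
    return re.sub(r"(?<![\w'’])[a-z]", lambda m: m.group().upper(), text)


print(capitalise_naive("from bell’s theorem"))             # From Bell’S Theorem
print(capitalise_apostrophe_aware("from bell’s theorem"))  # From Bell’s Theorem
```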