Closed retorquere closed 8 years ago
Do you mean for CSL JSON or for BBT? I'm not entirely certain about capitalizing user input; my idea is that BBT discloses user intent as best as possible given the impedance mismatch between the formats. User intent for capitalization is, I think, best expressed by the user capitalizing titles as desired.
If users enter “On the prosodies of the Greek and Latin languages” in Zotero, this is rendered as “On the Prosodies of the Greek and Latin Languages” in a title-case style, e.g., Chicago, and as “On the prosodies of the Greek and Latin languages” in a sentence-case style, e.g., APA.
To get the same in bibtex and biblatex, there is no other option than to convert the title to On the Prosodies of the {Greek} and {Latin} Languages; this is the only way to have it rendered as “On the Prosodies of the Greek and Latin Languages” in Chicago, and as “On the prosodies of the Greek and Latin languages” in APA.
This, I would argue, respects user intent as best as possible.
Interesting. Which processor renders it that way? Not BibTeX, then.
I'm still not entirely convinced. Adding braces around {Greek} is the only way to disclose to LaTeX you want to keep capitalisation. The easiest way to disclose that you want to have a certain capitalisation, but only if the style demands it, is to Capitalise the Source Sentence.
If you input “On the Prosodies of the Greek and Latin Languages”, does the processor you have in mind do the right thing with a style that does not do title-casing?
BTW, how should this interact with caps preservation? Surely you wouldn't want “On the prosodies of the Greek and Latin languages” to be translated to On the {Prosodies} of the {Greek} and {Latin} {Languages}?
Or would you want non-capitalised non-filler words to be capitalised, and capitalised non-filler words to be braced? How about something like iPod? This would be capitalised in this scheme. I'm not too keen on the <span class="nocase">…</span> workaround. It seems easier to provide a "capitalise this title" function to Zotero to just fix the input (assuming such can easily be done).
Ugh, I can't add things to the reference edit pane without some crazy shady monkey patching. That is going to be too brittle. Operating on the whole reference is not a problem, though.
Why specifically for english though? Doesn't this apply to other languages equally?
I have some ideas on how to get this to work, but I'll probably put it behind a preference
Is the list of fields that should be capitalised the same as the list that should get preserve caps?
To clarify your earlier question, this doesn't need to be applied to CSL JSON, since citeproc handles the capitalization already; Zotero recommends that all titles be stored sentence-case.
That was my earlier point actually. Why not have the user store the titles sentence-cased in the first place?
That is the official recommendation.
I've taken a few stabs at it but it gets increasingly messy and fragile. I'm sorry, but I'm not going to honor this one.
Why specifically for english though? Doesn't this apply to other languages equally?
No, only English has both title-case and sentence-case styles.
Or would you want non-capitalised non-filler words to be capitalised, and capitalised non-filler words to be braced?
Exactly.
How about something like iPod? This would be capitalised in this scheme.
EDIT: “iPod” shouldn’t be capitalised by BBT, and it should be protected.
Is the list of fields that should be capitalised the same as the list that should get preserve caps?
Yes. bib(la)tex needs titles in title case, and those words that must not be lowercased again by sentence-case styles such as biblatex-apa need protection.
I'm sorry, but I'm not going to honor this one.
That’d be a pity. It’s necessary since the conventions of bib(la)tex and CSL are incompatible: bib(la)tex expects titles in title-case, and words that must not be lowercased must be protected, but CSL expects titles in sentence-case, and words that must not be uppercased must be protected. (The latter doesn’t happen so very often, but without protection CSL title-case styles would turn, e.g., “nm” (nanometer) into “Nm” (newton-meter), something that should really be avoided.)
<span class="nocase">…</span>, BTW, is officially supported by citeproc-js and pandoc-citeproc.
Over at pandoc, we’ve been through this whole exercise when writing pandoc-citeproc’s biblatex -> CSL converter (the inverse of what I’d like BBT to do), but it’s not that complicated after all, and seems to work great.
But how would I know that iPod should be excluded from capitalization? And why would it not be better to assume title-case and convert title-case to sentence-case for CSL? That seems to be a lot simpler to me.
This one is not going to be easy. It will require rethinking of the way I convert the HTML-ish input to LaTeX.
But how would I know that iPod should be excluded from capitalization?
All words in camel case should be excluded.
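A minimal sketch of what "camel case" could mean for this purpose, in Python rather than in BBT's own code. The helper name `is_camel_case` and the rule itself (an uppercase letter directly after a lowercase one) are assumptions made for illustration, not anything BBT or pandoc-citeproc is known to use.

```python
def is_camel_case(word: str) -> bool:
    """Hypothetical rule: a word is camel case if an uppercase letter
    directly follows a lowercase letter anywhere inside it."""
    return any(a.islower() and b.isupper() for a, b in zip(word, word[1:]))


for word in ["iPod", "LaTeX", "Prosodies", "prosodies", "TCP"]:
    print(word, is_camel_case(word))
```

Under this rule, all-caps words such as TCP are not camel case and would need their own check if they are to be protected as well.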
And why would it not be better to assume title-case and convert title-case to sentence-case for CSL? That seems to be a lot simpler to me.
The CSL folks insist storing titles in sentence case is simpler: From http://docs.citationstyles.org/en/stable/specification.html#sentence-case-conversion:
CSL processors don’t recognize proper nouns. As a result, strings in sentence case can be accurately converted to title case, but not vice versa.[*] For this reason, it is generally preferable to store strings such as titles in sentence case, and only use [in CSL style files the] text-case [attribute] if a style desires another case.
See also https://www.zotero.org/support/kb/sentence_casing
EDIT: [*] This statement, BTW, is just as false as if biblatex claimed “strings in title case can be accurately converted to sentence case, but not vice versa” (though biblatex claims no such thing). In truth, both need (and processors for both have) some mechanism for protecting strings from being converted.
The problem here is with the term "words". So far, BBT has done char-for-char translation to escape such problems:
Is T'Pau one word or two? If two, are “exactly,said he” and “P.T. Barnum” then also two?
thar<span class="nocase"> she <b>blow</b>s is valid input, so "word boundary" in the regex sense wouldn't pick out "words" right.
Caps preservation runs after all these problems have been evaded by just scanning through the resulting LaTeX string, but that won't do, because prosodies would be offered to the caps preserver capitalized, and would thus be protected.
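To make the "words" problem concrete, here is a small Python sketch comparing two naive tokenizers on the examples above. The function names are made up for illustration. Neither tokenizer is proposed as the right answer; the point is only that they disagree.

```python
import re


def regex_words(text):
    # \b\w+\b treats apostrophes, commas and periods as word boundaries.
    return re.findall(r"\b\w+\b", text)


def space_words(text):
    # Splitting on whitespace keeps punctuation glued to the word.
    return text.split()


for sample in ["T'Pau", "exactly,said he", "P.T. Barnum"]:
    print(repr(sample), regex_words(sample), space_words(sample))
```

The regex splits T'Pau and P.T. into fragments, while whitespace splitting treats “exactly,said” as a single word, so each approach gets a different subset of the cases wrong.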
I'm not saying it can't be done (it can) or that it shouldn't (I'm open to it), but it's a complicated matter in which I'd have to spec out the entire transformation process of Zotero input to LaTeX output, as the existing simpler character transformation pipe can't just easily be adjusted to do the work, and it's sufficiently complex that I can't just start tinkering until I get it right.
Ahh, this is a mess! Sometimes using LaTeX, sometimes using the Libreoffice plugin, storing things correctly in Zotero and getting a correct output with all tools and different styles becomes difficult (though I must admit that I have mostly used APA now). CSL will hopefully "soon" (http://xbiblio-devel.2463403.n2.nabble.com/Sentence-case-variants-td7578968.html) support uppercasing subtitles, so the recommendation would "soon" be to store titles as "Writing a good title according to Johansson: the importance of subtitles". Then CSL would keep it as is for Vancouver-styles, capitalize "the" for APA-styles ("Writing a good title according to Johansson: The importance of subtitles"), and title-case it for Chicago-styles: ("Writing a Good Title According to Johansson: The Importance of Subtitles").
But before this is implemented, I store my titles as: "Writing a good title according to Johansson: The importance of subtitles".
The desired biblatex-form for this (very simple) example should then be "Writing a Good Title According to {Johansson}: The Importance of Subtitles". The "the" should not be protected (to allow for Vancouver-styles), which it of course is if I store it as uppercase as now.
Perhaps I should just write and cite in Swedish where there is kind of one rule (title-case doesn't exist).
Some of the discussions: https://forums.zotero.org/discussion/35190/1/beta-capitalization-after-colons/ https://forums.zotero.org/discussion/45811/titlesubtitle-gui/
CSL will hopefully "soon" … support uppercasing subtitles
By “uppercasing subtitles”, you mean “uppercasing the first character of the subtitle”, right? – For Zotero, there is a citeproc-js processor plugin that does this: https://juris-m.github.io/downloads/, look for “Propachi Upper”.
I’m not sure OTTOMH how biblatex handles this, but not protecting the first character of the subtitle is most likely a good idea.
with subtitle, you mean the stuff after the colon?
Can we start fleshing out an algorithm and test cases? I've thought about this long and hard, and I've come to the conclusion that the previous staged approach (capitalize, convert to LaTeX char-by-char, then protect) won't work, because:
<i>...</i> messes with it, and the notion of "character" in relation with "position" is problematic to begin with because of the wonderful mess that is JavaScript Unicode handling.
So from the looks of it, I will be writing a backtracking recursive descent parser for this by hand. This is a major undertaking, so having a good idea of the algorithm and test cases will be crucial. Way too easy to get this wrong.
Why not protect first, then capitalize & convert to LaTeX char-by-char?
Protection consists of putting braces around words; if I do that in step one, they'll be converted to \{ in step 3.
Plus, if I inject anything before capitalization, I will likely mess with the "wordiness" of the string of characters, which in turn interferes with capitalization.
Protection consists of putting braces around words …
I’d use something else in this first step. The title may already contain <span class="nocase">…</span>; maybe this could be used to protect all other uppercase and camelcase words, too. The last step, after capitalization & conversion, would be to replace <span class="nocase">…</span> by {…}.
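The ordering problem and the marker idea can be sketched in Python as follows. Everything here is an illustrative assumption: the escape table is deliberately tiny, the function names are invented, and the list of words to protect is passed in by hand instead of being detected.

```python
import re

# Deliberately small, hypothetical escape table for the char-by-char step.
LATEX_SPECIALS = {"{": r"\{", "}": r"\}", "&": r"\&", "%": r"\%", "#": r"\#"}
OPEN, CLOSE = '<span class="nocase">', "</span>"


def escape_latex(text):
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def braces_first(title, protected):
    # Braces added before escaping are themselves escaped.
    for word in protected:
        title = title.replace(word, "{" + word + "}")
    return escape_latex(title)


def markers_first(title, protected):
    # Mark protected words with nocase spans, escape only the text
    # between markers, then turn the markers into braces at the very end.
    for word in protected:
        title = title.replace(word, OPEN + word + CLOSE)
    pattern = "(" + re.escape(OPEN) + "|" + re.escape(CLOSE) + ")"
    out = []
    for part in re.split(pattern, title):
        if part == OPEN:
            out.append("{")
        elif part == CLOSE:
            out.append("}")
        else:
            out.append(escape_latex(part))
    return "".join(out)


print(braces_first("R&D at Bell", ["Bell"]))
print(markers_first("R&D at Bell", ["Bell"]))
```

With braces first, the protection ends up as literal \{Bell\} in the output; with markers first, the braces survive as grouping.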
I'll think about this. It requires a rewrite of the HTML converter, but it's less tricky than a rec-desc parser for sure. Test cases would still be of paramount importance, as I can't rely on the existing test suite -- things will legitimately change, so I'll need a reference set for what is the correct behavior.
By “uppercasing subtitles”, you mean “uppercasing the first character of the subtitle”, right?
Yes, of course, sloppy writing.
with subtitle, you mean the stuff after the colon?
Yes!
Thanks for thinking through this. I wonder if someone has put up a list of "tricky cases" in the Zotero/CSL-discussions.
It's just that I seem to recall a discussion where the comma was considered to be the subtitle separator.
In any case, what exactly ought to be done with a subtitle?
As I understand it, Zotero users should (at least when CSL supports this properly) store everything except the first word and proper nouns as lowercase, i.e: "Writing a good title according to Johansson: the importance of subtitles". Then biblatex (I don't know about bibtex) should get this title-cased, with capitalized words beyond the first word protected: "Writing a Good Title According to {Johansson}: The Importance of Subtitles".
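A rough Python sketch of that transformation, under loud assumptions: the filler-word list is a short made-up sample, words are split on whitespace only, and a capital on the first word of the title or of the subtitle is treated as positional and left unprotected. It is meant as a starting point for test cases, not as BBT's algorithm.

```python
import re

# Hypothetical, incomplete list of filler words that stay lowercase.
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for",
               "in", "of", "on", "or", "the", "to"}


def to_biblatex_title(title: str) -> str:
    out = []
    starts_segment = True  # first word of the title or of a subtitle
    for token in re.split(r"(\s+)", title):
        if not token or token.isspace():
            out.append(token)
            continue
        # Separate leading/trailing punctuation from the word itself.
        lead, core, trail = re.match(r"(\W*)(.*?)(\W*)$", token).groups()
        if not core:
            out.append(token)
            continue
        has_upper = any(ch.isupper() for ch in core)
        inner_upper = any(ch.isupper() for ch in core[1:])
        if has_upper and (inner_upper or not starts_segment):
            core = "{" + core + "}"            # user capitalization: protect
        elif not has_upper and (starts_segment or core not in SMALL_WORDS):
            core = core[0].upper() + core[1:]  # title-case, unprotected
        out.append(lead + core + trail)
        starts_segment = trail.endswith(":")
    return "".join(out)


print(to_biblatex_title(
    "Writing a good title according to Johansson: the importance of subtitles"))
```

One known gap in this sketch: a lowercase unit such as “nm” would be capitalized unless it is protected by some other means.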
In this case subtitles don't have to be considered explicitly at all as far as I can tell. But in the case of other conventions for user data... Well, it gets messy, perhaps nothing should be done? I guess any capitalization done by betterbibtex should be switchable via some option.
You'd better believe it when it comes to the latter.
Could you start putting in a test case for the case you just mentioned? It would at least give me an initial target
A few test cases: CVPEAV8P
Expected biblatex output:
@book{-estimateur,
title = {Estimateur d'un défaut de fonctionnement d'un modulateur en quadrature et étage de modulation l'utilisant},
langid = {french},
note = {in French, so no case conversion, nor protection.}
}
@book{-stochastic,
title = {A Stochastic Model of {TCP Reno} Congestion Avoidance and Control},
langid = {american}
}
@book{-remarks,
title = {Some Remarks on {’t Hooft’s} {S}-Matrix for Black Holes},
langid = {american},
note = {Zotero’s “Language” field empty, so “en-US” is assumed.}
}
@book{-alkanethiolate,
title = {Alkanethiolate Gold Cluster Molecules with Core Diameters from 1.5 to 5.2~{nm}},
langid = {american}
}
@book{-highspeed,
title = {High-Speed Digital-to-{RF} Converter},
langid = {american}
}
@article{-effect,
title = {Effect of Immobilization on Catalytic Characteristics of Saturated {Pd-N}-Heterocyclic Carbenes in {Mizoroki-Heck} Reactions},
langid = {american}
}
@article{-carbocyclic,
title = {A Carbocyclic Carbene as an Efficient Catalyst Ligand for {C}–{C} Coupling Reactions},
langid = {british}
}
@article{-pleistocene,
title = {Pleistocene {\emph{Homo sapiens}} from {Middle Awash}, {Ethiopia}},
langid = {british}
}
Super, that should give me a start. That last case is going to be
@article{-pleistocene,
title = {Pleistocene {\emph{Homo sapiens}} from {Middle} {Awash}, {Ethiopia}},
langid = {british}
}
BTW. Grouping together protected words adds a new layer of complexity.
Do we still need no/inner/all, or can it just be a binary yes/no given the new scheme?
There is only one way of entering titles in bibtex and biblatex databases that ensures correct output with both title-case styles (e.g., Chicago) and sentence-case styles (e.g., APA), and this is entering them in title-case, with proper names and other strings that must not undergo case conversion wrapped in braces for protection.
Thus I would say we do not need an option here at all.
There is no non-breaking space in the source of 5.2~{nm} -- do you want it in the source (then please add it by editing it here -- sorry, I am clueless about how to enter unicode), or should I adjust the bibtex?
In the "remarks" entry there is no language specified -- in which case I don't currently generate one. Is it proper to blithely generate "american" in its absence?
Would you prefer {\emph{Homo sapiens}} or \emph{Homo sapiens}? From my understanding, both would achieve the same.
The hint for using <span class="nocase">…</span> in phase 1 seems to have done it. I think it's actually sort of done.
I could seriously use a hand in adjusting the test cases for the mismatches from the new behavior
Note that there are genuine errors in there that I need to fix, but the bulk just needs to be updated.
Oh and for now that only concerns the BibLaTeX tests. Do we have a resolution yet for how to do the whole caps preservation for BibTeX? And what fields should be titlecased? Currently it's only title.
In the "remarks" entry there is no language specified -- in which case I don't currently generate one. Is it proper to blithely generate "american" in its absence?
I thought so, but no. Biblatex manual: “\MakeSentenceCase{⟨text⟩} … It only converts the ⟨text⟩ to sentence case if the langid field is undefined or if it holds a language declared with \DeclareCaseLangs … By default, american, british, canadian, english, australian, newzealand as well as the aliases USenglish and UKenglish.”
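If the exporter wanted to mirror that rule when deciding whether a title needs title-casing and protection at all, the check could look roughly like this Python sketch. The set holds only the default identifiers quoted from the manual; the function name is invented, and a real setup could declare a different list via \DeclareCaseLangs.

```python
# Default identifiers as quoted from the biblatex manual above.
DEFAULT_CASE_LANGS = {
    "american", "british", "canadian", "english",
    "australian", "newzealand", "USenglish", "UKenglish",
}


def subject_to_case_conversion(langid, case_langs=DEFAULT_CASE_LANGS):
    """Hypothetical helper: True if biblatex would sentence-case this entry,
    i.e. langid is undefined/empty or is one of the declared languages."""
    if langid is None or not langid.strip():
        return True
    return langid.strip() in case_langs


for langid in [None, "", "american", "french", "USenglish"]:
    print(repr(langid), subject_to_case_conversion(langid))
```

Passing `case_langs` as a parameter leaves room for a user preference, since the declared list is not fixed.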
there is no non-breaking space in the source of 5.2~{nm}
It’s there, in edit mode represented by a small centered dot.
Would you prefer {\emph{Homo sapiens}} or \emph{Homo sapiens}? From my understanding, both would achieve the same.
\emph{Homo sapiens}. – {\emph{Homo sapiens}} actually does not protect at all (tested with biblatex-apa); it seems a command or even just a backslash after the opening brace removes the protection. To really force protection, {{\emph{Homo sapiens}}} seems to work.
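Taking that observation at face value (it was only reported for biblatex-apa here), a protection helper could add the extra brace level whenever the fragment starts with a backslash. This Python sketch uses an invented function name and encodes nothing beyond the behaviour described in the comment above.

```python
def protect(latex_fragment: str) -> str:
    """Hypothetical helper: wrap a fragment in braces for case protection,
    doubling the braces when the fragment starts with a command."""
    if latex_fragment.startswith("\\"):
        # A backslash right after the opening brace reportedly cancels
        # the protection, so add one more brace level.
        return "{{" + latex_fragment + "}}"
    return "{" + latex_fragment + "}"


print(protect(r"\emph{Homo sapiens}"))
print(protect("Ethiopia"))
```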
Ugh, javascript and unicode. The NBS is there, but it's encoded differently than I had expected. I will figure that one out.
So, on the \emph, what is the preference?
Wait, you are saying DeclareCaseLangs is not fixed. Should I stick those languages in a preference so the translator can match?
OK, I think I got the NBS.
Crap:
- title = {From Bell’s Theorem to Secure Quantum Key Distribution},
+ title = {From {Bell}’S {Theorem} to {Secure Quantum Key Distribution}},
how to prevent this?
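The thread does not say what produces the ’S here, so the following is only an illustration of the general class of bug, in Python: a capitalizer that treats the apostrophe as a word boundary uppercases the letter after it. Python's built-in `str.title()` behaves this way; the alternative `capitalize_words` is a made-up helper that lets a word contain apostrophes.

```python
import re


def naive_capitalize(text):
    # str.title() starts a new "word" after any non-letter, including ’.
    return text.title()


# A word is a run of letters, optionally joined by ' or ’ to further letters.
_WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def capitalize_words(text):
    # Uppercase only the first letter of each word; leave the rest untouched.
    return _WORD.sub(lambda m: m.group(0)[0].upper() + m.group(0)[1:], text)


print(naive_capitalize("from Bell’s theorem"))
print(capitalize_words("from Bell’s theorem"))
```

This only addresses the ’S part; the over-protection of {Theorem} and the other words in the diff is a separate issue.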
@nickbart1980 says: