plk / biblatex

biblatex is a sophisticated bibliography system for LaTeX users. It has considerably more features than traditional bibtex and supports UTF-8

\MakeCapital not working with № #960

Closed odomanov closed 4 years ago

odomanov commented 4 years ago

\MakeCapital doesn't work with strings starting with the symbol № (with pdflatex). The error is:

! Package inputenc Error: Unicode character   (U+0084)    
(inputenc)                not set up for use with LaTeX.  

\textnumero works fine.

\documentclass{article}

\usepackage{biblatex}
\usepackage{textcomp}

\begin{document}
№ \textnumero\ ---
\MakeCapital{№ 1} 
\MakeCapital{\textnumero~2}
\MakeCapital{aaa№} 
\end{document}
moewew commented 4 years ago

The same error can be reproduced with

\documentclass{article}

\usepackage{textcomp}

\begin{document}
\uppercase{№}
\end{document}

At some point we'll probably switch to l3text

\documentclass{article}

\usepackage{textcomp}
\usepackage{expl3}

\ExplSyntaxOn
\newcommand*{\newuppercase}{\text_uppercase:n}
\ExplSyntaxOff

\begin{document}
\newuppercase{№}
\end{document}

But I'm not sure if we can find a way to fix this in the meantime.

odomanov commented 4 years ago

Does this mean that the only way to cope with this now is to replace № with \textnumero?

moewew commented 4 years ago

Basically yes. - Or use a Unicode engine.

It's similar to ä (which won't break, but also won't be capitalised - \"a works as expected).

I don't think it makes sense to try and fix up what biblatex does here. But since I hope to be able to switch to l3text in the not too distant future anyway, this shouldn't be much of an issue.
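A minimal sketch of the workaround described above, assuming pdflatex and the legacy case changing code: macro input such as \textnumero and \"a is used in place of the literal characters № and ä. Nothing here goes beyond what the thread reports; under a Unicode engine (xelatex or lualatex) the literal characters should not need this treatment.

```latex
% Workaround sketch for pdflatex: feed \MakeCapital macro forms
% instead of raw non-ASCII input.
\documentclass{article}

\usepackage{biblatex}
\usepackage{textcomp}

\begin{document}
% \textnumero instead of a literal numero sign: reported above to work
\MakeCapital{\textnumero~1}

% \"a instead of a literal a-umlaut: reported above to be capitalised
\MakeCapital{\"arger}
\end{document}
```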

odomanov commented 4 years ago

I see, thank you. I can also see that \MakeCapital doesn't work with Cyrillic letters --- no errors, simply doesn't capitalize. This probably should also wait for l3text.

moewew commented 4 years ago

@josephwright Is expl3 supposed to be able to deal with the following?

\documentclass{article}

\usepackage{textcomp}
\usepackage{expl3}

\begin{document}
\ExplSyntaxOn
\text_titlecase_first:n{\textnumero}
\end{document}
moewew commented 4 years ago

A proof of concept with expl3 case changing functions is at https://github.com/plk/biblatex/compare/dev...moewew:l3text. There are a few things that still need to be thought through:

  1. The expl3 case changing functions don't do brace protection. I think that is a good design decision given that braces mean a lot of things, especially in the BibTeX context.
    • Should we try and get brace protection back?
    • If not, we probably need to make expl3 case changing optional to avoid backwards compatibility issues. (Or make it the default and offer an opt-out.)
  2. The case changing functions have a language option, we should probably try and use it (so far the code uses the single-argument versions.) That may need some thought and conversion of babel/polyglossia names.
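To illustrate the second point, here is a rough sketch of how babel/polyglossia-style language names might be mapped onto the language codes taken by the two-argument expl3 functions. The names \c_demo_lang_prop, \demo_uppercase:nn and \DemoUppercase are made up for this example, and the mapping table is only a guess at what a real conversion would contain.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{expl3,xparse}

\ExplSyntaxOn
% Hypothetical table: babel/polyglossia name -> language code for l3text
\prop_const_from_keyval:Nn \c_demo_lang_prop
  { turkish = tr , dutch = nl , greek = el , lithuanian = lt }

% Use the language-aware variant when the name is known,
% otherwise fall back to the single-argument version.
\cs_new:Npn \demo_uppercase:nn #1#2
  {
    \prop_if_in:NnTF \c_demo_lang_prop {#1}
      {
        \exp_args:Ne \text_uppercase:nn
          { \prop_item:Nn \c_demo_lang_prop {#1} } {#2}
      }
      { \text_uppercase:n {#2} }
  }

\NewDocumentCommand \DemoUppercase { m m }
  { \demo_uppercase:nn {#1} {#2} }
\ExplSyntaxOff

\begin{document}
% With the Turkish rules the "i" is expected to become a dotted capital
\DemoUppercase{turkish}{istanbul}

% Unknown name: plain language-independent uppercasing
\DemoUppercase{english}{istanbul}
\end{document}
```

A real implementation would presumably take the language from the entry's langid or the document language rather than from an explicit argument.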
josephwright commented 4 years ago

@moewew There are still things to do in the expl3 code for emulating \protected@edef: I'll have to work on it.

josephwright commented 4 years ago

@moewew Issues with \textnumero (etc.) fixed in expl3 for next release.

moewew commented 4 years ago

> @moewew Issues with \textnumero (etc.) fixed in expl3 for next release.

Thank you very much. I also asked to get a feeling for what expl3 wants to support here. As I understand it, you are aiming at a very general solution, so am I right in thinking that everything that could reasonably appear in a title field should be OK?

Do you have any opinion about brace protection (see my two questions above)? Or a feeling for how difficult it would be to get brace protection back for the case changing functions?

josephwright commented 4 years ago

@moewew The aim is to cover 'any reasonable text', which means emulating \protected@edef as far as I can.

On the brace business, it's all doable but it's a question of interfaces. I'd have to provide a \text_uppercase_non_recursive:n or have some switch. It's mainly a question of effort. Perhaps one for a mail to the team? We are thinking of changing \MakeUppercase itself, or at least \MakeTextUppercase, so input is really useful. (@davidcarlisle might have thoughts here.)

davidcarlisle commented 4 years ago

\uppercase isn't usable anyway on non-ASCII text, but if you are not ready to switch to expl3 yet, \MakeUppercase and \MakeTextUppercase both seem to work fine on

\documentclass{article}

\usepackage{textcomp}

\begin{document}
\MakeUppercase{abc №}
\end{document}

Not sure what you mean by brace protection here?

josephwright commented 4 years ago

@davidcarlisle The starting point here is \MakeCapital, which is basically \MakeUppercase #1. Brace protection is the BibTeX-like form of 'escaping' from case changing.

moewew commented 4 years ago

Just so it doesn't get lost.

Regarding https://github.com/plk/biblatex/issues/960#issuecomment-575914507 I think that many people will rely on brace protection in their existing bib databases. Most importantly in cases where English titles get sentence-cased and people protect words that must not be lower-cased (such as [language] names).

So in order to not break documents I'd very strongly opt for maintaining brace protection. In the long run, I'd argue in favor of implementing some sane semantic markup for case protection (as I did in another ticket).

Originally posted by @jspitz in https://github.com/plk/biblatex/issues/941#issuecomment-575985644

davidcarlisle commented 4 years ago

@josephwright oh that. If we could revise history to make that (and {\'e} markup) go away it would be a good thing. But I suppose there are too many existing bib files.....

I wouldn't integrate that into the main tex-level case changer, but we could have a top level prepass function that converts

{abc {Keep Mixed} zzz} to {abc \NoCaseChange{Keep Mixed} zzz}

then \MakeTextUppercase would do the right thing.

you could write it in expl3 (or 2e, if needed), but alternatively couldn't biber have an option to do that while extracting the fields from the bib file? It really is a bibtex syntax rather than TeX one so handling it at the bib file parse level would seem reasonable to me.

moewew commented 4 years ago

@plk Could Biber convert

title = {Text {Protect} Text},

to

title = {Text \NoCaseChange{Protect} Text},

I was actually hoping to be able to use the expl3 change as an incentive to have users move away from {...} for case protection to something more reasonable like \NoCaseChange. As in: We'll make expl3 case changing optional, it won't support brace protection, but will work properly for all the other stuff that biblatex's own case changer currently can't handle. But @jspitz's comment makes me think people won't be willing to accept that, so we may have to look into brace protection.
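For comparison, a small test document with both kinds of markup side by side might look like the following sketch. The entry keys, author and titles are invented for illustration, \providecommand is only there so that \NoCaseChange exists on setups where nothing else defines it, and how each title comes out depends on the biblatex version and case changer in use.

```latex
\documentclass{article}

% Invented test entries: one with brace protection, one with a macro
\begin{filecontents}[overwrite]{\jobname.bib}
@book{demo:braces,
  author = {Doe, Jane},
  title  = {An Introduction to {German} Grammar},
  date   = {2020},
}
@book{demo:macro,
  author = {Doe, Jane},
  title  = {An Introduction to \NoCaseChange{German} Grammar},
  date   = {2020},
}
\end{filecontents}

\usepackage[backend=biber]{biblatex}
\addbibresource{\jobname.bib}

% Fallback definition in case neither the kernel nor textcase provides one
\providecommand*{\NoCaseChange}[1]{#1}

% Sentence-case titles so that the protection actually matters
\DeclareFieldFormat{titlecase}{\MakeSentenceCase*{#1}}

\begin{document}
\nocite{*}
\printbibliography
\end{document}
```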

jspitz commented 4 years ago

The point is that most users won't read your release notes, and I suppose many will not even notice the case (mis-)change in their documents caused by such a change. So unless you want to make expl3 casing opt-in (which would defeat the whole exercise IMHO) something that deals with brace protection is a must, unless you want to upset users. A biber-level way seems fine to me (not sure about the status of the bibtex[8] engine nowadays).

moewew commented 4 years ago

Oh yes I forgot that, if Biber has to do the case protection conversion, then BibTeX is an issue.

davidcarlisle commented 4 years ago

@moewew I was thinking that bibtex was OK as it already detects this use, but I suppose for biblatex you really need something like \NoCaseChange to be inserted so you can handle the text later, and bibtex simply detecting the braces and skipping case changing is not enough... It could be written in tex, shame though as it's probably only a line of perl:-)

plk commented 4 years ago

Yes, we can do this in biber but it's always a little messy with edge cases as it has to be done with regexps and people can (and do) put arbitrary TeX into datasource fields which makes simple brace protection non-trivial to detect.

moewew commented 4 years ago

No, biblatex doesn't use BibTeX's case changing function. The backend passes the text through unchanged and case changing happens on the LaTeX side. (I guess because the idea is that all formatting happens on the LaTeX side.)

davidcarlisle commented 4 years ago

something not unlike this may be enough to do it in TeX, (only tested on this one string)

\documentclass{article}

\usepackage{textcomp}

\usepackage[overload]{textcase}

\begin{document}

{abc {Keep Mixed} zzz  \textbf{but upper this}}

\MakeUppercase{abc {Keep Mixed} zzz  \textbf{but upper this}}

\def\zz#1{\MakeUppercase{\expandafter\zzz\space #1\endzzz. {}}}
\def\zzz#1 #{ #1\zzz\NoCaseChange}
\def\endzzz#1#2#3{}

\zz{abc {Keep Mixed} zzz  \textbf{but upper this}}

\end{document}
jspitz commented 4 years ago

I suppose it should ideally also handle nested braces:

\zz{abc {Keep Mixed} zzz  \textbf{but upper this {not this}}}
davidcarlisle commented 4 years ago

@jspitz for expl3 case changer (which is stepping through character by character anyway) that may be possible, for \Make(Text)Uppercase it really doesn't make sense to add hundreds of lines of fragile tex code to add this on top of the existing code which is just a thin wrapper around \uppercase, so I think for a non expl3 setting that's as much as is reasonable.

(I don't actually know exactly what criterion bibtex uses to classify braces (I could check the sources) but as the exact rules force accents to be added as {\'e} not \'{e} and so break kerning and ligatures, the temptation not to follow them exactly might be strong...)

jspitz commented 4 years ago

@davidcarlisle I'd say the goal should be to provide, as much as possible, a way that doesn't change the output for users who use brace protection when expl3 casing is introduced. Biblatex's logic (which is not consistent) does already differ from BibTeX's here, but this is documented in the manual (sec. 4.6.4, \MakeSentenceCase).

davidcarlisle commented 4 years ago

@jspitz sure that is a good aim, but bibtex's rules are very weird and for example the example you posted earlier

{abc {Keep Mixed} zzz \textbf{but upper this {not this}}}

bibtex would skip the entire \textbf argument, as it skips all brace groups unless the content of the group starts with a \abc csname. there really is no good place that one can insert \NoCaseChange to emulate that behaviour.

josephwright commented 4 years ago

We can follow whatever logic we want :) Sounds like what we want is a 'wrapper': \text_bibtex_to_expl:n or some such. If I match the Biber behaviour in an expandable function that could be used

\text_titlecase:n { \text_bibtex_to_expl:n {#1} }

would that 'work'?

jspitz commented 4 years ago

> @jspitz sure that is a good aim, but bibtex's rules are very weird and for example the example you posted earlier
>
> {abc {Keep Mixed} zzz \textbf{but upper this {not this}}}
>
> bibtex would skip the entire \textbf argument, as it skips all brace groups unless the content of the group starts with a \abc csname.

Frankly, I have not tested this with Biblatex. For Biblatex, in {Abc {Keep Mixed} Zzz \textbf{but Lower This}} the argument of \textbf would not be sentence-cased unless you write {Abc {Keep Mixed} Zzz {\textbf{but Lower This}}}

josephwright commented 4 years ago

@davidcarlisle It's not 100s of lines ;)

moewew commented 4 years ago

I was actually hoping that the switch to expl3 would be a good pretext to drop some of the weird biblatex case protection behaviour for an overall more sane approach. I appreciate that that has backwards compatibility implications, but I was hoping to keep the main features the same and maybe drop some of the more obscure rules. People who rely on the more obscure stuff are hopefully happier to read release notes and accept that some change might be useful.

josephwright commented 4 years ago

@moewew A compatibility function of the type I've suggested would be opt-in; quite easy to arrange that older behaviour is deprecated.

jspitz commented 4 years ago

IMO some of the weirder behaviors could be dropped indeed, as this has always been shaky (and even documented as such). The main protection (grouping via brace or macro and the undoing of macro grouping via outer macro) should probably be emulated.

plk commented 4 years ago

One option here is to extend biber's data annotation feature so that ranges of characters can be semantically tagged, something like:

TITLE = {Some title with a protected part}
TITLE+an:protected = {r=19-27}

This could generally be used to get rid of markup in data. It would require macros in biblatex to apply some formatting to a character range in a string (I assume expl3 has such fancy things ...). I am not convinced this is very useful but it's something that has occurred to me from time to time.

josephwright commented 4 years ago

Proof of principle:

\cs_new:Npn \text_bibtex_to_expl:n #1
  {
    \__text_bibtex_loop:w #1
      \q_recursion_tail \q_recursion_stop
  }
\cs_new:Npn \__text_bibtex_loop:w #1 \q_recursion_stop
  {
    \tl_if_head_is_N_type:nTF {#1}
      { \__text_bibtex_N_type:N }
      {
        \tl_if_head_is_group:nTF {#1}
          { \__text_bibtex_group:n }
          { \__text_bibtex_space:w }
      }
    #1 \q_recursion_stop
  }
\cs_new:Npn \__text_bibtex_N_type:N #1
  {
    \quark_if_recursion_tail_stop:N #1
    \exp_not:n {#1}
    \__text_bibtex_loop:w
  }
\cs_new:Npn \__text_bibtex_group:n #1
  {
    {
      \bool_lazy_and:nnTF
        { \tl_if_head_is_N_type_p:n {#1} }
        {
          \exp_after:wN \token_if_cs_p:N \exp_after:wN { \tl_head:w #1 \q_stop }
        }
        { \exp_not:n {#1} }
        { \exp_not:n { \NoCaseChange {#1} } }
    }
    \__text_bibtex_loop:w
  }
\exp_last_unbraced:NNo \cs_new:Npn \__text_bibtex_space:w \c_space_tl
  {
    \c_space_tl
    \__text_bibtex_loop:w
  }
davidcarlisle commented 4 years ago

@josephwright ah OK you add \NoCaseChange inside the braces, that's simpler and better than I had in mind. I was not seeing a good place to add \NoCaseChange in the macro argument case as I was thinking of changing {abc} to \NoCaseChange{abc} ie putting it before the brace....

so

\let\test\text_bibtex_to_expl:n
\ExplSyntaxOff

\typeout{\test{abc {abc} abc}}
\typeout{\test{abc \textit{abc} abc}}
\typeout{\test{abc {\itshape abc} abc}}

gives

abc {\NoCaseChange {abc}} abc
abc \textit {\NoCaseChange {abc}} abc
abc {\itshape abc} abc

which looks good to me

jspitz commented 4 years ago

Yes, the result looks good indeed. As to the macro, I wonder whether \NoCaseChange is a good choice (due to the name clash with textcase). \KeepCase maybe (or, shorter to write, \KpCase)?

jspitz commented 4 years ago

And while talking about the UI: It would be nice to have a global solution to case-protect lexical items. Something like

% language-specific
\AddtoCaseProtection[english]{English,German,French,...}
% general
\AddtoCaseProtection{USA,APA,Knuth,...}

which could be used in the document preamble, or *.lbx, or *.bbx (of styles that use \MakeSentenceCase)

I understand that something along this line has also been considered on the latex3 level in the long run, but still, an interface on biblatex level would be very comfortable.
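One way such a word list could be wired up is sketched below, purely as an illustration. \DemoAddToCaseProtection and \DemoLowercase are hypothetical names (not existing biblatex or expl3 commands), the language-specific optional argument is left out, and the list entries are assumed to be plain words without regex-special characters. The idea is to wrap every listed word in \NoCaseChange with l3regex before handing the text to the expl3 case changer.

```latex
\documentclass{article}
\usepackage{expl3,xparse}

% Fallback definition in case nothing else provides \NoCaseChange
\providecommand*{\NoCaseChange}[1]{#1}

\ExplSyntaxOn
% Tell the expl3 case changer to leave \NoCaseChange and its argument alone
\tl_put_right:Nn \l_text_expand_exclude_tl   { \NoCaseChange }
\tl_put_right:Nn \l_text_case_exclude_arg_tl { \NoCaseChange }

\seq_new:N \g_demo_protect_seq
\tl_new:N  \l_demo_text_tl

% Hypothetical interface: add a comma list of words to the protected list
\NewDocumentCommand \DemoAddToCaseProtection { m }
  {
    \clist_map_inline:nn {#1}
      { \seq_gput_right:Nn \g_demo_protect_seq {##1} }
  }

% Wrap each listed word (matched as a whole word) in \NoCaseChange{...},
% then lowercase the result
\NewDocumentCommand \DemoLowercase { m }
  {
    \tl_set:Nn \l_demo_text_tl {#1}
    \seq_map_inline:Nn \g_demo_protect_seq
      {
        \regex_replace_all:nnN { \b ##1 \b }
          { \c{NoCaseChange} \cB\{ \0 \cE\} } \l_demo_text_tl
      }
    \exp_args:NV \text_lowercase:n \l_demo_text_tl
  }
\ExplSyntaxOff

\DemoAddToCaseProtection{German,Knuth}

\begin{document}
\DemoLowercase{A Study of German Texts by Knuth}
\end{document}
```

Matching on plain words like this would also catch the word where protection is not wanted, which is one reason explicit markup in the data remains the safer option.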

josephwright commented 4 years ago

On the naming, it would be best to pick something that is clear and descriptive. We are talking about a marker to go into BibTeX files and presumably to be shared with other tools. There's no real issue in calling it \NoCaseChange as nothing is baked into expl3 here, and the textcase definition is a simple no-op. So one can use the same command name for both approaches.

\documentclass{article}
\usepackage{expl3}
\usepackage{textcase}
\ExplSyntaxOn
\tl_put_right:Nn \l_text_expand_exclude_tl { \NoCaseChange }
\tl_put_right:Nn \l_text_case_exclude_arg_tl { \NoCaseChange }
\ExplSyntaxOff
\begin{document}
\MakeTextLowercase{\NoCaseChange{iPhone} iPhone}
\ExplSyntaxOn
\text_lowercase:n { \NoCaseChange { iPhone } ~ iPhone }
\ExplSyntaxOff
\end{document}
jspitz commented 4 years ago

My cons on \NoCaseChange, apart from the textcase usage, would be that it is rather hard to type (given that case protection might have to be used a lot) and that it does not really fit the biblatex style of command naming. But this is just a minor note. I will be happy with whatever you come up with, as this is definitely a huge improvement over the status quo.

josephwright commented 4 years ago

@jspitz Like I said, the point here is there is nothing 'baked in' to expl3. So you can pick whatever name you feel is best: it could be something very short, though I'm not sure that's a great plan. The expl3 mechanism works with whatever commands it's 'fed'. Obvious short-ish but descriptive name is \FixedCase.
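As a sketch of what 'feeding' a different name to expl3 could look like, the earlier \NoCaseChange example can be repeated with \FixedCase. The definition of \FixedCase as a simple pass-through and the wrapper name \DemoLowercase are assumptions made for this example only.

```latex
\documentclass{article}
\usepackage{expl3}

% Assumed definition: the marker itself does nothing when typeset
\newcommand*{\FixedCase}[1]{#1}

\ExplSyntaxOn
% Register the marker with the expl3 case changer, as done for
% \NoCaseChange in the earlier example
\tl_put_right:Nn \l_text_expand_exclude_tl   { \FixedCase }
\tl_put_right:Nn \l_text_case_exclude_arg_tl { \FixedCase }

% Document-level wrapper so no expl3 syntax is needed in the body
\newcommand*{\DemoLowercase}{\text_lowercase:n}
\ExplSyntaxOff

\begin{document}
% The marked "iPhone" should keep its case; the rest is lowercased
\DemoLowercase{\FixedCase{iPhone} and IPHONE}
\end{document}
```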

plk commented 4 years ago

Is the intention here to have this macro in the bibtex source data? If so, since it is enforcing a user change, at least for biber, I would prefer dedicated data annotations which pick out specific words or character ranges for protection, leaving no macros in the data. This also allows another level of abstraction for selecting different macros in the style.

jspitz commented 4 years ago

@josephwright my comment was probably more addressed to @moewew (or whoever is going to implement this to biblatex). @plk yes that's the intention. And I agree that macros in the data should be avoided (see my proposal in https://github.com/plk/biblatex/issues/960#issuecomment-578476374), but I suppose it cannot be avoided completely.

josephwright commented 4 years ago

@plk As already noted, it's quite possible to take the existing BibTeX brace approach and massage that into something with more reasonable mark-up.

I can suggest code to pick out words/phrases and protect them: single words are relatively easy, picking out a word in a phrase slightly less so (but still doable). We might have something like

\text_titlecase:n
  { \text_autocase:nn { <language> } { <input> } }

where the 'autocase' stuff could be a biblatex-specific macro, as more generally explicit markup is safer.

moewew commented 4 years ago

Actually I would prefer to keep Biber out of this as far as possible. I only asked about the feasibility of Biber doing something here because David brought it up.

plk commented 4 years ago

I bring it up only because the data annotation feature was partly designed to keep formatting macros out of bibtex data.

davidcarlisle commented 4 years ago

I'd agree that avoiding explicit TeX markup in bib files is a good aim, although annotating the field via character counts seems a bit fragile as an editing experience. Perhaps the classic bibtex brace markup is a reasonable compromise in the end. Its main downside, introducing markup like {\'e} for accented letters which then disturb interletter kerns, isn't really an issue these days as you can tell users to avoid that and use é or even \'e instead.

josephwright commented 4 years ago

An explicit mark-up is always going to be more robust than an implicit one like the brace approach: we went for explicit mark-up in \text_titlecase:n for a reason. But I see that BibTeX files are somewhat tricky as they are not 'just LaTeX' sources. So we probably need to support the brace approach, as suggested above. (The issue with braces is that \foo{bar} might be a macro that takes an argument, or it might be a letter-like command that is followed by case-fixed input, and we can't be sure.)

Looks like what we want is

We've got the first two, I think the last is doable, that addresses the use cases and makes people reasonably happy, no?

davidcarlisle commented 4 years ago

@josephwright that's why in the sketch 2e code above I just checked for a brace group preceded by a space or start of string, it's not fully compatible with legacy bibtex markup but it does mean that \textit{zzz} braces don't get picked up. the vast majority of cases where you want to preserve case are surely whole words.

plk commented 4 years ago

I'm all for a macro instead of just braces - one of the main problems for the TeX->UTF-8 conversion that biber does on every datasource is trying to distinguish braces that need leaving in for protection from braces delimiting macro arguments. Things that make this easier or eliminate all markup from data, I am in favour of.

josephwright commented 4 years ago

@plk I think we are all in agreement and expl3 here provides a good way forward (as it's supposed to); will want to test carefully but should be a good use of the new(ish) code

moewew commented 4 years ago

I've run a few tests with @josephwright's code from https://github.com/plk/biblatex/issues/960#issuecomment-578435859 and I am really impressed by and happy with the result so far.

I'm hoping to switch to the expl3 case changing code for the next release. The question is whether we should offer the old code as an opt-in for backwards compatibility (for people who don't have a current expl3 or who relied on one of the really obscure behaviours of the old code that is not present in the expl3 code) or drop it entirely.

For now I want to implement Joseph's BibTeX-to-LaTeX braces mapping, but give people an option to opt out if they prefer a more structured approach that doesn't use braces but a dedicated no-case macro.