\MakeSentenceCase and friends doesn’t work with constructs like {\char"00E5}

hmalmedal commented 12 years ago

The question biblatex-apa and Unicode characters on TeX.SX demonstrates that the command \MakeSentenceCase doesn’t work for constructs such as {\char"00E5}.
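A minimal sketch for reproducing the report, assuming XeLaTeX and a biblatex installation; the sample title text is made up for illustration and is not taken from the TeX.SX question. Whether the problem still appears will depend on the biblatex and LaTeX kernel versions in use.

```latex
% Compile with XeLaTeX (the \char"00E5 notation needs a Unicode engine).
\documentclass{article}
\usepackage{fontspec}
\usepackage{biblatex}
\begin{document}
% Hypothetical title: the braced group holds the \char construct from the report.
\MakeSentenceCase{A Title With {\char"00E5} Inside}
\end{document}
```

Calling `\MakeSentenceCase` directly in the document body keeps the test independent of any particular bibliography style or `.bib` file.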

josephwright commented 12 years ago

I'll take a look at this: I think it's in my 'area' for biblatex :-)

josephwright commented 12 years ago

This is actually at least in part due to the standard behaviour of \MakeLowercase (which does the case changing): try for example

\MakeLowercase{{\char"00E5}{\char"00E6}}

which will fail. The way biblatex approaches this issue is to assume that groups which start with a control sequence need case changing, but those that do not should be left alone. Thus we can fix the behaviour with

\long\def\blx@mksc@ingroup#1&#2{%
  \if\noexpand\@let@token\relax
    \if\@let@token\char
      \blx@mksc@nocase{{#2}}%
    \else
      \blx@mksc@locase{{#2}}%
    \fi
  \else
    \blx@mksc@nocase{{#2}}%
  \fi}

with the only question therefore being whether we want to support this. Using \char does seem a bit odd in a world of UTF-8 input.
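As a hedged illustration of the underlying failure: this sketch assumes the classic LaTeX2e kernel, where `\MakeLowercase` ends up applying TeX's `\lowercase` primitive to its argument (newer kernels implement case changing differently, so this may not carry over). `\lowercase` rewrites character tokens, including the hexadecimal digit `E`, and TeX only accepts uppercase `A` to `F` after `"`.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\begin{document}
% Untouched: TeX reads "00E5 as a hexadecimal number and typesets U+00E5.
{\char"00E5}

% \lowercase turns the tokens into \char"00e5. A lowercase e is not a hex
% digit, so the number ends at "00 and the characters e5 are left as text.
\lowercase{{\char"00E5}}
\end{document}
```

The second line therefore does not produce the intended character, which is why leaving such groups untouched (the `\blx@mksc@nocase` branch) avoids the problem.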

demdren commented 12 years ago

Since I'm the guy who raised the issue on tex stackexchange, I would like to suggest that you do want to support this. Back when I used MSWord and I would submit papers both in .doc and .pdf formats, the receivers/editors would often guess what characters I had used, and they would try to somewhat replicate them using the fonts they had. As a result, I would often need to send them letters pointing out character by character what the Unicode codes were, so that they would get it right.

The advantage of Unicode in XeLaTeX is that characters can be defined directly with the Unicode code in the document. Anyone who receives my latex documents will therefore easily be able to identify what the intended characters are. As an example, if I write a document where I use the International Phonetic Alphabet (IPA) symbol for the retroflex flap, I will include the following line in my preamble:

\newcommand{\IPAflap}{\char"027D}

Then the receivers will know exactly what character is meant by \IPAflap when I use it. They don't need to rely on having the same fonts as me, and it can easily be read in any text editor on any system. I consider this a huge benefit of XeLaTeX, and this was actually the main reason I decided to migrate from BibTeX to biblatex, since biblatex is supposed to support Unicode.

Finally, it is also a huge benefit for myself. Fonts that contain many unusual characters that I use are typically serif fonts, but I think very few people like to use serif fonts in their text editors. I surely don't. I like to use fonts like Lucida Console. Fonts of this type don't have many of these characters, which means that I actually wouldn't be able to display them in my text editor, even though it supports UTF-8. I wouldn't want my preamble to display lines such as:

\newcommand{\IPAflap}{�}

Not to mention that I would not want my receivers to see this either, since that would leave them guessing again what the character really is.

plk commented 12 years ago

Biber automatically tries to convert any character macros into Unicode so that sorting will work. It did not, however, convert these \char macros. In the beta of version 1.2 on SourceForge, it now does. I think this is probably the way to deal with this. It does mean that in the .bbl, these will have already been converted to UTF-8 which is correct as otherwise sorting, substring operations etc. won't work at all.

josephwright commented 12 years ago

On why I'm not sure about altering the LaTeX code, there are a couple of things. First, I'm wary of picking out individual control sequences as 'special cases', as it makes the documentation more tricky. Secondly, I'm wary of using \char directly in a document, and would strongly favour creating proper definitions for these, e.g.

\chardef\IPAflap="027D %

in the preamble.

PLK has indicated that for Biber this can be solved at the Biber end (which I note won't work with \chardefed tokens). I'm still not sure what the correct fix is when using BibTeX.
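A small sketch of the `\chardef` approach in a complete document, assuming XeLaTeX or LuaLaTeX; the font name is only an assumption and should be replaced by any installed font that contains U+027D. It uses the `\lowercase` primitive directly to show that a `\chardef`ed token is not altered by case changing, since `\lowercase` only touches character tokens.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
% Assumption: a font containing U+027D is installed; this name is illustrative.
\setmainfont{Charis SIL}
% \IPAflap becomes a single unexpandable token for code point U+027D.
\chardef\IPAflap="027D
\begin{document}
Plain use: \IPAflap

% Only the letters are lowercased; the control sequence \IPAflap is unchanged.
\lowercase{UPPER \IPAflap\ TEXT}
\end{document}
```

The code point stays readable in the preamble as plain ASCII, which addresses the concern about editors that cannot display the glyph, though as noted above Biber will not convert such tokens when they appear in a `.bib` file.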

plk commented 12 years ago

I'm not sure I'd even try to get this to work with bibtex - using latex macros of any sort in the .bib file makes general sorting in principle impossible. Biber deals with this by converting everything to UTF-8 before doing anything. BibTeX can't do that so even if you change the case macros, sorting would never work.

josephwright commented 12 years ago

As PLK says, it's not clear that supporting this for BibTeX is actually desirable. I also notice that no-one else has spotted this to date, and biblatex has been around a while now. On the other hand, the fix I suggest will at least make things work 'to some extent'. But it would presumably only be done in the BibTeX branch, which I don't like very much.

plk commented 12 years ago

If the OP can use biber, then it should be fine. I'd rather have this work completely and ask people to use biber. Can the OP try biber 1.2 (which requires biblatex 2.1) and update this ticket?

demdren commented 12 years ago

Just to be clear, I am using both biber and biblatex, and I am not asking or expecting anything to be done to make this work with BibTeX, since (as I understand it) people won't expect BibTeX to support Unicode anyway. If the next version of biber will fix the issue I raised (as plk says it will), I am happy :) Since I'm a novice user of TeX, I prefer to wait until a stable release of biber 1.2 has been released for texlive rather than install a beta.

plk commented 12 years ago

Good. biblatex 2.1 and biber 1.1 are both released and will be in TL soon. You are fairly safe installing biber 1.2 on top of these - the only changes in there at the moment are this one and something unrelated about range parsing.

plk / biblatex

\MakeSentenceCase and friends doesn’t work with constructs like {\char"00E5} #24