plk / biber

Backend processor for BibLaTeX
Artistic License 2.0

Brace stripping for diacritics #316

Closed by moewew 2 years ago

moewew commented 4 years ago

I know it is a painful subject, but I'd like to bring it up for hopefully one last time (see also https://github.com/plk/biber/issues/297#issuecomment-583756230). Especially since we now have expl3 case change in biblatex (https://github.com/plk/biblatex/pull/1005).

Biber already correctly strips the outer braces in constructs such as

{\'I} -- {\v C}

so that they appear in the .bbl only as

Í -- Č

But currently the additional inner/argument braces in the equivalent version

{\' {I}} -- {\v{C}}

are not stripped, leaving us with

{Í} -- {Č}

This has negative effects for case protection. With classical BibTeX neither form is case protected, but because Biber does not strip the inner braces, it accidentally brace-protects the latter form.

Compare

\documentclass[british]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\begin{filecontents}[force]{\jobname.bib}
@article{lorem:a,
  title   = {ALorem and {{\v C}esk\'a} republika},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:b,
  title   = {BLorem and {{\v{C}}esk\'{a}} republikb},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:c,
  title   = {CLorem and {\v C}esk\'a republikc},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:d,
  title   = {DLorem and {\v{C}}esk\'{a} republikd},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:e,
  title   = {ELorem and {{\v C}}esk\'a republikc},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:f,
  title   = {FLorem and {{\v{C}}}esk\'{a} republikd},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:a,
  title   = {AIpsum and {{\'I}sland}},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:b,
  title   = {BIpsum and {{\'{I}}sland}},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:c,
  title   = {CIpsum and {\'I}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:d,
  title   = {DIpsum and {\'{I}}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:e,
  title   = {EIpsum and {{\'I}}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:f,
  title   = {FIpsum and {{\'{I}}}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
\end{filecontents}

\begin{document}
\nocite{*}
\bibliographystyle{plain}
\bibliography{\jobname}
\end{document}

[Screenshot of the typeset bibliography]

with (run with biblatex 3.15 dev for the expl3 case changer; the latex2e implementation has some quirks with non-ASCII chars in pdfLaTeX. Alternatively, use a Unicode engine)

\documentclass[british]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}
\usepackage[casechanger=expl3]{biblatex}

\DeclareFieldFormat{titlecase}{\MakeSentenceCase*{#1}}

\begin{filecontents}[force]{\jobname.bib}
@article{lorem:a,
  title   = {ALorem and {{\v C}esk\'a} republika},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:b,
  title   = {BLorem and {{\v{C}}esk\'{a}} republikb},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:c,
  title   = {CLorem and {\v C}esk\'a republikc},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:d,
  title   = {DLorem and {\v{C}}esk\'{a} republikd},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:e,
  title   = {ELorem and {{\v C}}esk\'a republikc},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{lorem:f,
  title   = {FLorem and {{\v{C}}}esk\'{a} republikd},
  author  = {Anne Uthor},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:a,
  title   = {AIpsum and {{\'I}sland}},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:b,
  title   = {BIpsum and {{\'{I}}sland}},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:c,
  title   = {CIpsum and {\'I}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:d,
  title   = {DIpsum and {\'{I}}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:e,
  title   = {EIpsum and {{\'I}}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
@article{ipsum:f,
  title   = {FIpsum and {{\'{I}}}sland},
  author  = {Anne N. Other},
  journal = {Journal},
  year    = {1995},
  volume  = {352},
}
\end{filecontents}
\addbibresource{\jobname.bib}

\begin{document}
\nocite{*}
\printbibliography
\end{document}

[Screenshot of the typeset bibliography]

Note how the D cases are protected against case change with Biber, while they are not case protected with BibTeX. If Biber were to strip the additional braces here, we would get the same result as in BibTeX (compare the C cases).

plk commented 4 years ago

I had an idea about this I've not tried before - now implemented in DEV.

moewew commented 4 years ago

All of 'my' cases and the ADS export (cf. #297) work brilliantly!

I don't know if it's related, but https://github.com/plk/biblatex/issues/727 https://github.com/plk/biber/issues/216 seems to be an issue again.

\documentclass[british]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\usepackage[style=authoryear, backend=biber]{biblatex}

\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@online{fontconfig,
  author ={{\texttt{freedesktop.org}}},
  sortname = {freedesktop},
  title = {Fontconfig},
  subtitle = {A library for configuring and customizing font access},
  date = {2016-06-15},
  urldate={2017-03-18},
  url = {https://www.freedesktop.org/wiki/Software/fontconfig/}
}
@online{wikipedia,
  author    = {{\WikipediA}},
  sortlabel = {Wikipedia},
  sortname  = {Wikipedia},
  title     = {Lucida},
  date      = {2016-10-19},
  urldate   = {2017-04-03},
  url       = {https://en.wikipedia.org/wiki/Lucida},
}

@online{features,
  author    = {{\WikipediA}},
  sortlabel = {Wikipedia},
  sortname  = {Wikipedia},
  title     = {List of typographic features},
  date      = {2017-02-21},
  urldate   = {2017-03-24},
  url       = {https://en.wikipedia.org/wiki/List_of_typographic_features},
}
\end{filecontents}

\def\WikipediA{wikipedia}

\addbibresource{\jobname.bib}

\begin{document}
\nocite{*}
\printbibliography
\end{document}

gives bad familyi's

      \name{author}{1}{}{%
        {{un=0,uniquepart=base,hash=f54ec09db02860f10fd50e2ce18d24db}{%
           family={{\texttt{freedesktop.org}}},
           familyi={f\bibinitperiod}}}%
      }
...
      \name{author}{1}{}{%
        {{un=0,uniquepart=base,hash=95387a9e1a6bf37286493c821a0b17da}{%
           family={{\WikipediA}},
           familyi={}\bibinitperiod}}}%
      }
...
      \name{author}{1}{}{%
        {{un=0,uniquepart=base,hash=95387a9e1a6bf37286493c821a0b17da}{%
           family={{\WikipediA}},
           familyi={}\bibinitperiod}}}%
      }

plk commented 4 years ago

Please try 2.15 now - this was really something in the initials generation for edge cases.

moewew commented 4 years ago

The errors are gone!

I'm a bit concerned about the change from

\field{title}{Signs of W$\frac{o}{a}$nder}

to

\field{title}{Signs of W$\frac{o}a$nder}

but the versions are equivalent (even though the former is much nicer).

plk commented 4 years ago

That is a bit ugly but only happens with single characters in braces, which should be equivalent. Multi-argument macros are a huge pain to deal with here but are rare enough.

schoeps commented 4 years ago

I think one of the consequences of your recent changes is that you remove the curly brackets around all single characters (tested with the current 2.15 dev). This is not a good idea, e.g.

@BOOK{xxx,
  AUTHOR = {X, Y},
  DATE = {2020},
  TITLE = {Part {I}},
}

will be turned into

@BOOK{xxx,
  AUTHOR = {X, Y},
  DATE = {2020},
  TITLE = {Part I},
}

by biber --quiet --tool --outfile test2.bib test.bib. As a result, "I" may become "i" in some citation styles, although the brackets should have prevented this in the first place.

plk commented 4 years ago

This is a constant problem I'm afraid - the latex decoding in biber is syntactical and it's impossible to cover all cases. Generally, things like part numbers should be in fields like volume or number. Another way around this specific example is to use double-braces: {{I}}. This edge-case only occurs with single glyphs in braces and doesn't affect, for example {II}.
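
A minimal sketch of the two workarounds, assuming the 2.15 dev behaviour reported above. The entry keys and the "Collected Works" title are made up for illustration, and what actually ends up in the .bbl depends on the Biber build in use.

```latex
\documentclass[british]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}
\usepackage[backend=biber]{biblatex}

% Apply sentence casing to titles so that brace protection matters.
\DeclareFieldFormat{titlecase}{\MakeSentenceCase*{#1}}

% part:single -- single braces, the case reported above
% part:double -- double braces, the suggested workaround
% part:field  -- part number moved out of the title into a dedicated field
\begin{filecontents}[force]{\jobname.bib}
@book{part:single,
  author = {X, Y},
  date   = {2020},
  title  = {Part {I}},
}
@book{part:double,
  author = {X, Y},
  date   = {2020},
  title  = {Part {{I}}},
}
@book{part:field,
  author = {X, Y},
  date   = {2020},
  title  = {Collected Works},
  volume = {1},
}
\end{filecontents}
\addbibresource{\jobname.bib}

\begin{document}
\nocite{*}
\printbibliography
\end{document}
```

Comparing the title fields of part:single and part:double in the generated .bbl shows whether the braces survived.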

moewew commented 4 years ago

I know this is a pain, but is there absolutely no chance this might work?

I think the rule here is that brace stripping should only happen if the braced contents start with a backslash, or more precisely with a macro encoding the macro version of a non-ASCII char.

I can write up some more systematic tests if you like.

schoeps commented 4 years ago

I am not sure what your interpretation of "constant" is. The problem was not present in previous versions (let's say a few days ago?) and it breaks compatibility with bibtex. I have several books, journals and papers in my bibtex-file that have single capital letters in their respective titles, either due to numbering or units, e.g. "T" for Tesla. I guess removing brackets only when they follow a backslash should be safe (since titlecase does not matter anyway here).

moewew commented 4 years ago

Ah, one other case in which we probably want braces removed is when an empty group follows a macro that Biber encoded into a UTF-8 character (it appears possible to add code to capture the empty group to the code that searches these macros and replaces them with UTF-8 chars).

plk commented 4 years ago

The problem is that fixing one edge case breaks others fairly reliably in this area because we are doing this with (very complex) regexps as there is no other way short of having a full TeX parser in there. It looks simple to differentiate between \'{I} and {I} but since we have no semantics here, there is no way of knowing if a preceding macro takes any arguments or how many arguments it takes. This is always going to be a hack - so far it has been a reasonably good one but the edge cases mount and the new capitalisation code meant we had to shift the problem elsewhere. I don't like it either and will have to think about it more.
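
To illustrate why a purely syntactic approach struggles here, the following sketch puts a few superficially similar brace groups side by side. It only illustrates the TeX side and says nothing about how Biber's regexps are actually written.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}

\begin{document}
% One-argument accent macro: the braces only delimit the argument,
% so both forms typeset the same glyph.
\'{I} \'I

% Plain group with no preceding macro: in a .bib title these braces
% are what protects the letter from case changing.
Part {I}

% Two-argument macro: both groups are arguments. Dropping the braces
% around the single-character second argument is still valid TeX,
% because an unbraced single token is accepted as an argument.
$\frac{o}{a}$ $\frac{o}a$

% Macro that takes no argument: the group after \& is independent,
% even though it looks just like an argument to a pattern matcher.
CG\&{A}
\end{document}
```

All four lines contain the pattern "macro or text, then a braced single character", but only in the first is the group safely removable without knowing anything about the macro in front of it.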

moewew commented 4 years ago

I'd have thought that it might be possible to get the few cases right that we need to get right. If we ignore initial generation, which is problematic with macros anyway, for the moment, we don't need brace stripping for all macros, we only need it for diacritic-macros like \'. Biber already knows that \' does something to the next char and can already ignore the braces around it.

plk commented 4 years ago

I'll have a look - it's just a little unsatisfactory having to break out into special cases when I've tried to keep this module as general as possible but I don't think there is much choice even though it makes things harder to maintain ...

moewew commented 4 years ago

Here are a few test cases I compiled. I will probably add to this if I find more, but I wanted to post it now in case I forget about it.

\documentclass[british]{article}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\usepackage[style=authoryear, backend=biber]{biblatex}

\renewbibmacro*{finentry}{%
  \setunit{\newline}\printfield{verba}%
  \setunit{\newline}\printfield{verbb}%
  \finentry}

\begin{filecontents}[force]{\jobname.bib}
@book{appleby,
  author  = {Humphrey Appleby},
  title   = {Harmless uses {I} {II} {Humphrey Appleby} {H}umphrey {A}ppleby},
  verba   = {Raw: {I} {II} {Humphrey Appleby} {H}umphrey {A}ppleby},
  verbb   = {Expected: {I} {II} {Humphrey Appleby} {H}umphrey {A}ppleby},
  date    = {1981},
}
@book{bppleby:b,
  author  = {Humphrey Bppleby},
  title   = {Single letter (BibTeX style)
             {\"a} {\"{o}} {\v C} {\v{Z}}},
  verba   = {Raw: {\"a} {\"{o}} {\v C} {\v{Z}}},
  verbb   = {Expected: ä ö Č Ž},
  date    = {1982},
}
@book{bppleby:l,
  author  = {Humphrey Bppleby},
  title   = {Single letter (LaTeX style)
             \"a \"{o} \v C \v{Z}},
  verba   = {Raw: \"a \"{o} \v C \v{Z}},
  verbb   = {Expected: ä ö Č Ž},
  date    = {1982},
}
@book{cppleby:b,
  author  = {Humphrey Cppleby},
  title   = {Protected single letter (BibTeX style)
             {{\"a}} {{\"{o}}} {{\v C}} {{\v{Z}}}},
  verba   = {Raw: {{\"a}} {{\"{o}}} {{\v C}} {{\v{Z}}}},
  verbb   = {Expected: {ä} {ö} {Č} {Ž}},
  date    = {1983},
}
@book{cppleby:l,
  author  = {Humphrey Cppleby},
  title   = {Protected single letter (LaTeX style)
             -- doesn't exist},
  date    = {1983},
}
@book{dppleby:b,
  author  = {Humphrey Dppleby},
  title   = {Words (BibTeX) {\"a}s{\"a}n {\"{o}}l{\"{o}}n
             {\v C}e{\v c}en {\v{Z}}e{\v{z}}en},
  verba   = {Raw: {\"a}s{\"a}n {\"{o}}l{\"{o}}n
             {\v C}e{\v c}en {\v{Z}}e{\v{z}}en},
  verbb   = {Expected: äsän ölön Čečen Žežen},
  date    = {1984},
}
@book{dppleby:l,
  author  = {Humphrey Dppleby},
  title   = {Words (LaTeX) \"as\"an \"{o}l\"{o}n \v Ce\v cen \v{Z}e\v{z}en},
  verba   = {Raw: \"as\"an \"{o}l\"{o}n \v Ce\v cen \v{Z}e\v{z}en},
  verbb   = {Expected: äsän ölön Čečen Žežen},
  date    = {1984},
}
@book{eppleby:b,
  author  = {Humphrey Eppleby},
  title   = {Protected Words (BibTeX) {{\"a}s{\"a}n} {{\"{o}}l{\"{o}}n}
             {{\v C}e{\v c}en} {{\v{Z}}e{\v{z}}en}},
  verba   = {Raw: {{\"a}s{\"a}n} {{\"{o}}l{\"{o}}n}
             {{\v C}e{\v c}en} {{\v{Z}}e{\v{z}}en}},
  verbb   = {Expected: {äsän} {ölön} {Čečen} {Žežen}},
  date    = {1985},
}
@book{eppleby:l,
  author  = {Humphrey Eppleby},
  title   = {Protected Words (LaTeX) {\"as\"an} {\"{o}l\"{o}n}
             {\v Ce\v cen} {\v{Z}e\v{z}en}},
  verba   = {Raw: {\"as\"an} {\"{o}l\"{o}n}
             {\v Ce\v cen} {\v{Z}e\v{z}en}},
  verbb   = {Expected: {äsän} {ölön} {Čečen} {Žežen}},
  date    = {1985},
}
@book{fppleby,
  author  = {Humphrey Fppleby},
  title   = {Macros \emph{Hullo} $\frac{a}{b}$},
  verba   = {Raw: \emph{Hullo} $\frac{a}{b}$},
  verbb   = {Expected: \emph{Hullo} $\frac{a}{b}$},
  date    = {1986},
}
@book{gppleby:b,
  author  = {Humphrey Gppleby},
  title   = {Macros (BibTeX) \emph{H{\"u}llo} \emph{H{\"{e}}llo}
             \emph{{\v C}e{\v c}en} \emph{{\v{Z}}e{\v{Z}}en}},
  verba   = {Raw:  \emph{H{\"u}llo} \emph{H{\"{e}}llo}
             \emph{{\v C}e{\v c}en} \emph{{\v{Z}}e{\v{Z}}en}},
  verbb   = {Expected: \emph{Hüllo} and \emph{Hëllo} \emph{Čečen} \emph{Žežen}},
  date    = {1987},
}
@book{gppleby:l,
  author  = {Humphrey Gppleby},
  title   = {Macros (LaTeX) \emph{H\"ullo} \emph{H\"{e}llo}
             \emph{\v Ce\v cen} \emph{\v{Z}e\v{Z}en}},
  verba   = {Raw:  \emph{H\"ullo} \emph{H\"{e}llo}
             \emph{\v Ce\v cen} \emph{\v{Z}e\v{Z}en}},
  verbb   = {Expected:  \emph{Hüllo} and \emph{Hëllo} \emph{Čečen} \emph{Žežen}},
  date    = {1987},
}
@book{hppleby:b,
  author  = {Humphrey Hppleby},
  title   = {Protected macros (BibTeX) {\emph{H{\"u}llo}}{\emph{H{\"{e}}llo}}
             {\emph{{\v C}e{\v c}en}} {\emph{{\v{Z}}e{\v{Z}}en}}},
  verba   = {Raw: {\emph{H{\"u}llo}} {\emph{H{\"{e}}llo}}
             {\emph{{\v C}e{\v c}en}} {\emph{{\v{Z}}e{\v{Z}}en}}},
  verbb   = {Expected:  {\emph{Hüllo}} and {\emph{Hëllo}} {\emph{Čečen}} {\emph{Žežen}}},
  date    = {1988},
}
@book{hppleby:l,
  author  = {Humphrey Hppleby},
  title   = {Protected macros (LaTeX) {\emph{H\"ullo}} {\emph{H\"{e}llo}}
             {\emph{\v Ce\v cen}} {\emph{\v{Z}e\v{Z}en}}},
  verba   = {Raw: {\emph{H\"ullo}} {\emph{H\"{e}llo}}
             {\emph{\v Ce\v cen}} {\emph{\v{Z}e\v{Z}en}}},
  verbb   = {Expected: {\emph{Hüllo}} and {\emph{Hëllo}}  {\emph{Čečen}} {\emph{Žežen}}},
  date    = {1988},
}
\end{filecontents}
\addbibresource{\jobname.bib}

\begin{document}
\nocite{*}
\raggedright
\printbibliography
\end{document}

plk commented 4 years ago

Please try DEV now - I put a fix in which addresses single-char diacritic macros which fixes the Part {I} example.

schoeps commented 4 years ago

Thanks a lot! I checked the new binary with my rather large biblatex collection (50k lines) and there is only one (minor) issue with {IEEE} {CG}\&{A} (short title of the IEEE Computer Graphics and Applications journal): it becomes {IEEE} {CG}\&A. Not sure if it makes sense to hunt this particular issue down... personally, I can work around it, e.g. {IEEE} {CG\&A}.

plk commented 4 years ago

Hmm, that's a strange edge case indeed - there really should be a space there but that wouldn't help with something legitimate like \&\;{A}. Please try 2.15 now - I think this should also be resolved now.

moewew commented 4 years ago

It is astonishing how many edge cases come out of the woodwork here. All my tests look great (but it is becoming more and more obvious my tests don't even scratch the surface of what people have in their .bib files).

Just for the record, A\&B is valid TeX and differs from A\& B in output. A following space is skipped only after control words (control sequences made up of TeX letters, i.e. characters of catcode 11) and after the control space (a backslash followed by a space). Other control sequences consisting of a single non-letter character do not skip the following space.

\documentclass{article}

\begin{document}
A\&B

A\& B

A\,B

A\, B
\end{document}

schoeps commented 4 years ago

> Please try 2.15 now - I think this should also be resolved now.

Yes, perfect! I confirm that the new version does not strip any intended bracket. By the way, I have been using 2.15dev for quite some time now and I am very happy with it.... Thanks!

gerking commented 4 years ago

Using biber 2.15, I'm still facing the problem of #297 with double names such as Franz{-}Josef. Unfortunately, these braces are used by default in *.bib files exported from DBLP.

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{csquotes}
\usepackage[backend=biber]{biblatex}

\begin{filecontents}[force]{\jobname.bib}
@book{DBLP:books/daglib/0033267,
  editor    = {J{\"{u}}rgen Gausemeier and
               Franz{-}Josef Rammig and
               Wilhelm Sch{\"{a}}fer},
  title     = {Design Methodology for Intelligent Technical Systems, Develop Intelligent
               Technical Systems of the Future},
  publisher = {Springer},
  year      = {2014},
  url       = {https://doi.org/10.1007/978-3-642-45435-6},
  doi       = {10.1007/978-3-642-45435-6},
  isbn      = {978-3-642-45434-9},
  timestamp = {Tue, 16 May 2017 14:01:41 +0200},
  biburl    = {https://dblp.org/rec/books/daglib/0033267.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
\end{filecontents}

\addbibresource{\jobname.bib}

\begin{document}
\nocite{DBLP:books/daglib/0033267}
\printbibliography
\end{document}

moewew commented 4 years ago

@gerking Please see https://github.com/plk/biber/issues/329. But really I think these braces are excessive when they are added to all entries with hyphenated given names.

gerking commented 4 years ago

> @gerking Please see #329. But really I think these braces are excessive when they are added to all entries with hyphenated given names.

Thanks, and sorry for duplicating.