Open pauloney opened 2 months ago
Hmmm, we already have some .lbx files that require UTF-8 and in theory I like UTF-8 more than TeX macros (plus that's what I usually recommend for .bib files as well). But there might still be people who use encodings other than UTF-8 that are not backwards compatible with UTF-8, and those people could be in trouble if we recode existing files.
As for BCP codes you are aware of the discussions in https://github.com/plk/biblatex/issues/160 and https://github.com/plk/biblatex/issues/961 and also https://github.com/plk/biblatex/issues/1362. At the moment everything we do in .lbx files is based around babel identifiers (so the language names we use are the ones used/supported [possibly historically] by babel), because that made things straightforward to implement with babel. We have polyglossia support bolted onto this by bugging the polyglossia developers to add "babel translations". If we wanted to use a different system, we'd have to translate BCP-47 to babel identifiers, and that would just be more work for no real gain on our end. That said, babel did introduce an alternative system of loading languages (https://github.com/plk/biblatex/issues/1362), which we can't deal with at the moment, so we might have to think about a different scheme at some point. But we would probably need an effort by all localisation packages (babel, polyglossia) to introduce a suitable interface that works well in all cases. I imagine something (obviously because it is less work for us) where packages can use a standardised language name (BCP-47, some ISO, what have you) and babel and polyglossia then translate this into their respective internal names as needed.
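To make the "translate BCP-47 to babel identifiers" step concrete, here is a minimal sketch of such a lookup written with expl3. The table name, the macro \DemoBabelName and the three example pairs are purely illustrative assumptions; nothing like this exists in biblatex, babel or polyglossia under these names.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical lookup table: BCP-47 tag -> babel identifier.
% The entries are examples only, not an authoritative mapping.
\prop_new:N \g_demo_bcp_babel_prop
\prop_gput:Nnn \g_demo_bcp_babel_prop { en-GB } { british }
\prop_gput:Nnn \g_demo_bcp_babel_prop { en-US } { american }
\prop_gput:Nnn \g_demo_bcp_babel_prop { de-DE } { ngerman }
% Expandable accessor; expands to nothing for tags not in the table.
\cs_new:Npn \DemoBabelName #1
  { \prop_item:Nn \g_demo_bcp_babel_prop {#1} }
\ExplSyntaxOff
\begin{document}
The tag en-GB maps to the babel name \DemoBabelName{en-GB}.
\end{document}
```

A table like this would have to be kept in sync by hand with every localisation package, which is the extra work the comment above wants to avoid.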
I'd say we need (as a community) to be making these kinds of moves: see what we did with the kernel to shift to UTF-8 but leave a way 'out'. Really there should be very few non-UTF-8 docs nowadays, and people probably need to be told that they have to be re-encoded.
BCP-47 is more tricky, but it's what we've moved to for the language-dependent code in expl3, so I think it's also the direction of travel.
Perhaps I should ask other team members to take a look too?
I agree that UTF-8 is the way to go, but if we were to recode existing files here, old documents in latin1 or what have you could break with no easy way out except recoding. In this case I don't really see a huge advantage of recoding files to UTF-8 for us, so I think it would be a net negative. For new features I'm all for using only UTF-8 as it is usually easier to digest and makes things easier (or even possible).
BCP-47 would be great if there were an interface we as package maintainers could use. At the moment we have to piece together babel and polyglossia interfaces and sometimes internals...
> BCP-47 would be great if there were an interface we as package maintainers could use. At the moment we have to piece together babel and polyglossia interfaces and sometimes internals...
We've made a start with \BCPdata; this mechanism likely needs extending. My view is that we should in the end have the core data all managed by the kernel, with both babel and polyglossia using defined kernel interfaces to add/read.
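For readers who have not met \BCPdata, the following minimal sketch shows how it can be queried from a document. It assumes a babel release recent enough to provide \BCPdata and uses the provide=* package option so that the locale data for the main language is read from its .ini file; the field names tag and main.tag are the ones I understand babel to support, so check the babel manual for your version.

```latex
\documentclass{article}
% provide=* asks babel to load the main language's .ini data as well
% (assumption: needed here so that the BCP-47 fields are populated).
\usepackage[english, provide=*]{babel}
\begin{document}
% Tag of the main document language.
Main language tag: \BCPdata{main.tag}.

% Tag of the language active at this point.
Current language tag: \BCPdata{tag}.
\end{document}
```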
> That said, babel did introduce an alternative system of loading languages (#1362), which we can't deal with at the moment, so we might have to think about a different scheme at some point.
One of the goals in babel is to deal with automated workflows, so that generated files can be loaded and run even without explicit language declarations in the preamble. In other words, if a generated file says something like \foreignlanguage{chinese}{..}, then the Chinese locale is loaded. Also, \babelfont predeclares fonts ‘just in case’, so that they are loaded only if necessary. There is a lot of room for improvements, so feel free to open an issue in babel to explain what you would need and discuss them.
> There is a lot of room for improvements, so feel free to open an issue in babel to explain what you would need and discuss them.
Thanks, I will once I have the time to investigate this properly.
For now I can probably only give you the big picture: We need a way to interface with language switching using a fixed naming system (so either using babel names like british or some BCP/ISO thing), so that no matter which system/names the user uses to call a language we always do the same. For the most part it would probably be enough if we could add stuff to the language captions and language extras (and noextras) (in old babel speak) and find out about the currently active language as well as the main document language (if any). If the new system does not distinguish between captions, extras and noextras, we would have to come up with a suitable alternative plan to ideally make things as uniform as possible (for the most part I think the distinction between captions and extras for biblatex is artificial and we could just as well combine the two, but we need the noextras cleanup, so ...).
What made https://github.com/plk/biblatex/issues/1362 difficult for us is that we're used to users saying they want a language called english and then we load a file called english.lbx and add code to captionsenglish, extrasenglish and noextrasenglish. From then on whenever the user switches to English our translations and extras get used as well. If the user uses \babelprovide[import=en]{quack}, we never get to see english, everything we have is en and we need a way to know that this is english and then we need the equivalents of adding translations for certain strings (as in captions) and more or less arbitrary init code.
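As an illustration of the classic hook mechanism described here, the sketch below adds a string to the captions and a flag to the extras/noextras of two babel languages using babel's \addto. It is not biblatex's actual code: \demostring and the \ifdemoextras switch are hypothetical names invented for this example.

```latex
\documentclass{article}
\usepackage[ngerman,english]{babel}% english is the main language
% Hypothetical localisable string, standing in for a bibliography string.
\newcommand*{\demostring}{}
\addto\captionsenglish{\renewcommand*{\demostring}{edited by}}
\addto\captionsngerman{\renewcommand*{\demostring}{herausgegeben von}}
% Hypothetical init/cleanup code tied to English only.
\newif\ifdemoextras
\addto\extrasenglish{\demoextrastrue}%    runs when English is selected
\addto\noextrasenglish{\demoextrasfalse}% runs when English is left
\begin{document}
\demostring

\selectlanguage{ngerman}
\demostring
\end{document}
```

Everything here is keyed on the names english and ngerman. A language created with \babelprovide[import=en]{quack} has neither name, which is why the hooks are never found without some way of mapping quack back to a known language.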
Basically \babelprovide[import=en]{quack} means ‘create a new language named quack and import babel-en.ini’. It doesn’t necessarily mean quack is an english language. But let’s assume it is. I’m still not yet sure what you want or need. Perhaps something like this?:
\documentclass{article}
\usepackage[english]{babel}
\babelprovide{spanish} % import=es by default
\babelprovide[import=es]{medievalspanish}
\babelprovide[import=es]{classicalspanish}
\LocaleForEach{%
\getlocaleproperty\babelname{#1}{identification/name.babel}%
\message{^^J --- #1 === \babelname}}
\stop
(Some locales provide several babel names. The preferred one is the first.) You may want to modify the imported file:
\babelprovide[import=es,
identification/tag.bcp47 = es-x-medieval,
identification/extension.x.tag.bcp47 = medieval]
{medievalspanish}
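Applied to the quack case raised earlier, the same \getlocaleproperty call can recover both the BCP-47 tag and the babel name(s) stored in the imported .ini file. This is only a sketch built from the commands shown above; \quacktag and \quackbabel are names chosen for this example.

```latex
\documentclass{article}
\usepackage[english]{babel}
\babelprovide[import=en]{quack}
% Read two identification fields of the locale "quack" into macros.
\getlocaleproperty\quacktag{quack}{identification/tag.bcp47}
\getlocaleproperty\quackbabel{quack}{identification/name.babel}
% Write the result to the terminal and log file.
\typeout{quack: tag=\quacktag, babel name(s)=\quackbabel}
\begin{document}
\end{document}
```

As noted above, the imported data describes the .ini file, not necessarily what the user means by quack, so a package would still be guessing when it treats quack as English on this basis.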
Would it be possible to convert the LBX files that are still in pidgin-TeX to UTF-8, and also rename them to BCP-47 tags?
I ask because we use the files in other suites like "text2bib", "bbl2bib", ... and have been thinking about using them "on the fly" so that when a new supported language comes up in BibLaTeX, we automatically support it.
Renaming the files to BCP-47 would help us identify families of languages more easily and support them appropriately, which is one of the reasons BCP-47 exists.
I can do the conversion of the files myself, especially if we have a suite of tests for before/after the change, but I do not understand how the files are called by biblatex.