Change from pidgin-TeX to UTF-8

pauloney commented 2 months ago

Would it be possible to convert the LBX files that are still in pidgin-TeX to UTF-8, and also to rename them using BCP-47 tags?

I ask because we use the files in other suites such as "text2bib" and "bbl2bib", and we have been thinking about using them "on the fly", so that when BibLaTeX gains support for a new language, we support it automatically.

Renaming the files to BCP-47 tags would make it easier for us to identify families of languages and support them appropriately, which is one of the reasons BCP-47 exists.

I can do the conversion myself, especially if there is a test suite to run before and after the change, but I do not understand how biblatex loads the files.

moewew commented 2 months ago

Hmmm, we already have some .lbx files that require UTF-8 and in theory I like UTF-8 more than TeX macros (plus that's what I usually recommend for .bib files as well). But there might still be people who use other encodings than UTF-8 that are not backwards compatible with UTF-8 and those people could be in trouble if we recode existing files.
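To make the trade-off concrete, here is a rough sketch of the same localisation entry in both forms. The entry is illustrative and not copied verbatim from any shipped .lbx file; `\DeclareBibliographyStrings` is the command .lbx files use for their string tables.

```latex
% Fragment of a hypothetical .lbx file (illustrative only).

% Pidgin-TeX: pure ASCII, so it reads the same under any input encoding.
\DeclareBibliographyStrings{%
  translator = {{\"Ubersetzer}{\"Ubers\adddot}},
}

% UTF-8: easier to read and to reuse from other tools, but the bytes are
% only interpreted correctly if the document is processed as UTF-8.
\DeclareBibliographyStrings{%
  translator = {{Übersetzer}{Übers\adddot}},
}
```

A document that loads `\usepackage[latin1]{inputenc}` would not read the two bytes of `Ü` in the second form as a single character, which is the kind of breakage a recoding could cause.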

As for BCP codes you are aware of the discussions in https://github.com/plk/biblatex/issues/160 and https://github.com/plk/biblatex/issues/961 and also https://github.com/plk/biblatex/issues/1362. At the moment everything we do in .lbx files is based around babel identifiers (so the language names we use are the ones used/supported [possibly historically] by babel), because that made things straightforward to implement with babel. We have polyglossia support bolted onto this by bugging the polyglossia developers to add "babel translations". If we wanted to use a different system, we'd have to translate BCP-47 to babel identifiers - and that would just be more work for no real gain on our end. That said, babel did introduce an alternative system of loading languages (https://github.com/plk/biblatex/issues/1362), which we can't deal with at the moment, so we might have to think about a different scheme at some point. But we would probably need an effort by all localisation packages (babel, polyglossia) to introduce a suitable interface that works well in all cases. I imagine something (obviously because it is less work for us) where packages can use a standardised language name (BCP-47, some ISO, what have you) and babel and polyglossia then translate this into their respective internal names as needed.
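As a minimal sketch of what "translating BCP-47 to babel identifiers" could look like on the package side, the following keeps a small lookup table. The macros `\demoDeclareTag` and `\demoBabelName` are hypothetical names invented for this illustration; they are not part of biblatex, babel or polyglossia, and the three mappings are only examples.

```latex
\documentclass{article}

\makeatletter
% Hypothetical helper: store a babel name under a BCP-47 tag.
\newcommand*\demoDeclareTag[2]{\@namedef{demo@bcp@#1}{#2}}
% Hypothetical helper: return the babel name for a tag,
% or the tag itself if no mapping is known.
\newcommand*\demoBabelName[1]{%
  \@ifundefined{demo@bcp@#1}{#1}{\@nameuse{demo@bcp@#1}}}
\makeatother

% Example mappings (a real table would have to cover every supported language).
\demoDeclareTag{en-GB}{british}
\demoDeclareTag{de-DE}{ngerman}
\demoDeclareTag{es}{spanish}

\begin{document}
en-GB maps to \demoBabelName{en-GB}.

An unmapped tag is returned unchanged: \demoBabelName{xx-YY}.
\end{document}
```

Every package needing BCP-47 input would have to maintain such a table itself, which is the extra work a shared interface in babel and polyglossia would avoid.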

josephwright commented 2 months ago

I'd say we need (as a community) to be making these kinds of moves: see what we did with the kernel to shift to UTF-8 but leave a way 'out'. Really there should be very few non-UTF-8 docs nowadays, and people probably need to be told that they have to be re-encoded.

BCP-47 is more tricky, but it's what we've moved to for the language-dependent code in expl3, so I think it's also the direction of travel.

Perhaps I should ask other team members to take a look too?

moewew commented 2 months ago

I agree that UTF-8 is the way to go, but if we were to recode existing files here, old documents in latin1 or what have you could break with no easy way out except recoding. In this case I don't really see a huge advantage of recoding files to UTF-8 for us, so I think it would be a net negative. For new features I'm all for using only UTF-8 as it is usually easier to digest and makes things easier (or even possible).

BCP-47 would be great if there were an interface we as package maintainers could use. At the moment we have to piece together babel and polyglossia interfaces and sometimes internals...

josephwright commented 2 months ago

> BCP-47 would be great if there were an interface we as package maintainers could use. At the moment we have to piece together babel and polyglossia interfaces and sometimes internals...

We've made a start with \BCPdata, this mechanism likely needs extending - my view is that we should in the end have the core data all managed by the kernel, with both babel and polyglossia using defined kernel interfaces to add/read.
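For readers who have not seen `\BCPdata`, a minimal sketch of querying it follows. It assumes a LaTeX kernel and a babel release recent enough to provide the command, and the field names (`tag`, `language`, `main.tag`) are given as I understand them from the babel documentation; treat the details as assumptions to check against your installation.

```latex
\documentclass{article}
\usepackage[british]{babel}

\begin{document}
% \BCPdata takes a field name and expands to the corresponding
% piece of BCP-47 data for the current (or main) language.
Full tag: \BCPdata{tag}\par
Language subtag: \BCPdata{language}\par
Main language tag: \BCPdata{main.tag}
\end{document}
```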

jbezos commented 1 month ago

> That said, babel did introduce an alternative system of loading languages (#1362), which we can't deal with at the moment, so we might have to think about a different scheme at some point.

One of the goals in babel is to deal with automated workflows, so that generated files can be loaded and run even without explicit language declarations in the preamble. In other words, if a generated file says something like \foreignlanguage{chinese}{..}, then the Chinese locale is loaded. Also, \babelfont predeclares fonts ‘just in case’, so that they are loaded only if necessary. There is a lot of room for improvements, so feel free to open an issue in babel to explain what you would need and discuss them.

moewew commented 1 month ago

> There is a lot of room for improvements, so feel free to open an issue in babel to explain what you would need and discuss them.

Thanks, I will once I have the time to investigate this properly.

For now I can probably only give you the big picture: We need a way to interface with language switching using a fixed naming system (so either using babel names like british or some BCP/ISO thing), so that no matter which system/names the user uses to call a language we always do the same. For the most part it would probably be enough if we could add stuff to the language captions and language extras (and noextras) (in old babel speak) and find out about the currently active language as well as the main document language (if any). If the new system does not distinguish between captions, extras and noextras, we would have to come up with a suitable alternative plan to ideally make things as uniform as possible (for the most part I think the distinction between captions and extras for biblatex is artificial and we could just as well combine the two, but we need the noextras cleanup, so ...).

What made https://github.com/plk/biblatex/issues/1362 difficult for us is that we're used to users saying they want a language called english and then we load a file called english.lbx and add code to captionsenglish, extrasenglish and noextrasenglish. From then on whenever the user switches to English our translations and extras get used as well. If the user uses \babelprovide[import=en]{quack}, we never get to see english, everything we have is en and we need a way to know that this is english and then we need the equivalents of adding translations for certain strings (as in captions) and more or less arbitrary init code.
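The classic babel mechanism referred to here can be sketched as follows. This is not biblatex's actual internal code, only an illustration of the captions/extras/noextras hooks using babel's `\addto`; the macros `\demoEditorString` and `\demoDateStyle` are hypothetical stand-ins for a localisation string and a language-specific setting.

```latex
\documentclass{article}
\usepackage[ngerman,english]{babel} % english (last option) is the main language

% Hypothetical macros standing in for a biblatex string and a setting.
\newcommand*\demoEditorString{}
\newcommand*\demoDateStyle{default}

% "captions": translations that become active with the language
\addto\captionsenglish{\renewcommand*\demoEditorString{edited by}}
\addto\captionsngerman{\renewcommand*\demoEditorString{herausgegeben von}}

% "extras": arbitrary init code run when switching to the language
\addto\extrasenglish{\renewcommand*\demoDateStyle{english}}
% "noextras": cleanup run when switching away from the language
\addto\noextrasenglish{\renewcommand*\demoDateStyle{default}}

\begin{document}
\demoEditorString\ (\languagename, \demoDateStyle)

\selectlanguage{ngerman}
\demoEditorString\ (\languagename, \demoDateStyle)
\end{document}
```

The whole scheme depends on knowing the hook names in advance (`\captionsenglish` and so on), which is why a language created under an arbitrary name such as quack is hard to handle.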

jbezos commented 1 month ago

Basically \babelprovide[import=en]{quack} means ‘create a new language named quack and import babel-en.ini’. It doesn’t necessarily mean quack is an English language. But let’s assume it is. I’m still not sure what you want or need. Perhaps something like this?

\documentclass{article}

\usepackage[english]{babel}

\babelprovide{spanish}      % import=es by default
\babelprovide[import=es]{medievalspanish}
\babelprovide[import=es]{classicalspanish}

\LocaleForEach{%
  \getlocaleproperty\babelname{#1}{identification/name.babel}%
  \message{^^J --- #1 === \babelname}}

\stop

(Some locales provide several babel names. The preferred one is the first.) You may want to modify the imported file:

\babelprovide[import=es,
    identification/tag.bcp47 = es-x-medieval,
    identification/extension.x.tag.bcp47 = medieval]
  {medievalspanish}

plk / biblatex

Change from pidgin-TeX to UTF-8 #1364