Open pauloney opened 10 years ago
General instructions can be found at the old SF wiki for biblatex (edit: that page now has an updated version on the GitHub wiki, please use: https://github.com/plk/biblatex/wiki/Checklist-for-submitting-a-new-localisation-file-(.lbx)). For testing, see example 03-localization-keys in the documentation. You can use english.lbx as a starting point. Just complete \DeclareBibliographyStrings; we take care of the rest on the basis of your answers to the questions listed in the wiki.
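For orientation, a localisation module has roughly the shape below. This is a minimal sketch rather than a complete file: the file name mylanguage.lbx and the version string are placeholders, and the English values are there only to show the {long form}{short form} pattern that a translator would overwrite.

```latex
% mylanguage.lbx -- hypothetical file name, for illustration only
\ProvidesFile{mylanguage.lbx}[2014/01/01 placeholder version string]

% Each key takes two braced values: the long form, then the abbreviated form.
\DeclareBibliographyStrings{%
  bibliography = {{Bibliography}{Bibliography}},
  references   = {{References}{References}},
  editor       = {{editor}{ed\adddot}},
  editors      = {{editors}{eds\adddot}},
  page         = {{page}{p\adddot}},
  pages        = {{pages}{pp\adddot}},
}

\endinput
```

A real file also needs the "extras" (date formats, ordinals and so on), which according to the reply above the maintainers fill in from the wiki questionnaire.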
Languages based on the Latin alphabet should be encoded in ASCII. That way they will be supported by any backend (BibTeX variants and biber).
Regarding your other questions:
X\adddotspace Y is less likely to have a linebreak between "X." and "Y" than X\adddot\ Y.
Decisions on whitespace and punctuation should ideally be made by the translator. Refer to the manual sections on adding punctuation and whitespace for further details.
country*, patent* and patreq* strings you can imagine that this can get unwieldy.
Audrey, thanks for the quick answer. I'll build the framework it takes to produce some of the several lbx files that will be necessary to really i18n Biblatex. The problem with working on a single lbx file at a time is that you can't compare to nearby languages or change your mind later about a better translation, because after you have written a token, the original is gone. I'll read in all existing lbx files and then produce the database that will be needed to drive the manufacturing of the new ones. There are 45 languages in Babel which are not in Biblatex, so it will take an organized effort of more than just programmers to get there.
I understand the problems of keeping track of changes (via GitHub) and testing are serious, but I'll produce the files and send them ready to you.
Supporting all country* strings is fairly easy! I already have most country names in some 200 languages, so I'll produce the files and make them available to you, if you want to use... but I would strongly recommend the use of a separate file for that - so as not to overload the lbx's.
Before I get started I have one small question. You mention the use of ".\isdot" on the Wiki page, but I do not see any occurrence of that on any of the lbx's. Is that really necessary at this level?
Thanks! Paulo Ney
I'm for doing this if we can have a way for the releaser(s) (which is currently me) to generate all current .lbx files for a release on demand. I would prefer something like a db and a pull interface in Perl (biber is all in Perl ...) which generates the .lbx. If the db was something like SQLite, the db could also be in the biblatex git repo. The problem then is that contributors would either still send diffs against the generated .lbx or would need to look at the db, which is probably out of the question for most people. Text files are easier for this but, as you say, we don't get the coverage or consistency we need in future.
Philipp Lehman wrote the wiki page. I'm not sure what use-case he had in mind for .\isdot. AFAIK it isn't necessary. You could consider using \isdot in place of \adddot if the string preceding is the output of some command, which may or may not end with a period.
About overloading the lbx files or separate files for country-specific strings - this is what I meant by "unwieldy". Certain aspects of the core biblatex styles are demonstrative rather than exhaustive. This is one good example. Users can easily extend the lbx files. If you're wanting to share all those extra strings with others, consider an add-on package.
The DB/spreadsheet could be maintained similar to the localization keys document - just an extra resource, but not necessary for contributing lbx files. Note that only a fraction of the current lbx files are actually complete, so between-language comparisons are limited.
It will take some building, but I think it is the only way to go! Imagine making a structural change and having to change/test 45 lbx files! I also want to build an interface so one can choose 3 to 4 languages to compare/edit in the DB; as the coverage gets bigger that will be more important.
Changes to lbx's made by users or entered directly on GitHub should integrate easily back into the DB... because those will continue to happen!
I'll take a detour and come back when I am able to generate all the current 20 essential lbx files exactly the way they are right now.
PN
I am almost done with the back-end to produce the lbx files from a DB. I can produce lbx files that are almost identical to the existing ones and get some 50 more languages in the fray.... the problem here will be to get Babel to do the same thing ...but at this point I have an important question:
Why are we using a separate i18n LBX set of files, if we could use the ones from the CSL project at
https://github.com/citation-style-language/locales
In my (uninformed) way to view it, there are plenty of reasons to use it instead of the lbx's:
Paulo Ney
I like the idea of using standards like this but there are some things to consider though:
Answering each of your questions/comments:
PN
Well we could consider the CSL route later if they were more to our needs but currently, they're not really. I had this argument with the "generic bib system" people a few years ago - they didn't seem to understand that high-quality bib typesetting needs semantic integration into the typesetting - there is no good "generic" solution ... If you can generate identical .lbx files to our current ones, let's discuss further ... which database are you using?
I can produce identical lbx's already. When they differ, it is because the original lbx's have something wrong - a space out of place, etc ...
I am using MySQL because at the moment it is what I have on one particular server where I am working with someone else on the project, but I am writing very generic code that could be changed to anything.
I would like to add that one more advantage of doing this via the DB is that you can then interface with people all over who are interested in i18n of biblatex. They would just need to enter the data in an interface and their lbx files could be exported and later included in the distribution.
Ok - what language are you using for data extraction and creation of .lbxs?
Perl.
Good. Biber is all in perl too. Perhaps you could send me a MySQL dump and the perl? I'd like to have a look at it.
Sure! Give me some time to wrap it up ... I am sorting out the issues with translations into languages that have "gender" right now (so I can parse in the XML), and I will sort out a few other edges and send you the stuff. It is just one script.
No rush, many thanks. We'd then have to think about hosting this in some way or perhaps using SQLite and keeping just a db file in the git repository etc.
One thing I realized today writing the maps to parse the XML files of CSL, is that they have a nice way to recognize the gender and number (singular or plural) of words in other languages that is NOT present in the lbx file structure!
To translate a phrase like
Translated and Annotated by ...
to languages like Portuguese and Spanish requires one to know the gender of the entity being translated and annotated. If it is a Book or an Album it will be masculine, but if it is a Collection or a Thesis it will be feminine. So I don't really see how this could be done in the realm of the current lbx files.
Would someone mind sharing the wisdom on how these problems will be dealt with?
PN
@aboruvka - do you have a comment on this?
Gender-specific strings come up with idem*. These can be selected on the basis of the gender field.
idemsf: feminine singular form of idem
idemsm: masculine singular form of idem
idemsn: neuter singular form of idem
idempf: feminine plural form of idem
idempm: masculine plural form of idem
idempn: neuter plural form of idem
idempp: plural form of idem suitable for a mixed gender list of names
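As an illustration of how these keys sit in a localisation file, here is a small sketch. The file name is a placeholder and the values are simply the Latin forms; the key suffix corresponds to the value an entry carries in its gender field (sf, sm, sn, pf, pm, pn, pp).

```latex
% Fragment of a hypothetical mylanguage.lbx, for illustration only.
\ProvidesFile{mylanguage.lbx}[2014/01/01 placeholder version string]

% The key suffix mirrors the value of the entry's `gender' field.
\DeclareBibliographyStrings{%
  idemsf = {{eadem}{eadem}},   % gender = {sf}
  idemsm = {{idem}{idem}},     % gender = {sm}
  idemsn = {{idem}{idem}},     % gender = {sn}
  idempf = {{eaedem}{eaedem}}, % gender = {pf}
  idempm = {{eidem}{eidem}},   % gender = {pm}
  idempn = {{eadem}{eadem}},   % gender = {pn}
  idempp = {{eidem}{eidem}},   % gender = {pp}, mixed-gender list of names
}

\endinput
```

Note that this mechanism keys off the gender of the people named in the entry, not the grammatical gender of the work itself, which is the case raised in the question above.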
Some languages use masculine or feminine ordinals depending on the gender of item being indexed (e.g. series or edition). These are handled on the translator's end with the bibliography "extras" questions I mentioned earlier.
For the "by" roles, you could simply add gender/number-specific variants provided that the gender/number of the work is strongly tied to the entrytype (e.g. @book entries are always masculine-singular, @mvbook masculine-plural, @collection feminine-plural, etc). Note that album entrytypes are not formally supported and the @thesis entrytype doesn't support the role fields (only one person works on a thesis anyway).
The same problem has been mentioned in #48 for non-"by" roles, where the gender/number would be specific to the people filling the role. The strings already consider number because this is available in name list processing. Gender would have to be indicated explicitly in the entry somehow.
Thanks! That should do it.
Not quite. There is work on our end to be done. The bibliography extras questions would also need expanding to ask about the gender and number of @article, @book, @mvbook, @inbook, @collection, @incollection, and @mvcollection.
I'm saying it is probably do-able, but we have to consider work required to get this done, the relative demand for the new feature, and potential issues the feature might open up. If PL knew about this limitation and decided not to implement it, he likely had a very good reason.
PLK, Audrey, I am down to the wire, and about to start the last upload to the db and the last series of tests. Should I grab a set of fresh lbx files from the development branch ? Or use the last public release?
Always grab from DEV - it's more up to date ...
One of the hardest things I had to deal with in this side project was the fact that "language" and "locale" are mixed inside BibLaTeX in some unreasonable ways. It is true that most of what it inherits (or uses) from Babel is in the form of language, but the LBX files contain so much about "locale" that it is impossible to do it all in the realm of language only.
When one says that an entry should have "hyphenation = {portuguese}" that is all good and okay, but the entry:
language = {portuguese}
should never be expected to format an entry properly, because Iran, Bahamas, Kazakhstan, ... are written in one way in pt_PT and in another way in pt_BR.
In order to circumvent my difficulties introducing the translated terms into a DB and importing some new ones, I had to literally introduce locales in my table of languages and vice versa... something a programmer should never have to do!
Now that internationalization is really coming, in order to manage this well and be able to expand into languages that have many, many locales, it would be nicer to separate these two roles cleanly. I know that for Portuguese alone there is a portuguese.lbx, portuges.lbx, brazil.lbx and brazilian.lbx, but the way it is laid out makes it extremely hard to maintain, eliminate duplicates and deal with inconsistencies. One should have a unique file "portuguese.lbx" and a couple of additional files, pt-BR.lbx and pt-PT.lbx, that call the main one and define some small local components.
Labeling of language and locale should follow standards (ISO and IETF) so one can interchange with other Bibliography management software and compatibility with the name space of Babel should be an internal issue and the user should never have to deal with that at a bibliography entry level.
Just my 2cents!
Paulo Ney
With the 2.8 DEV branch, I'm moving away from the hyphenation field and renaming it langid since that's what it is: a language ID in babel (or, with 2.8, polyglossia too). There will be a langidopts for specifying polyglossia language options like variant names ("american" and "british" for the langid "english" etc.). The language field is just a printed field, not used to localise anything. It's misleading, I agree.
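To make the renamed fields concrete, here is a minimal sketch of an entry using them. It assumes polyglossia (so xelatex or lualatex, followed by biber) and a biblatex version that already has langid and langidopts; the entry itself is an invented placeholder, and the autolang setting is just one possible choice.

```latex
% Compile with xelatex or lualatex, then biber (polyglossia is assumed here).
\documentclass{article}
\usepackage{polyglossia}
\setdefaultlanguage{english}
\usepackage[autolang=hyphen]{biblatex}

% Placeholder entry, invented for illustration.
\begin{filecontents}{\jobname.bib}
@book{placeholder,
  author     = {Author, Anna},
  title      = {A Placeholder Title},
  year       = {2013},
  langid     = {english},          % language ID used for switching
  langidopts = {variant=british},  % polyglossia options for that language
}
\end{filecontents}
\addbibresource{\jobname.bib}

\begin{document}
\nocite{placeholder}
\printbibliography
\end{document}
```

The language field is deliberately absent: as noted above, it is only printed and plays no part in localisation.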
Lines 461-462 of the english.lbx file have a curious entry:
countryeu = {{European Union}{EU}},
countryep = {{European Union}{EP}},
Can anyone tell me what the second line means?
Paulo Ney
I should have said that I saw this:
\keyitem{countryeu} The name
in the examples, but I continue puzzled by the meaning of it...
Paulo Ney
Good question - @aboruvka - any idea? It looks to me like a copy-paste which should read:
countryep = {{European Patent}{EP}},
?
No idea. I don't think it is a mistake, though, because then countryep would be redundant with patenteu.
I am not sure I understand your phrase! It is redundant, but you don't think it is a mistake ?
Hi People! I am mostly done with the framework to deal with the translations, and I am able now to write "identical" LBX files and at the same time use the DB to do the wonderful things I mentioned, like:
In doing so, there are always a few choices here and there; in the next few e-mails I'll report on the most important ones to make sure you all agree with them. Then later I have a few questions on what is the preferred way to write the files, etc ...
If this is not the correct place for this, please let me know!
Paulo Ney
The first issue had to do with standardizing the way the 'FIXME's were entered in the LBX files. Short of writing my own TeX parser in sed, I chose to standardize the files before parsing them. Some of the issues could be bugs in the lbx files; more on that below. I am laying them out in detail because if you diff the files you will see these differences.
Most commented out strings had a FIXME tag - but not all (for example, danish.lbx has 13 items commented out and without a FIXME marker and swedish.lbx has 1). I changed that so they all get a uniform FIXME tag.
Then every file has a % ending a line that is not finished, all but greek. So I added it to the greek.lbx file. This also changed a few lines in the Russian:
mathesis = {{дис\adddotspace\textellipsis\ маг\adddot}
{дис\adddotspace\textellipsis\ маг\adddot}},
phdthesis = {{дис\adddotspace\textellipsis\ док\adddot}
{дис\adddotspace\textellipsis\ док\adddot}},
candthesis = {{дис\adddotspace\textellipsis\ канд\adddot}
{дис\adddotspace\textellipsis\ канд\adddot}},
Norwegian:
editorco = {{redakt{\o}r og kommentarer}
{red\adddotspace og komm\adddot}},
editorsco = {{redakt{\o}rer og kommentarer}
and Catalan:
byeditorcoin = {{edici\'o, comentaris i introducci\'o a cura \smartof}
{ed.,\addabbrvspace com\adddotspace i intr\adddotspace\smartof}},
byeditorcofo = {{edici\'o, comentaris i pr\`oleg a cura \smartof}
{ed.,\addabbrvspace com\adddotspace i pr\`ol\adddotspace\smartof}},
The greek file is missing a , at the end of the lines:
bycompiler = {{σύνταξη υπό}{σύνταξη υπό}}
byfounder = {{αρχική δημιουργία από}{αρχική δημιουργία από}}
bycontinuator = {{συνέχεια από}{συνέχεια από}}
bycollaborator = {{συνεργασία από}{συνεργασία από}}
withcommentator = {{υπομνηματισμός υπό}{υπομνηματισμός υπό}}
langamerican = {{Αγγλικά}{Αγγλικά}}
Since they are all going to be written by the DB report writer, they will all have the , like the other ones. (Is this a bug? Or is the comma optional?)
Same thing in the "catalan" file.
Most FIXME tags and Observation tags were written after the closing of the entry, but some were not, and were located, as a comment, in the middle of the field, like in:
editorco = {{obrada i komentari}% gender neutral
{obrada i komentari}},
I moved the remarks to be written always to the end of the field as in:
editorco = {{obrada i komentari}
{obrada i komentari}}, % gender neutral
The comments are entered in a "note" field in the db and preserved for future reference and for possibly improving the quality of the translation. There is a chance that someone will want to make a comment on the first part of the field, but that is a minor issue: the comment is always free form and can contain that specification. Files affected by this change are the "finish.lbx"
The french file had non-standard breaks like:
bycollaborator = {{avec la collaboration \smartof}{avec la
coll\adddotspace\smartof}},
The db writes all of them with a standard break at the end of the field now.
I have no idea what to do with comments loose in the file, like this one in "czech.lbx":
% V pripade potreby pouzit lokalni lbx soubor
One question:
I see that the construct
inherit = {german},
is used to organize the Austrian/German and the two Norwegian files and works inside: "\DeclareBibliographyStrings{", along with
\InheritBibliographyExtras{german}
that works from outside
Then the construct:
\InheritBibliographyExtras{english}
\InheritBibliographyStrings{english}
is used to organize the 5 varieties of English (american.lbx, australian.lbx, british.lbx, canadian.lbx, newzealand.lbx) in a somewhat contorted way:
The file UKenglish.lbx points to
\InheritBibliographyStrings{british}
and that points to:
\InheritBibliographyStrings{english}
and the USenglish.lbx file has the same sequence, but passing via another different file "american.lbx" and ending in the same place. It would be way better to name the files:
en.lbx en-UK.lbx en-US.lbx en-AU.lbx en-CA.lbx
and have the lang-COUNTRY.lbx files point to the lang.lbx (en.lbx) for \InheritBibliographyStrings without passing via intermediary files. Then a few symlinks could provide backwards compatibility!
And NONE of it is used to organize the Portuguese files that share some 250 terms! Any reason for that ?
Paulo Ney
Questions on factorization, naming and organization of the files.
I am assuming that the "factored" way of the English and Norwegian files are the preferred way. There is one file for the language itself (norwegian.lbx) and two more localized (norsk.lbx and nynorsk.lbx) that refer to it. So I'll change the Portuguese ones to be factored as well. One language file will contain all common terms and the locale specific files will contain the local differences.
In this way the two country-specific files will be quite small (some 15 to 30 entries only) and a common language file (that does not exist at the moment) will show up. But my biggest issue here is with the names. It will be a mess to carry the current naming scheme beyond the small set of 18 languages we have right now, especially when you take into consideration that certain languages (e.g. Azeri) can be written in many scripts (Arabic, Latin and Cyrillic) in multiple countries (Azerbaijan, Iran, ...) and would by themselves require various files named in weird ways.
The correct way to do this is to name them as:
la-Scrp-CO
where "la" is the ISO language code (2- or 3-letter), "Scrp" is the 4-letter ISO code for the script of the language (if necessary to distinguish it) and CO is the ISO 2-letter code for the location. Examples:
en-US
az-Arab-IR
zh-Hant-TW
bg-BG
pt
pt-BR
pt-PT
and the set of 33 files we have right now, sometimes named after a language (catalan, ngerman, ...), sometimes after a country (american, austrian, ...), sometimes wrongly named (portuges.lbx), sometimes with caps (UKenglish.lbx), sometimes not (british.lbx)... would just become symlinks pointing to the real thing.
Preferably not even the sym links would be visible and this could be moved to an internal file named
compatiblity_with_babel
and the user will not be exposed to it.
PN
And the final one of today on organization of names of countries.
I guess anyone that has seen this list
countryde = {{Germany}{DE}},
countryfr = {{France}{FR}},
countryuk = {{United Kingdom}{GB}},
countryus = {{United States of America}{US}},
feels that it is lacking in a few ways: a bit short, a bit culture-centric, somewhat random, it does not cover the 4 largest publishing countries in the world (missing China, #2, and Russia, #4), ... and it does not even cover the names of the countries we support in the LBX files themselves. Above all, having one country translated and not another in a bibliography is quite weird, especially when the language is written in a different script!
I have one book that cites standard math-books in many different languages and one weird thing there is to have a "United States" in the middle of a perfectly formatted Katakana entry.
The database contains the names of all 250 countries/locales in some 180 combinations of languages and scripts, with possible support for 500 locales. So one could easily include the names of all countries in all the languages that we support, and since humans are not supposed to be reading/editing LBX or XML files anyway, this is quite okay. If someone feels that it decreases the readability/editability of the lbx file, then one could move it apart into a set of files:
countries-en-US
countries-az-Cyrl-AZ
countries-pt-BR
and while we are at it we could also address this problem:
countryeu = {{European Union}{EU}},
countryep = {{European Union}{EP}},
Paulo Ney
@aboruvka is the real expert on this - but it looks like what you are doing is rather nice and I will help in any way I can to integrate this. One problem I see so far is that symlinks aren't going to work on Windows ... we'd have to do backwards compat in some TeX wrapper I think.
lbx filenames follow babel identifiers, so these should not be altered. I've already shared my comments on the country strings - core support is demonstrative, not exhaustive and I don't think we should change this.
I would assume PL created the English files. Portuguese files were contributed, and the author likely wasn't aware of inheritance to avoid redundancy.
The problem is essentially that biblatex loads the .lbx files based on the name passed to the "babel" option (this will be renamed to "autolang" anyway because polyglossia support is now working and so we need to move away from babel-specific naming for options). So, we would need some sort of mapping inside biblatex to handle any .lbx renaming. The problem is that we don't control the babel/polyglossia language names which are also passed the value of the babel/autolang option in order to change hyphenation patterns etc. This is a bit tricky.
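For what it is worth, biblatex already exposes one mapping of this kind at the user level: the preamble command \DeclareLanguageMapping, which ties a babel/polyglossia language name to an .lbx file name. The sketch below only illustrates that mechanism and is not a proposal for how renaming would be handled internally. It maps to an existing file so that it compiles; a renamed file such as pt-BR.lbx is hypothetical and would have to exist on the TeX search path first.

```latex
\documentclass{article}
\usepackage[brazil]{babel}
\usepackage{biblatex}

% \DeclareLanguageMapping{<language>}{<file>}: when the language <language>
% is active, biblatex loads <file>.lbx instead of <language>.lbx.
% Here the babel language `brazil' is served by portuguese.lbx.
\DeclareLanguageMapping{brazil}{portuguese}

% With ISO-style file names (hypothetical, no such file ships today):
% \DeclareLanguageMapping{brazil}{pt-BR}

\begin{document}
Text.
\end{document}
```

The language name on the left still has to be one that babel or polyglossia knows, which is exactly the part biblatex does not control.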
I meant to spend some time this last week researching how symlinks work on Windows, but did not have the time. I understand that they are fully supported in Windows 7, 8 and Vista, but patchily supported in XP and 2000, and a total mess over remotely mounted file systems, so most software packagers stay away from them.
I understand we have to follow Babel because of internal issues, but coordinating the name space with Babel and Polyglossia is not a far-fetched idea ...
Speaking about that, what will be the future of Babel inside BibLaTeX once Polyglossia is supported ? The idea is that Babel will continue to be supported ?
Glad to hear that the internal mapping of the filenames will be necessary even for other reasons. We may one day go away from "portuges" dictated by DOS! It took me more time to identify the languages of all these packages (babel, polyglossia and biblatex) than to write a parser for their files. Reading CSL files was easy because they are named after an ISO standard.
Assuming that you guys are okay with everything else that I raised except for the Countries and Naming, where should I send the initial set of files for testing?
Paulo Ney
Babel will continue to be supported. We need to think a bit about how this would fit into a release workflow unless you are going to be available as a hub for .lbx generation and release? The naming thing also needs some thought. Internally we need names for two things, .lbx file names and babel/polyglossia language names for switching. I haven't had time to think about this yet and won't for a few weeks. I can look at the files etc. but it will be in November.
I can perfectly well serve as the hub for the generation of the lbx files at release time, no problem. You decide on the naming, because there are some internal issues of TeX and cross support, and a bit of a religious war being waged, it seems. Let me know what the names will be and I'll generate the files with ISO names and rename them with a shell script.
The sole fact that BibLaTeX is going to exchange names with Babel and Polyglossia at the same time and the fact that the name space of these two programs is DIFFERENT already says we need a common way to call and load the files! The last thing one needs here is two different ways to load "USEnglish" if you are using Babel or Poly, if not for the sake of the user, at least for your sake of having to deal with it internally.
I was looking at some work that Javier did on language loading for the release this last weekend, and it seems darn easy to add an alternate name to load a language in Babel: preserve everything else and add an alternate name. It looks so easy that even I feel confident I can provide a patch for that, so I am going to ask him and Arthur about providing this.
This is true - we already have alternative names in babel/polyglossia. Perhaps if they would accept the ISO names too, it would make life a lot easier ...
I wonder if someone would elaborate on the de-facto use of
\adddot
\adddotspace
\addcomma
\addcolon
The manual says:
4.7.3 Bibliography and citation styles should always use
these commands instead of literal punctuation marks.
but I see lots of literal punctuation marks ( . , : ) in the lbx files.
Paulo Ney
In most cases the punctuation commands in the string definitions are preferable. The lbx files are contributed and in most cases only lightly edited by the maintainers, so they don't always adhere to best practices.
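A small sketch of the difference, using strings in the style of english.lbx. The file name is a placeholder; the point is only the contrast between a literal "." and the commands, which let biblatex's punctuation tracking know that the dot is an abbreviation dot.

```latex
% Fragment of a hypothetical mylanguage.lbx, for illustration only.
\ProvidesFile{mylanguage.lbx}[2014/01/01 placeholder version string]

\DeclareBibliographyStrings{%
  % Literal punctuation, as found in some contributed files:
  %   pages   = {{pages}{pp.}},
  % The same string written with the punctuation commands:
  pages     = {{pages}{pp\adddot}},
  % \adddotspace = abbreviation dot plus a space inside the string:
  editorco  = {{editor and commentator}{ed\adddotspace and comm\adddot}},
  % \addabbrvspace = space between the parts of an abbreviation:
  andothers = {{et\addabbrvspace al\adddot}{et\addabbrvspace al\adddot}},
}

\endinput
```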
Is there any difference between the constructs:
\DeclareBibliographyStrings{
inherit = {german},
and
\InheritBibliographyStrings{german}
used in the structure of the lbx files ? PN
Both make the current localization module inherit all the string definitions from german.lbx. The former just gives you room to define new strings or redefine some existing ones.
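Side by side, the two forms look roughly like this. The block shows two separate files; both are sketches modelled on the Austrian/German and UKenglish/British arrangements discussed in this thread, with placeholder version strings, and the January override is included only as an example of redefining one inherited string.

```latex
% ---- File 1: inherit, then redefine or add individual strings ----
\ProvidesFile{austrian.lbx}[2014/01/01 placeholder version string]
\InheritBibliographyExtras{german}
\DeclareBibliographyStrings{%
  inherit = {german},                    % start from german.lbx
  january = {{J\"anner}{J\"an\adddot}},  % override a single key
}
\endinput

% ---- File 2: inherit everything unchanged ----
\ProvidesFile{UKenglish.lbx}[2014/01/01 placeholder version string]
\InheritBibliographyExtras{british}
\InheritBibliographyStrings{british}
\endinput
```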
Is there a test sequence for BibLaTeX ? Would running the examples and comparing the results be a good test ? I have finished factoring the locale files and wanted to replace the old one and test to see if I get errors or the same PDF files.
Paulo Ney
I finished with the first part of the work (db infrastructure, parser and lbx-file-writer), and the new files are here:
https://drive.google.com/file/d/0B3mOBzjP3W1nMW9tYUQwcFdMWDQ/edit?usp=sharing
There are 3 new files (en, de and pt) with the extension .lang, for lack of a better choice. They contain the factored strings of the locales that use these 3 languages. I refrained from making any other changes to the files, so even things that are wrong (like the \addot on line 364 of portuguese.lbx) have been left as-is, so you can more easily check that nothing has been added or lost during this process. I tried to do some testing on my own, but that is harder for me: I included the files back (inside each other) and processed the examples, and they all seem to be fine. Hopefully you can test it better.
When I get a green-light on the files, I'll move to fix the things that are wrong with individual translations, etc ...
I can now analyze and compare the translations and more easily complete the terms that are missing and produce the files for new languages.
Paulo Ney
There is a test suite in the git repository but it doesn't verify identical PDF output, just whether there were any errors. I'm not sure how many of the examples actually test language files though. I will probably find time to look next week.
I am trying to build a testing sequence specific to the lbx files, starting from the examples, and there is one that does not run on TeXLive 2013:
03-localization-keys
with an error
(/usr/local/texlive/2013/texmf-dist/tex/latex/latexconfig/color.cfg
! Missing = inserted for \ifnum.
<to be read again>
\bgroup
l.13 }
%
? e
I have now loaded and processed the files from the development version (the previous set had the files of TeXLive 2013), and generated a new set of files, which you can find at:
https://drive.google.com/file/d/0B3mOBzjP3W1nZmptam1lN1I2NkU/edit?usp=sharing
I tested it extensively ...
So I assume these are ready to be checked in, as soon as you can provide a facility for one file to include the other, in the case of the 3 factored languages (en, pt, de).
No strings have been modified; this is just the modification of the lbx files due to the new framework, but where the original lbx files had incorrect syntax (missing commas, etc.) this has been fixed. The most affected files were greek and catalan, but some other ones have been changed as well. If you want a short blurb to explain the changes at check-in time, it could be:
Files are now written from a db for correct syntax, string comparison, language factoring and translation development.
Then on the next releases I'll move to:
PN
Is there a "template" one can use to make the translations to be used in language.lbx? Or should that be done on top of one of the existing files?
I would like to create the files for Romanian, Vietnamese, Chinese and Japanese and I do have people in the office which are capable of making the translations and have experience with Bibliographies, but NONE of them are programmers.
Also: Is there a guide on how to add support for a new language? Even though it is easy to understand what goes on inside \DeclareBibliographyStrings{ }, I would like to know when it is preferable to use tex-encoding as opposed to utf8, for example.
Other questions are:
1- Can one add support for a language that is not supported by Babel?
2- When does one use \adddot and when does one use \adddotspace ?
3- Why country support (within language.lbx) is limited to Germany, EU, US, France and GB ?
4- Are you using a framework to do this? In general it is easier to manage them in a single spreadsheet with the translations to each language in each column and a script that reads the column and writes the LBX files! The translators can then easily compare to "nearby" languages and easily make other translations.
Is work by others on this kind of issue welcomed ?
Thanks for the great package! Paulo Ney