Use standardized language identifiers for lbx files

pauloney commented 10 years ago

Is there a "template" one can use to make the translations to be used in language.lbx? Or should that be done on top of one of the existing files?

I would like to create the files for Romanian, Vietnamese, Chinese and Japanese, and I do have people in the office who are capable of making the translations and have experience with bibliographies, but NONE of them are programmers.

Also: is there a guide on how to add support for a new language? Even though it is easy to understand what goes on inside \DeclareBibliographyStrings{ }, I would like to know when it is preferable to use TeX encoding as opposed to utf8, for example.

Other questions are:

1- Can one add support for a language that is not supported by Babel?

2- When does one use \adddot and when does one use \adddotspace?

3- Why is country support (within language.lbx) limited to Germany, EU, US, France and GB?

4- Are you using a framework to do this? In general it is easier to manage them in a single spreadsheet with the translations to each language in each column and a script that reads the column and writes the LBX files! The translators can then easily compare to "nearby" languages and easily make other translations.
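The spreadsheet idea in question 4 could look roughly like the sketch below. It assumes a CSV export with one row per localisation key and one column per language; the column layout and the `long|short` cell convention are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical spreadsheet export: one row per key, one column per language.
# Each cell holds "long form|short form"; an empty cell means "not translated yet".
SHEET = """key,english,german
editor,editor|ed\\adddot,Herausgeber|Hrsg\\adddot
volume,volume|vol\\adddot,
"""

def read_columns(text):
    """Return {language: {key: (long, short)}} from the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    table = {lang: {} for lang in reader.fieldnames if lang != "key"}
    for row in reader:
        for lang in table:
            cell = row[lang]
            if cell:  # skip untranslated cells so gaps stay visible
                long_form, short_form = cell.split("|", 1)
                table[lang][row["key"]] = (long_form, short_form)
    return table

strings = read_columns(SHEET)
missing_in_german = sorted(set(strings["english"]) - set(strings["german"]))
print(missing_in_german)
```

Keeping untranslated cells empty, rather than copying the English text, lets a translator see at a glance which keys still need work in each column.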

Is work by others on this kind of issue welcomed ?

Thanks for the great package! Paulo Ney

aboruvka commented 10 years ago

General instructions can be found at the old SF wiki for biblatex (edit: that page is now available in an updated version on the GitHub wiki, please use https://github.com/plk/biblatex/wiki/Checklist-for-submitting-a-new-localisation-file-(.lbx)). For testing, see example 03-localization-keys in the documentation. You can use english.lbx as a starting point. Just complete \DeclareBibliographyStrings; we take care of the rest on the basis of your answers to the questions listed in the wiki.

Languages based on the Latin alphabet should be encoded in ASCII. That way they will be supported by any backend (BibTeX variants and biber).

Regarding your other questions:

1- Not yet. The other maintainers are working on polyglossia support, but this is not an easy task.
2- The output of X\adddotspace Y is less likely to have a line break between "X." and "Y" than X\adddot\ Y. Decisions on whitespace and punctuation should ideally be made by the translator. Refer to the manual sections on adding punctuation and whitespace for further details.
3- From the manual: "only a small number of country names is defined by default, mainly to illustrate this scheme". If we supported all possible country*, patent* and patreq* strings, you can imagine that this could get unwieldy.
4- There are no further support files aside from the resources I mentioned above. A script might be helpful, but there are a few things to keep in mind: (1) contributors are working on different platforms, (2) version control of the spreadsheet could prove challenging with multiple editors and (3) for testing, working with the lbx file directly is probably more convenient.

pauloney commented 10 years ago

Audrey, thanks for the quick answer. I'll build the framework it takes to produce the several lbx files that will be necessary to really i18n Biblatex. The problem with working on a single lbx file at a time is that you can't compare to nearby languages or change your mind later about a better translation - because after you have written a token, the original is gone. I'll read in all existing lbx files and then produce the database that will be needed to drive the manufacturing of the new ones. There are 45 languages in Babel which are not in Biblatex, so it will take an organized effort of more than just programmers to get there.
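Reading the existing files into a database starts with pulling the `key = {{long}{short}}` pairs out of each file. The sketch below is one possible way to do that, not the script used in this thread; the function names are hypothetical, and it assumes that braces in the strings are balanced and that comments contain no braces.

```python
import re

# A key at the start of a line, followed by "=" and an opening brace.
ENTRY = re.compile(r"^\s*(\w+)\s*=\s*(?=\{)", re.MULTILINE)

def read_group(text, pos):
    """Return (content, end) for the balanced {...} group starting at text[pos]."""
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[pos + 1:i], i + 1
    raise ValueError("unbalanced braces")

def parse_strings(body):
    """Parse 'key = {{long}{short}},' entries into {key: (long, short)}."""
    strings = {}
    for match in ENTRY.finditer(body):
        outer, _ = read_group(body, match.end())
        if not outer.startswith("{"):
            continue  # e.g. 'inherit = {german}' is not a string pair
        long_form, end = read_group(outer, 0)
        short_form, _ = read_group(outer, outer.index("{", end))
        strings[match.group(1)] = (long_form, short_form)
    return strings

SAMPLE = r"""
  inherit          = {german},
  editorco         = {{redakt{\o}r og kommentarer}%
                      {red\adddotspace og komm\adddot}},
  countryde        = {{Germany}{DE}},
"""

parsed = parse_strings(SAMPLE)
print(parsed["editorco"])
```

Matching braces by depth rather than with a single regular expression is what lets nested groups such as `{\o}` survive intact.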

I understand the problems of keeping track of changes (via GitHub) and testing are serious, but I'll produce the files and send them ready to you.

Supporting all country* strings is fairly easy! I already have most country names in some 200 languages, so I'll produce the files and make them available to you, if you want to use... but I would strongly recommend the use of a separate file for that - so as not to overload the lbx's.

Before I get started I have one small question. You mention the use of ".\isdot" on the Wiki page, but I do not see any occurrence of that on any of the lbx's. Is that really necessary at this level?

Thanks! Paulo Ney

plk commented 10 years ago

I'm for doing this if we can have a way for the releaser(s) (which is currently me) to generate all current .lbx files for a release on demand. I would prefer something like a db and a pull interface in Perl (biber is all in perl ...) which generates the .lbx. If the db was something like SQLite, the db could also be in the biblatex git repo. The problem then is that contributors would either still send diffs against the generated .lbx or would need to look at the db, which is probably out of the question for most people. Text files are easier for this but, as you say, we don't get the coverage or consistency we need in future.

aboruvka commented 10 years ago

Philipp Lehman wrote the wiki page. I'm not sure what use-case he had in mind for .\isdot. AFAIK it isn't necessary. You could consider using \isdot in place of \adddot if the string preceding is the output of some command, which may or may not end with a period.

About overloading the lbx files or separate files for country-specific strings - this is what I meant by "unwieldy". Certain aspects of the core biblatex styles are demonstrative rather than exhaustive. This is one good example. Users can easily extend the lbx files. If you're wanting to share all those extra strings with others, consider an add-on package.

The DB/spreadsheet could be maintained similar to the localization keys document - just an extra resource, but not necessary for contributing lbx files. Note that only a fraction of the current lbx files are actually complete, so between-language comparisons are limited.

pauloney commented 10 years ago

It will take some building, but I think it is the only way to go! Imagine making a structural change and having to change/test 45 lbx files! I also want to build an interface so one can choose 3 to 4 languages to compare/edit in the DB - as the coverage gets bigger that will be more important.

Changes to lbx's made by users or entered directly on GitHub should integrate easily back into the DB... because those will continue to happen!

I'll take a detour and come back when I am able to generate all the current 20 essential lbx files exactly the way they are right now.

PN

pauloney commented 10 years ago

I am almost done with the back-end to produce the lbx files from a DB. I can produce lbx files that are almost identical to the existing ones and get some 50 more languages in the fray.... the problem here will be to get Babel to do the same thing ...but at this point I have an important question:

Why are we using a separate i18n LBX set of files, if we could use the ones from the CSL project at

   https://github.com/citation-style-language/locales

In my (uninformed) way to view it, there are plenty of reasons to use it instead of the lbx's:

They are ready!
They are in XML, making it a lot easier to test consistency, etc ...

Paulo Ney

plk commented 10 years ago

I like the idea of using standards like this but there are some things to consider though:

Do they cover all of our strings?
We'd have to convert to .lbx because it's much faster for biblatex to read them since they are TeX. XML parsing in TeX is not something you ever want to do.
We'd have to make sure babel/polyglossia language ids are correct.
We'd have to support things like \adddot etc. somehow since lots of .lbx files do this.
There are special things in some .lbx files - all sorts of biblatex settings - we'd have to insert those.

pauloney commented 10 years ago

Answering each of your questions/comments:

1- No! There is a lot in common, but coverage is different: they have some strings that we don't, and the same in reverse. The interesting question is WHY? They should in fact be almost the same, since the problem is the same! :)
2- That settles it! I am glad to have a very definite argument! :(. Instead of converting from XML --> LBX and running the danger of not getting a complete lbx file back, what I am doing is parsing all LBX and XML files into the database, sorting out some conflicting areas by hand, and then exporting far more complete LBX files, adding a few languages in the process.
3- This is an area that deserves some immediate standardization! It is wrong to do it by "language" because of the pt_PT/pt_BR, en_GB/en_US/en_CA,... discrepancies. The files should really be labeled by "locale" (which is a standard), and possibly we should ask the Babel/Polyglossia people to do the same. If you look at the way Babel names the files, there is NO procedure in place; each one got named at one point in time in a different way - including "portuges.lbx", which was named in this fashion (with two errors) because of the DOS restriction on filenames.
4- That is the case with the XML files as well, since there are abbreviations that use a DOT and some that don't... unless I am missing something here.
5- I am dealing with it by considering that every lbx file has a (fixed) pre-amble and a post-amble, and each of them gets picked up and built at the time the file is generated.
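The pre-amble/strings/post-amble assembly described in the last answer might be sketched as follows. This is not the generator discussed here: the function name is hypothetical and the wrapper lines are a deliberately minimal stand-in, since real lbx files carry more (for example a \DeclareBibliographyExtras block). The key column is padded so that the `=` signs line up as in the existing files.

```python
def write_lbx(name, strings):
    """Render {key: (long, short)} as the text of a minimal .lbx file."""
    # Hypothetical fixed pre-amble; a real file has more than this.
    lines = ["\\ProvidesFile{" + name + ".lbx}", "", "\\DeclareBibliographyStrings{%"]
    for key, (long_form, short_form) in strings.items():
        # Pad the key to 16 columns so the "=" signs align.
        lines.append("  " + key.ljust(16) + " = {{" + long_form + "}{" + short_form + "}},")
    # Hypothetical fixed post-amble.
    lines += ["}", "", "\\endinput", ""]
    return "\n".join(lines)

text = write_lbx("example", {
    "countryde": ("Germany", "DE"),
    "countryfr": ("France", "FR"),
})
print(text)
```

Because every entry goes through the same formatting line, spacing and trailing commas come out uniform no matter how the source data was entered.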

PN

plk commented 10 years ago

Well we could consider the CSL route later if they were more to our needs but currently, they're not really. I had this argument with the "generic bib system" people a few years ago - they didn't seem to understand that high-quality bib typesetting needs semantic integration into the typesetting - there is no good "generic" solution ... If you can generate identical .lbx files to our current ones, let's discuss further ... which database are you using?

pauloney commented 10 years ago

I can produce identical lbx's already. When they differ, it is because the original lbx's have something wrong - a space out of place, etc ...

I am using MySQL because at the moment it is what I have on one particular server where I am working with someone else on the project, but I am writing very generic code that could be changed to anything.

I would like to add that one more advantage of doing this via the DB is that you can then interface with people all over who are interested in i18n of biblatex. They would just need to enter the data in an interface and their lbx files could be exported and later included in the distribution.

plk commented 10 years ago

Ok - what language are you using for data extraction and creation of .lbxs?

pauloney commented 10 years ago

Perl.

plk commented 10 years ago

Good. Biber is all in perl too. Perhaps you could send me a MySQL dump and the perl? I'd like to have a look at it.

pauloney commented 10 years ago

Sure! Give me some time to wrap it up ... I am sorting out the issues with translations into languages that have "gender" right now (so I can parse in the XML), and I will sort out a few other rough edges and send you the stuff. It is just one script.

plk commented 10 years ago

No rush, many thanks. We'd then have to think about hosting this in some way or perhaps using SQLite and keeping just a db file in the git repository etc.

pauloney commented 10 years ago

One thing I realized today writing the maps to parse the XML files of CSL, is that they have a nice way to recognize the gender and number (singular or plural) of words in other languages that is NOT present in the lbx file structure!

To translate a phrase like

Translated and Annotated by ...

to languages like Portuguese and Spanish requires one to know the gender of the entity being translated and annotated. If it is a book or an album it will be masculine, but if it is a collection or a thesis it will be feminine. So I don't really see how this could be done in the realm of the current lbx files.

Would someone mind sharing the wisdom on how these problems will be dealt with?

PN

plk commented 10 years ago

@aboruvka - do you have a comment on this?

aboruvka commented 10 years ago

Gender specific strings come up with idem*. These can be selected on the basis of the gender field.

idemsf  feminine singular form of idem
idemsm  masculine singular form of idem
idemsn  neuter singular form of idem
idempf  feminine plural form of idem
idempm  masculine plural form of idem
idempn  neuter plural form of idem
idempp  plural form of idem suitable for a mixed gender list of names
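The naming pattern of these keys can be shown in a few lines. The sketch assumes the entry's gender field holds one of the seven two-letter codes that appear as suffixes in the list (sf, sm, sn, pf, pm, pn, pp); the function name is hypothetical and this is an illustration of the scheme, not biblatex's own implementation.

```python
GENDER_CODES = {"sf", "sm", "sn", "pf", "pm", "pn", "pp"}

def idem_key(gender):
    """Build the localisation key for 'idem' from a gender/number code."""
    if gender not in GENDER_CODES:
        raise ValueError("unknown gender code: " + repr(gender))
    return "idem" + gender

print(idem_key("sf"))
print(idem_key("pp"))
```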

Some languages use masculine or feminine ordinals depending on the gender of item being indexed (e.g. series or edition). These are handled on the translator's end with the bibliography "extras" questions I mentioned earlier.

For the "by" roles, you could simply add gender/number-specific variants provided that the gender/number of the work is strongly tied to the entrytype (e.g. @book entries are always masculine-singular, @mvbook masculine-plural, @collection feminine-plural, etc). Note that album entrytypes are not formally supported and the @thesis entrytype doesn't support the role fields (only one person works on a thesis anyway).

The same problem has been mentioned in #48 for non-"by" roles, where the gender/number would be specific to the people filling the role. The strings already consider number because this is available in name list processing. Gender would have to be indicated explicitly in the entry somehow.

pauloney commented 10 years ago

Thanks! That should do it.

aboruvka commented 10 years ago

Not quite. There is work on our end to be done. The bibliography extras questions would also need expanding to ask about the gender and number of @article, @book, @mvbook, @inbook, @collection, @incollection, and @mvcollection.

I'm saying it is probably do-able, but we have to consider work required to get this done, the relative demand for the new feature, and potential issues the feature might open up. If PL knew about this limitation and decided not to implement it, he likely had a very good reason.

pauloney commented 10 years ago

PLK, Audrey, I am down to the wire, and about to start the last upload to the db and the last series of tests. Should I grab a set of fresh lbx files from the development branch ? Or use the last public release?

plk commented 10 years ago

Always grab from DEV - it's more up to date ...

pauloney commented 10 years ago

One of the hardest things I had to deal with in this side project was the fact that "language" and "locale" are mixed inside BibLatex in some unreasonable ways. It is true that most of what it inherits (or uses) from Babel is in the form of language, but the LBX files contain so much about "locale" that it is impossible to do it all in the realm of language only.

When one says that an entry should have "hyphenation = {portuguese}" that is all good and okay, but the entry:

language = {portuguese}

should never be expected to format an entry properly, because Iran, Bahamas, Kazakhstan, ... are written in one way in pt_PT and in another way in pt_BR.

In order to circumvent my difficulties introducing the translated terms in a DB and importing some new ones, I had to literally introduce locales in my table of languages and vice-versa... something a programmer should never have to do!

Now that internationalization is really coming, in order to manage this well and be able to expand in the realm of languages that have many, many locales, it would be nicer to split these two roles well. I know that, for Portuguese alone, there is a portuguese.lbx, portuges.lbx, brazil.lbx and brazilian.lbx - but it is extremely hard to maintain in the way it is laid out, to eliminate duplicates and to deal with inconsistencies. One should have a unique file "portuguese.lbx" and a couple of additional files, pt-BR.lbx and pt-PT.lbx, that call the main one and define some small local components.

Labeling of language and locale should follow standards (ISO and IETF) so one can interchange with other bibliography management software. Compatibility with the name space of Babel should be an internal issue, and the user should never have to deal with that at the bibliography entry level.

Just my 2cents!

Paulo Ney

plk commented 10 years ago

With the 2.8 DEV branch, I'm moving away from the hyphenation field and re-naming it langid since that's what it is - it's a language ID in babel (or, with 2.8, polyglossia too). There will be a langidopts for specifying polyglossia language options like variant names ("american" and "british" for the langid "english" etc.). The language field is just a printed field - not used to localise anything - it's misleading, I agree.

pauloney commented 10 years ago

Lines 461-462 of the english.lbx file have a curious entry:

  countryeu        = {{European Union}{EU}},
  countryep        = {{European Union}{EP}},

can anyone tell me what the second line means ?

Paulo Ney

pauloney commented 10 years ago

I should have said that I saw this:

\keyitem{countryeu} The name "European Union", abbreviated as \vrb{EU}.
\keyitem{countryep} Similar to \vrb{countryeu} but abbreviated as \vrb{EP}. This is intended for \bibfield{patent} entries.

in the examples, but I am still puzzled by the meaning of it...

Paulo Ney

plk commented 10 years ago

Good question - @aboruvka - any idea? It looks to me like a copy-paste which should read:

countryep = {{European Patent}{EP}},

?

aboruvka commented 10 years ago

No idea. I don't think it is a mistake, though, because then countryep would be redundant with patenteu.

pauloney commented 10 years ago

I am not sure I understand your phrase! It is redundant, but you don't think it is a mistake ?

pauloney commented 10 years ago

Hi People! I am mostly done with the framework to deal with the translations, and I am able now to write "identical" LBX files and at the same time use the DB to do the wonderful things I mentioned, like:

acquire new translations
acquire translations from other Open Source projects like CSL
check on the quality of translations of each token individually
use the power of the db to complete many of the incomplete lbx files
write many more (about 150) other lbx files.
organize/name files according to ISO standards.

In doing so, there are always a few choices here and there; in the next few e-mails I'll report on the most important ones to make sure you all agree with them. Then later I have a few questions on what the preferred way to write the files is, etc ...

If this is not the correct place for this, please let me know!

Paulo Ney

pauloney commented 10 years ago

The first issue had to do with the standardization of the way the 'FIXME's were entered in the LBX files. Short of writing my own TeX parser in sed, I chose to standardize the files before parsing them. Some of the issues could be bugs in the lbx files; more on that below. I am laying them out in detail because if you diff the files you will see these differences.

Most commented out strings had a FIXME tag - but not all (for example, danish.lbx has 13 items commented out and without a FIXME marker and swedish.lbx has 1). I changed that so they all get a uniform FIXME tag.

Then every file has a % ending a line that is not finished, all but greek. So I added it to the greek.lbx file. This also changed a few lines in the Russian:

  mathesis         = {{дис\adddotspace\textellipsis\ маг\adddot}
                      {дис\adddotspace\textellipsis\ маг\adddot}},
  phdthesis        = {{дис\adddotspace\textellipsis\ док\adddot}
                      {дис\adddotspace\textellipsis\ док\adddot}},
  candthesis       = {{дис\adddotspace\textellipsis\ канд\adddot}
                      {дис\adddotspace\textellipsis\ канд\adddot}},

Norwegian:

  editorco         = {{redakt{\o}r og kommentarer}
                      {red\adddotspace og komm\adddot}},
  editorsco        = {{redakt{\o}rer og kommentarer}

and Catalan:

  byeditorcoin     = {{edici\'o, comentaris i introducci\'o a cura \smartof}
                      {ed.,\addabbrvspace com\adddotspace i intr\adddotspace\smartof}},
  byeditorcofo     = {{edici\'o, comentaris i pr\`oleg a cura \smartof}
                      {ed.,\addabbrvspace com\adddotspace i pr\`ol\adddotspace\smartof}},

The greek file is missing a , at the end of the lines:

  bycompiler       = {{σύνταξη υπό}{σύνταξη υπό}}
  byfounder        = {{αρχική δημιουργία από}{αρχική δημιουργία από}}
  bycontinuator    = {{συνέχεια από}{συνέχεια από}}
  bycollaborator   = {{συνεργασία από}{συνεργασία από}}
  withcommentator  = {{υπομνηματισμός υπό}{υπομνηματισμός υπό}}
  langamerican     = {{Αγγλικά}{Αγγλικά}}

Since they are all going to be written by the DB report writer, they will all have the , like the other ones. (Is this a bug? Or is the comma optional?)

Same thing in the "catalan" file.
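A pre-parse clean-up for the missing commas could be done along these lines. The sketch is an assumption about how such a pass might work, not the procedure actually used here: it tracks brace depth line by line and appends a comma wherever an entry closes without one, assuming the input is only the body of \DeclareBibliographyStrings and that the strings contain no escaped `\%`, `\{` or `\}`.

```python
def add_missing_commas(body):
    """Append ',' to every entry that closes at brace depth 0 without one."""
    fixed, depth = [], 0
    for line in body.splitlines():
        # Split off a trailing TeX comment so its contents are not counted.
        code, sep, comment = line.partition("%")
        depth += code.count("{") - code.count("}")
        stripped = code.rstrip()
        if depth == 0 and stripped.endswith("}"):
            line = stripped + "," + code[len(stripped):] + sep + comment
        fixed.append(line)
    return "\n".join(fixed)

GREEK = """  bycompiler       = {{σύνταξη υπό}{σύνταξη υπό}}
  byfounder        = {{αρχική δημιουργία από}{αρχική δημιουργία από}}
  langamerican     = {{Αγγλικά}{Αγγλικά}},"""

print(add_missing_commas(GREEK))
```

Entries that already end in a comma are left alone, so running the pass twice gives the same result as running it once.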

Most FIXME tags and observation tags were written after the closing of the entry, but some were not and were located, as a comment, in the middle of the field, as in:

 editorco         = {{obrada i komentari}% gender neutral
                      {obrada i komentari}},

I moved the remarks to be written always to the end of the field as in:

 editorco         = {{obrada i komentari}
                      {obrada i komentari}}, % gender neutral

The comments are entered in a "note" field in the db and preserved for future reference and for possibly improving the quality of the translation. There is a chance that someone will want to make a comment on the first part of the field, but that is a minor issue - the comment is always free form and can contain that specification. The file affected by this change is "finnish.lbx".

The french file had non-standard breaks like:

  bycollaborator   = {{avec la collaboration \smartof}{avec la
      coll\adddotspace\smartof}},

The db writes all of them with a standard break at the end of the field now.

I have no idea what to do with comments loose in the file like this one in "czech.lbx":

  % V pripade potreby pouzit lokalni lbx soubor

pauloney commented 10 years ago

One question:

I see that the construct

  inherit          = {german},

is used to organize the Austrian/German and the two Norwegian files and works inside: "\DeclareBibliographyStrings{", along with

\InheritBibliographyExtras{german}

that works from outside

Then the construct:

\InheritBibliographyExtras{english}
\InheritBibliographyStrings{english}

is used to organize the 5 varieties of English (american.lbx, australian.lbx, british.lbx, canadian.lbx, newzealand.lbx) in a somewhat contorted way:

The file UKenglish.lbx points to

\InheritBibliographyStrings{british}

and that points to:

\InheritBibliographyStrings{english}

and the USenglish.lbx file has the same sequence, but passing via another different file "american.lbx" and ending in the same place. It would be way better to name the files:

en.lbx en-UK.lbx en-US.lbx en-AU.lbx en-CA.lbx

and have the lang-COUNTRY.lbx files point to the lang.lbx (en.lbx) for \InheritBibliographyStrings without passing via intermediary files. Then a few symlinks could provide backwards compatibility!

And NONE of it is used to organize the Portuguese files that share some 250 terms! Any reason for that ?

Paulo Ney

pauloney commented 10 years ago

Questions on factorization, naming and organization of the files.

I am assuming that the "factored" way of the English and Norwegian files is the preferred way. There is one file for the language itself (norwegian.lbx) and two more localized ones (norsk.lbx and nynorsk.lbx) that refer to it. So I'll change the Portuguese ones to be factored as well. One language file will contain all common terms and the locale-specific files will contain the local differences.
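The factoring step amounts to splitting each set of locale files into the strings they all share and the leftovers that differ. A minimal sketch, assuming each file has already been read into a dictionary; the function name, the locale labels and the `countryir` key are hypothetical and used only for illustration.

```python
def factor(locales):
    """Split {locale: {key: value}} into shared strings and per-locale leftovers."""
    first, *others = locales.values()
    # A string is common only if every locale has the same value for the key.
    common = {k: v for k, v in first.items()
              if all(o.get(k) == v for o in others)}
    specific = {name: {k: v for k, v in strings.items() if k not in common}
                for name, strings in locales.items()}
    return common, specific

# Illustrative data only, not copied from the real Portuguese files.
portuguese = {
    "pt-PT": {"editor": ("editor", "ed\\adddot"), "countryir": ("Irão", "IR")},
    "pt-BR": {"editor": ("editor", "ed\\adddot"), "countryir": ("Irã", "IR")},
}

common, specific = factor(portuguese)
print(common)
print(specific)
```

The `common` dictionary would become the shared language file, and each entry of `specific` the small locale file that inherits from it.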

In this way the two country-specific files will be quite small (some 15 to 30 entries only) and a common language file (that does not exist at the moment) will show up. But my biggest issue here is with the names. It will be a mess to carry the current naming scheme beyond the small set of 18 languages we have right now - especially when you take into consideration that certain languages (e.g. Azeri) can be written in many scripts (Arabic, Latin and Cyrillic) in multiple countries (Azerbaijan, Iran, ...) and would by themselves require various files named in weird ways.

The correct way to do this is to name them as:

la-Scrp-CO

where "la" is the ISO language code (2- or 3-letter), "Scrp" is the 4-letter ISO code for the script of the language (if necessary to distinguish it) and CO is the ISO 2-letter code for the location. Examples:

en-US
az-Arab-IR
zh-Hant-TW
bg-BG
pt
pt-BR
pt-PT

and the set of 33 files we have right now, sometimes named after a language (catalan, ngerman, ...), sometimes after a country (american, austrian, ...), sometimes wrongly named (portuges.lbx), sometimes with caps (UKenglish.lbx), sometimes not (british.lbx)... would just become symlinks pointing to the real thing.
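One practical benefit of the la-Scrp-CO pattern is that file names become machine-checkable. The sketch below validates and splits such a tag; it covers only the three-part subset described in this comment (language, optional script, optional region), not the full IETF language tag grammar, and the function name is hypothetical.

```python
import re

# language: 2-3 lowercase letters; script: 4 letters, capitalised; region: 2 capitals.
TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[A-Z][a-z]{3}))?"
    r"(?:-(?P<region>[A-Z]{2}))?$"
)

def parse_tag(tag):
    """Split a tag such as 'az-Arab-IR' into (language, script, region)."""
    match = TAG.match(tag)
    if match is None:
        raise ValueError("not a la-Scrp-CO tag: " + repr(tag))
    return match.group("language"), match.group("script"), match.group("region")

for tag in ["en-US", "az-Arab-IR", "zh-Hant-TW", "bg-BG", "pt", "pt-BR", "pt-PT"]:
    print(tag, parse_tag(tag))
```

Names from the current scheme, such as UKenglish or portuges, are rejected by the pattern, which is what makes a consistency check over the whole file set possible.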

Preferably not even the sym links would be visible and this could be moved to an internal file named

compatiblity_with_babel

and the user will not be exposed to it.
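The internal compatibility table could be as simple as a lookup from legacy names to standardized tags. The entries below are illustrative guesses at what such a table might contain, not an agreed mapping, and the function name is hypothetical.

```python
# Hypothetical alias table: legacy babel-style names -> standardized tags.
LEGACY_NAMES = {
    "american": "en-US",
    "USenglish": "en-US",
    "austrian": "de-AT",
    "brazil": "pt-BR",
    "brazilian": "pt-BR",
}

def lbx_for(name):
    """Return the .lbx file to load for a legacy name or a standardized tag."""
    return LEGACY_NAMES.get(name, name) + ".lbx"

print(lbx_for("USenglish"))
print(lbx_for("pt-PT"))
```

Unlike symlinks, a table like this does not depend on file system support, and several legacy spellings can point at the same file.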

PN

pauloney commented 10 years ago

And the final one of today on organization of names of countries.

I guess anyone that has seen this list

  countryde        = {{Germany}{DE}},
  countryfr        = {{France}{FR}},
  countryuk        = {{United Kingdom}{GB}},
  countryus        = {{United States of America}{US}},

feels that it is lacking in a few ways: a bit short, a bit culture-centric, somewhat random, it does not cover the 4 largest publishing countries in the world (missing China - #2 and Russia - #4), ... and it does not even cover the names of the countries we support in the LBX files themselves. Above all, having one country translated and not another in a bibliography is quite weird - especially when the language is written in a different script!

I have one book that cites standard math-books in many different languages and one weird thing there is to have a "United States" in the middle of a perfectly formatted Katakana entry.

The database contains the names of all 250 countries/locales in some 180 combinations of languages and scripts, with possible support for 500 locales. So one could easily include the names of all countries in all the languages that we support, and since humans are not supposed to be reading/editing LBX or XML files anyway, this is quite okay. If someone feels that it decreases the readability/editability of the lbx file, then one could move it apart into a set of files:

countries-en-US
countries-az-Cyrl-AZ
countries-pt-BR

and while we are at it we could also address this problem:

  countryeu        = {{European Union}{EU}},
  countryep        = {{European Union}{EP}},

Paulo Ney

plk commented 10 years ago

@aboruvka is the real expert on this - but it looks like what you are doing is rather nice and I will help in any way I can to integrate this. One problem I see so far is that symlinks aren't going to work on Windows ... we'd have to do backwards compat in some TeX wrapper I think.

aboruvka commented 10 years ago

lbx filenames follow babel identifiers, so these should not be altered. I've already shared my comments on the country strings - core support is demonstrative, not exhaustive and I don't think we should change this.

I would assume PL created the English files. Portuguese files were contributed, and the author likely wasn't aware of inheritance to avoid redundancy.

plk commented 10 years ago

The problem is essentially that biblatex loads the .lbx files based on the name passed to the "babel" option (this will be renamed to "autolang" anyway because polyglossia support is now working and so we need to move away from babel-specific naming for options). So, we would need some sort of mapping inside biblatex to handle any .lbx renaming. The problem is that we don't control the babel/polyglossia language names which are also passed the value of the babel/autolang option in order to change hyphenation patterns etc. This is a bit tricky.

pauloney commented 10 years ago

I meant to spend some time this last week researching how symlinks work on Windows, but did not have the time. I understand that they are fully supported in Windows 7, 8 and Vista, but patchily supported in XP and 2000, and a total mess over remotely mounted file systems, so most software packagers stay away from them. I understand we have to follow Babel because of internal issues, but coordinating the name space with Babel and Polyglossia is not a far-fetched idea ...

Speaking about that, what will be the future of Babel inside BibLaTeX once Polyglossia is supported ? The idea is that Babel will continue to be supported ?

Glad to hear that the internal mapping of the filenames will be necessary even for other reasons. We may one day go away from "portuges" dictated by DOS! It took me more time to identify the languages of all these packages (babel, polyglossia and biblatex) than to write a parser for their files. Reading CSL files was easy because they are named after an ISO standard.

Assuming that you guys are okay with everything else that I raised except for the countries and naming, where should I send the initial set of files for testing?

Paulo Ney

plk commented 10 years ago

Babel will continue to be supported. We need to think a bit about how this would fit into a release workflow, unless you are going to be available as a hub for .lbx generation and release? The naming thing also needs some thought. Internally we need names for two things: .lbx file names and babel/polyglossia language names for switching. I haven't had time to think about this yet and won't for a few weeks. I can look at the files etc. but it will be in November.

pauloney commented 10 years ago

I can perfectly well serve as the hub for the generation of the lbx files at release time, no problem. You decide on the naming, because there are some internal issues of TeX and cross support, and a bit of a religious war being waged, it seems. Let me know what the names will be and I'll generate the files with ISO names and rename them with a shell script.

The sole fact that BibLaTeX is going to exchange names with Babel and Polyglossia at the same time and the fact that the name space of these two programs is DIFFERENT already says we need a common way to call and load the files! The last thing one needs here is two different ways to load "USEnglish" if you are using Babel or Poly, if not for the sake of the user, at least for your sake of having to deal with it internally.

I was looking at some work that Javier did on language loading for the release this last weekend, and it seems darn easy to add an alternate name to load a language in Babel - preserve everything else and add an alternate name. It looks so easy that even I feel confident I can provide a patch for that, so I am going to ask him and Arthur about providing this.

plk commented 10 years ago

This is true - we already have alternative names in babel/polyglossia. Perhaps if they would accept the ISO names too, it would make life a lot easier ...

pauloney commented 10 years ago

I wonder if someone would elaborate on the de-facto use of

\adddot
\adddotspace
\addcomma
\addcolon

The manual says:

4.7.3 Bibliography and citation styles should always use
these commands instead of literal punctuation marks.

but I see lots of literal punctuation marks ( . , : ) in the lbx files.

Paulo Ney

aboruvka commented 10 years ago

In most cases the punctuation commands in the string definitions are preferable. The lbx files are contributed and in most cases only lightly edited by the maintainers, so they don't always adhere to best practices.

pauloney commented 10 years ago

Is there any difference between the constructs:

\DeclareBibliographyStrings{
  inherit          = {german},

and

\InheritBibliographyStrings{german}

used in the structure of the lbx files ? PN

aboruvka commented 10 years ago

Both make the current localization module inherit all the string definitions from german.lbx. The former just gives you room to define new strings or redefine some existing ones.

pauloney commented 10 years ago

Is there a test sequence for BibLaTeX ? Would running the examples and comparing the results be a good test ? I have finished factoring the locale files and wanted to replace the old one and test to see if I get errors or the same PDF files.

Paulo Ney

pauloney commented 10 years ago

I finished with the first part of the work (db infrastructure, parser and lbx-file-writer), and the new files are here:

https://drive.google.com/file/d/0B3mOBzjP3W1nMW9tYUQwcFdMWDQ/edit?usp=sharing

There are 3 new files (en, de and pt) with termination .lang for lack of a better choice. They contain the factored strings of the locales that use these 3 languages. I refrained from making any other changes to the files, so even things that are wrong (like the \addot on line 364 of portuguese.lbx) have been left as-is, so you can more easily check that nothing has been added or lost during this process. I tried to do some testing on my own, but that is harder for me: I included the files back (inside each other) and processed the examples, and they all seem to be fine. Hopefully you can test it better.

When I get a green-light on the files, I'll move to fix the things that are wrong with individual translations, etc ...

I can now analyze and compare the translations and more easily complete the terms that are missing and produce the files for new languages.

Paulo Ney

plk commented 10 years ago

There is a test suite in the git repository but it doesn't verify identical PDF output, just whether there were any errors. I'm not sure how many of the examples actually test language files though. I will probably find time to look next week.

pauloney commented 10 years ago

I am trying to build a testing sequence specific to the lbx files, starting from the examples, and there is one that does not run on TeXLive 2013:

03-localization-keys

with an error

(/usr/local/texlive/2013/texmf-dist/tex/latex/latexconfig/color.cfg
! Missing = inserted for \ifnum.
<to be read again>
                   \bgroup
l.13   }
        %
? e

pauloney commented 10 years ago

I have now loaded and processed the files from the development version (the previous set had the files of TeXLive 2013), and generated a new set of files that you can find at:

https://drive.google.com/file/d/0B3mOBzjP3W1nZmptam1lN1I2NkU/edit?usp=sharing

I tested it extensively ...

First on the number of strings in factored/non-factored files. They match the original set.
And then processing the examples with the old set and the new - they produce a set of PDF files that match exactly.
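The first of these checks, that factoring neither adds nor loses strings, can be expressed as a small comparison. The sketch below is an assumed reconstruction of such a check, not the test actually run: it flattens a factored file by following its inheritance chain and compares the result with the original unfactored strings. The data layout, the file names and the sample values are hypothetical.

```python
def resolve(name, files):
    """Flatten a factored file: inherited strings first, then local overrides."""
    entry = files[name]
    parent = entry.get("inherit")
    merged = resolve(parent, files) if parent else {}
    merged.update(entry["strings"])
    return merged

def differences(original, rebuilt):
    """Return {key: (original value, rebuilt value)} for every mismatch."""
    keys = original.keys() | rebuilt.keys()
    return {k: (original.get(k), rebuilt.get(k))
            for k in keys if original.get(k) != rebuilt.get(k)}

# Illustrative data only.
original_brazil = {"editor": ("editor", "ed\\adddot"), "countryir": ("Irã", "IR")}
factored = {
    "pt": {"strings": {"editor": ("editor", "ed\\adddot")}},
    "brazil": {"inherit": "pt", "strings": {"countryir": ("Irã", "IR")}},
}

rebuilt = resolve("brazil", factored)
print(len(rebuilt) == len(original_brazil), differences(original_brazil, rebuilt))
```

Comparing values as well as counts catches the case where the number of strings matches but one of them was changed along the way.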

So I assume these are ready to be checked in, as soon as you can provide a facility for one file to include the other, in the case of the 3 factored languages (en, pt, de).

No strings have been modified; this is just the modification of the lbx files due to the new framework, but where the original lbx files had incorrect syntax (missing commas, etc ..) this has been fixed. The most affected files were greek and catalan, but some other ones have been changed as well. If you want a short blurb to explain the changes at check-in time, it could be:

Files are now written from a db for correct syntax, string comparison, language factoring and translation development.

Then on the next releases I'll move to:

Correct strings that need work
Complete on-going translations
Develop a test-suite specifically for the lbx files
Release new language sets

PN

plk / biblatex

Use standardized language identifiers for lbx files #160