plk / biber

Backend processor for BibLaTeX
Artistic License 2.0

Sort non-Arabic numbers #284

Closed. moewew closed this issue 2 years ago.

moewew commented 5 years ago

Inspired by https://tex.stackexchange.com/q/507075/35864.

The font used in the MWE is available for download at https://www.google.com/get/noto/#sans-deva. It is enough to unpack NotoSansDevanagari-Regular.ttf into the same folder as the MWE; no installation is required.

Run the MWE with XeLaTeX.

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Script=Devanagari,Mapping=devanagarinumerals]{NotoSansDevanagari-Regular.ttf}
\usepackage[style=authoryear, backend=biber]{biblatex}
\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@book{अर्जुनवाडकर,
  title     = {मराठी व्याकरण : वाद आणि प्रवाद},
  author    = {अर्जुनवाडकर, कृष्ण},
  year      = {१९८७},
  publisher = {पुणे : सुलेखा प्रकाशन}
}
\end{filecontents}
\addbibresource{\jobname.bib}

\begin{document}
\cite{अर्जुनवाडकर}
\printbibliography
\end{document}

Biber complains about the year field:

[250] Utils.pm:300> WARN - year field '१९८७' in entry 'अर्जुनवाडकर' is not an integer - this will probably not sort properly.

The power of Wikipedia translation (https://hi.wikipedia.org/wiki/%E0%A5%A7%E0%A5%AF%E0%A5%AE%E0%A5%AD) shows that १९८७ is actually just 1987. Indeed these numerals have a one-to-one correspondence to Arabic numerals (१=1, ९=9, ८=8, ७=7; https://en.wikipedia.org/wiki/Indian_numerals), so it seems desirable to sort them just like Arabic integers. Would it be possible to have Biber do that?

u-fischer commented 5 years ago

I have the impression that problems/questions about non-Arabic number systems are popping up in various places. Index generation fails if the page numbers should be in (real) Arabic digits, polyglossia seems to implement a \Localnumber command (https://github.com/latex3/latex3/issues/616), and now here ;-)

I'm not sure if it is a good idea to allow the input year = {१९८७}. Even if biber can sort it, what about biblatex? I'm rather certain that there are macros somewhere that use the year as a number. Wouldn't it be better to do the number formatting later (perhaps dependent on the langid field)?

plk commented 5 years ago

I'm not sure about this either - since every .bib item would have to use such a script (mixed-script years in the same bibliography make no sense), it should really be a formatting issue. Also, we generally deprecate YEAR in favour of DATE, which is an ISO-format field anyway.
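For reference, the ISO-format DATE field mentioned here takes plain Arabic digits, so the entry from the MWE above could be rewritten as follows (a minimal sketch; only the year/date field changes):

@book{अर्जुनवाडकर,
  title     = {मराठी व्याकरण : वाद आणि प्रवाद},
  author    = {अर्जुनवाडकर, कृष्ण},
  date      = {1987},
  publisher = {पुणे : सुलेखा प्रकाशन}
}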

dcpurton commented 5 years ago

With harflatex, it will be possible to use babel to deal with changing the numbering automatically via its mapdigits option:

\documentclass{article}
\usepackage{harfload}
\usepackage{babel}
\babelprovide[import=mr,mapdigits,main]{marathi}
\babelprovide[import=en,language=Default]{english}
\babelfont[marathi]{rm}[RawFeature={mode=harf}]{Noto Sans Devanagari}
\babelfont[english]{rm}{Latin Modern Roman}
\usepackage{csquotes}
\usepackage[style=authoryear, backend=biber]{biblatex}
\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@book{अर्जुनवाडकर,
  title     = {मराठी व्याकरण : वाद आणि प्रवाद},
  author    = {अर्जुनवाडकर, कृष्ण},
  year      = {1987},
  publisher = {पुणे : सुलेखा प्रकाशन}
}
\end{filecontents}
\addbibresource{\jobname.bib}
\begin{document}
\cite{अर्जुनवाडकर}
\printbibliography
\end{document}

u-fischer commented 5 years ago

You don't need harf mode for mapdigits; it works with lualatex too (but harf mode is probably required for the Devanagari script).
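For illustration, a minimal sketch of the plain LuaLaTeX variant (simply dropping the harfload package and the harf-mode font feature from the example above; as noted, shaping of the Devanagari text itself may still need harf mode):

\documentclass{article}
\usepackage{babel}
\babelprovide[import=mr,mapdigits,main]{marathi}
\babelprovide[import=en,language=Default]{english}
\babelfont[marathi]{rm}{Noto Sans Devanagari}
\babelfont[english]{rm}{Latin Modern Roman}
\begin{document}
% compile with LuaLaTeX; marathi is the main language, so mapdigits
% converts the ASCII digits below to Devanagari digits in the output
1987
\end{document}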

moewew commented 5 years ago

Mhhh, on the one hand I agree that it would be nicer for some applications (especially those involving arithmetic) if all numbers were to end up on the LaTeX side as Arabic numerals. On the other hand I think it would be nice if people were able to use the input they prefer for numerals if there is a simple one-to-one correspondence that can be exploited.

I personally also think that it would not be unacceptable to have works with Indic years and Arabic years listed in the same bibliography.


Given all that I'm not too sure I feel strongly about this any more, so I'll be happy with any decision you might want to make here.

u-fischer commented 5 years ago

I think that if the input १९८७ is allowed it should be treated as an unformatted "pure" number and not imply some output format - which means that the bbl should contain 1987 and additional formatting commands will be needed anyway.
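For what it's worth, the mapping approaches already shown in this thread do exactly that kind of output-side formatting. A minimal XeLaTeX sketch, assuming the same NotoSansDevanagari-Regular.ttf from the original MWE is in the working directory:

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Script=Devanagari,Mapping=devanagarinumerals]{NotoSansDevanagari-Regular.ttf}
\begin{document}
% the source (or a .bbl) carries plain ASCII digits; the devanagarinumerals
% mapping renders them with Devanagari digit glyphs at typesetting time
1987
\end{document}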

niruvt commented 5 years ago

I don't have sufficient knowledge of core TeX coding (or, actually, any kind of programming), and I don't know whether this is the right place to ask, but I have a genuine question. Why, in this Unicode era, are only 0-9 considered integers in any sort of programming? I am enthusiastic about making my script programming-friendly. What should I aim for to make that happen? (By programming-friendly I mean it should be treated as equally capable as ASCII characters.) Why are ASCII characters "normal" and characters from my script (or any other Unicode-based script) "special"? Trust me, I'm not just venting. I'm genuinely trying to understand the functions of ASCII characters. What should be done to attain the same functions for my script?

PS - I am not happy with any kind of 'assistant programmes' to convert ASCII integers to the 'number-characters' (as the coding understands them) of my script. I want the characters to attain the status of integers wherever they are typed. This initiative will increase and strengthen the use of native scripts and ultimately strengthen the native languages. I am sorry if this kind of comment was not expected here.

u-fischer commented 5 years ago

@NiranjanTambe I don't think that it would be manageable if lots of number systems were allowed generally. It is already difficult to handle that some people write decimal numbers with a period (3.14) and others (like me) with a comma (3,14). E.g. try to imagine how it should work if you mix the various number types, if some of them are not base 10, or if some run in the other direction. And if someone nevertheless devised a working system: numbers are used in so many places nowadays that the cost of changing the handling and interfaces everywhere would be quite high. A standard number system has a lot of advantages.

niruvt commented 5 years ago

How are integers defined in general? Can't we assign the same features (with the same values) to the different characters which have no role in the scenario now? E.g. the "value" and "function" of 1 shall always be the same as १, but at least the visual form should be liberal and not Latin-centric, is what I feel. My question was regarding the dominance (or the monopoly) of ASCII numerals. XeLaTeX, LuaLaTeX and Python 3 are some good examples where people have tried to decrease the dominance of ASCII, but still, when it comes to numerals, it's again ASCII everywhere... Can't we think of trying to abstract out the functions and values of numerals from their visual forms?

u-fischer commented 5 years ago

@NiranjanTambe Well, your १ has its own Unicode position (U+0967). You would have to change this if you want it to be simply a different-looking 1. So write to those responsible for Unicode ;-).

dcpurton commented 5 years ago

@u-fischer, yes, the harf mode was for the benefit of the script in this case.

moewew commented 5 years ago

@NiranjanTambe I guess it's mostly a matter of history. The development of programming languages and computers as they are used today was mainly driven by Western (and mostly even English-speaking) people and institutions, so there is an inherent bias towards Western/English script, language and customs in those systems. Historically, there was no Unicode (certainly when TeX was invented), computers only supported a very limited set of characters (maybe something like US-ASCII) and that's what many programming languages fundamentally still use today. (See https://en.wikipedia.org/wiki/English_in_computing)

If you wanted programming languages to support other scripts today that would quite probably be possible in general (not sure about TeX specifically and how easy it would be to bolt-on Unicode support in the way you imagine, but if one were to start a re-implementation with more modern Unicode-aware tools and languages it may well be much easier). But for many languages there would be questions about backwards compatibility, support on other (older or simpler) systems etc etc. And then there would be the question of whether or not a project like this is 'worth it'. For programming languages like C it probably doesn't bring a huge improvement to just allow Indic numerals instead of Arabic numerals and leave the rest as is: You'd still be dealing with a language rooted in English and the function names would come from the English language. I don't know about the merits of 'translating' the entire programming language into a different 'base language', say French to produce French C or Hindi C. (But see https://en.wikipedia.org/wiki/Non-English-based_programming_languages.)

I'm genuinely interested, would you say that accepting numbers like १९८७ as integers while still requiring all other function names, macros etc. to be in 'English' in a programming language would be an improvement?

For TeX the question is perhaps more intricate, since TeX is a typesetting language. You can do programming, where you probably expect English-based terminology, but you also produce document output, where one would hope that as many of the world's languages as possible are supported. Things become tricky when the programming and typesetting parts are used so closely together that you want or need the programming bit to accept non-English terminology.

niruvt commented 5 years ago

@moewew No and yes! Developing macros and functions in languages other than English is the job of the communities concerned, but I would say that giving such developments a chance could be a great initiative by the programming world. So for your last question I would say yes, it is an improvement for me, because then my community can rely on our own script (and ultimately on our own language) for the most advanced functions of this era.

I was trying hard not to bring the socio-political situation of the languages into this discussion, but this might be the point where I should express the reasons for this concern. In many non-English-speaking countries (including ours) native languages are considered "low" in language education, as the higher functions of the languages are not accessible to them, and English, which is not our native language, is the language our future generations learn to get jobs, compete with the world and what not... Accepting ०-९ as integers and अ-ळ as regular characters in the world of programming will encourage people to also focus on their languages, make them computer-friendly and believe that our languages can also function like English.

So now, is it really 'worth it'? Yes! It is. I also said no, because this is not the 'complete' picture I'm dreaming of. Non-English-speaking communities will have to struggle hard to achieve all of it, but as I said earlier, acceptance can be a starting point.

u-fischer commented 5 years ago

@NiranjanTambe I'm on the LaTeX team. Currently the active members are English, German, Italian, French, Spanish and Brazilian. Do you think it would improve LaTeX for you if we all coded and documented in our native languages? Would you like it if biblatex and tikz were in German and the babel commands and documentation in Spanish?

niruvt commented 5 years ago

@u-fischer I think it would improve LaTeX if these communities were 'also' allowed to program in their languages. Nobody is going to stop English computing. (Even I don't wish to stop it.) I am just trying to make some space for other languages in computing. Don't take it wrong. You said '१' has its own Unicode position (U+0967), but so has '1' (U+0031)! The problem with १ is that it is not an ASCII character. That's what troubles me. Don't you think LaTeX would be more powerful if it exceeded the limits of English? A few people will have to work hard on the job of translation, but that will be tremendously useful for crores of people who might not wish to learn English.

moewew commented 5 years ago

@u-fischer makes an extremely good point that I was going to make as well: Programming languages are not only tools to communicate with a computer what you want it to do, they are also ways to communicate with other people (other people may want to or have to read and understand your source code to do things of their own or make use of your work). Fragmentation into different sublanguages and scripts would make communication on the whole a lot harder.

I just imagined that the MWE from above had been in Devanagari script: I wouldn't have been able to make heads or tails of it. A common lingua franca is extremely useful.

Now of course an argument could be made that for personal projects the 'base language' does not matter and it might help people to use their native script and language. I'm not entirely convinced by that, because I believe that sooner or later one is bound to communicate with other people or is going to come into contact with 'untranslated' bits of the programming language, but I can see why people who feel differently about the English language than I do would think differently.

niruvt commented 5 years ago

I never said that programming languages are just tools to communicate with computers. In fact, I understand that programming languages have a very important function of communicating with the people who use them, and therefore I wish to extend this function to other languages too.

One point I'd like to emphasize is the status of English. I am not trying to deny its importance. Let English be the lingua franca of programming languages; I'm questioning the monopoly. There are only two options available to people from non-English-speaking developing countries.

  1. Learn English and secure yourself.
  2. Speak your own language without having a place in this globalized world.

What if we had several LaTeXs, several Pythons, several Cs? People would not be forced to understand English for computation. Not just for personal use, but for many social reasons, heteroglossic computation is very useful. People may or may not use English, but it will be a choice wisely made by them.

Let's take an example of heteroglossic LaTeX. You might not know what a pothi is. It is a style of religious books written in ancient India, but Indians know what it is. If LaTeX were also present in Hindi and there were a document class named pothi, Indians would happily use it. As it stands, if we want to have it, we have to program commands in English and in Latin script. Who is going to use such provisions outside India? It is a completely intra-cultural thing. People may use regular English LaTeX for writing articles in journals or writing books. People may also use Hindi LaTeX for writing Hindi or for writing English.

What stops us from being so open? Programming will definitely spread more widely and be used more widely if it extends its borders. Yes, we'll need many StackExchanges, many GitHubs, but why not? Why is programming still in the hands of English-knowing communities? I am concerned with its distribution, not with its existence!

Aerijo commented 5 years ago

Why is programming still in the hands of English-knowing communities? I am concerned with its distribution, not with its existence!

Is this rhetoric? Because I don't think anyone is hoarding programming for themselves. I imagine it is because the English-based programs became popular, and the people maintaining the popular languages did not wish to fragment or overcomplicate communities and tools. There is no conscious effort stopping programming languages based on other languages from existing. People can choose to use and make whatever they like. Programming as an idea is not in the hands of anyone. There is nothing stopping it from being 'open', besides lack of popularity for the idea.

Also, supporting multiple languages (or any change) is not free; there is a maintenance cost. Biblatex is a free tool, and unnecessary features can waste the limited time and effort that could be used on other things. This cost also extends to any tools that want to work with biblatex. I know mine will not support alternative language inputs, even if it is ultimately supported here. It’s just not worth my time implementing and maintaining.

As others have said earlier, supporting output in the language of choice is a necessary feature for this kind of tool. But supporting alternative input, especially in a field that is to be parsed and used to control other parts, is less useful.

plk commented 5 years ago

Apart from the theoretical discussion: Biber 2.14 DEV now implements general Unicode numeric conversion for sorting purposes. This means that things like:

% 1985 Roman numerals
@BOOK{test1,
  YEAR = {MCMLXXXV}
}

% 1987 in Sanskrit
@BOOK{test2,
  YEAR = {१९८७}
}

will actually sort correctly. There are internal Unicode facilities in Perl which allow a completely general detection and conversion of numerics from arbitrary scripts, with the caveat that the numerics must all come from the same script. This could be relaxed, but there is no reason to - it is in fact regarded as a security feature in Unicode ("Script Run" detection). Note that this only works with the YEAR (and MONTH) fields, because DATE is parsed according to the ISO standard, which has no provision for alternate scripts. It would be relatively easy to allow other scripts for DATE as a pre-parse step, as long as the numeric association was effectively a transliteration of the ISO format. Thoughts on that?

Effectively, this means we can support Unicode even in integer sorting for, for example, fully Sanskrit bibliographies, which does seem reasonable.

With respect to actual programming languages which allow Unicode (for the actual language, not the data), see Perl 6 - you can actually change the language and write code in arbitrary Unicode scripts, so that theoretically you could have an entire program in Sanskrit. Not sure how much use this would generally be in the culture of cut-and-paste GitHub programming though ...

u-fischer commented 5 years ago

@plk what does "the numerics must all come from the same script" mean? In one entry, or in all year fields everywhere in the complete .bib (in all .bibs) used by the document?

moewew commented 5 years ago

Cool. Does that only apply to year and month or also to other fields declared as integer? How would it interface with noroman?

It would only make sense to transliterate date if people actually think this is a useful input format.

@NiranjanTambe Would it be natural to give the date 15. January 1987 in ISO format (1987-01-15) as १९८७-०१-१५ or would you go for the ASCII 1987-01-15?

plk commented 5 years ago

In one field/string - so it's barely a limitation at all. See the comments on the /u modifier here and the section on "Script Runs" on the same page: perldoc.perl.org/5.30.0/perlre.html

u-fischer commented 5 years ago

What is then the output in the .bbl? Are the years normalized there or are the scripts preserved?

plk commented 5 years ago

@u-fischer - the .bbl is not changed - it outputs what the user input. It's just the internal sorting data that has the converted numerics to allow proper comparison.

plk commented 5 years ago

@moewew - it's all integer fields. noroman prevents the numeric conversion of anything detected as roman for sorting purposes.

u-fischer commented 5 years ago

@plk but then we are back at the problem I mentioned above: if some biblatex code tries to compare the years or months as numbers, it will now fail.

plk commented 5 years ago

True but there is no functional change so far, just sorting works. What we would need to do is to output a normalised numeric field for comparisons.

moewew commented 5 years ago

@u-fischer But wouldn't that code have had the same issue without this change? This is purely about sorting. I can sort of see the appeal in producing normalised output, but on the other hand it might be counter-intuitive, if not downright frustrating, for users to have some parts of their input changed (from Indic to Arabic numerals, for example) without them being able to do anything about it. (Take the number of people complaining about the URI encoding for url.)

u-fischer commented 5 years ago

@plk @moewew Well yes, but currently you get a warning that the input format is wrong, and if comparison code fails one can rightly say that it is because of erroneous input. But now you are actually changing the data format. If MCMLXXXV is to be a valid input, then one should think about what this means on the output side.

E.g.

plk commented 5 years ago

I can easily add a Biber option to normalise integer fields to Arabic, but I suspect it would be better to auto-output normalised fields with a naming convention, which comparison macros would then check first if they exist.

The year/date conflicts are handled anyway with a warning as you shouldn't have both in the same entry.
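As a purely hypothetical illustration of such a naming convention (the normalised field name below is invented for the sake of discussion, it is not existing Biber output), the .bbl could carry both the literal value and a normalised Arabic form, and comparison macros could prefer the latter when it exists:

% hypothetical .bbl excerpt - 'yearnormalised' is an invented name
\field{year}{१९८७}
\field{yearnormalised}{1987}

% a comparison macro could then check for the normalised field first
\iffieldundef{yearnormalised}
  {\thefield{year}}
  {\thefield{yearnormalised}}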

moewew commented 5 years ago

Maybe it's time to revive the multiscript project ;-)?

I guess a normalised field version that people can use if they are concerned about integers would be a good solution.

plk commented 5 years ago

I would like to sort out multiscript but it's hard. It's not too horrible to design an extended .bib format for this, but it gets really nasty when you have multiple scripts in biblatex: comparison macros become exciting, as you have to decide which script variants to compare. Choosing script variants for printing becomes messy too. If we can think of a way to do this, I would be happy, but I remember with fear the months I spent on this several years ago.

In the interim, I think I will try to make date parsing work with script dates just so sorting works.

plk commented 5 years ago

There is something implemented in 2.14 DEV now as a basis for discussion. It is now possible to do normal ISO date things with numerically equivalent scripts, e.g.:

  DATE = {१९८७-०१-१५/१९८८-०५-११}

and this will sort correctly and be output as given in the .bbl. It would be easy to output normalised Arabic versions too, but this probably falls within the scope of a larger discussion about the formats/interfaces for proper multiscript support.

More work needs to be done on generalising, as the standard Unicode interfaces don't really deal well with things like Chinese numbers, because they don't form decimal sequences the way Arabic and Sanskrit numerals do.
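To make the DATE example above concrete, it could be dropped into the entry from the original MWE like this (a sketch assuming Biber 2.14+, treating a bare year the same way as the date range above):

@book{अर्जुनवाडकर,
  title     = {मराठी व्याकरण : वाद आणि प्रवाद},
  author    = {अर्जुनवाडकर, कृष्ण},
  date      = {१९८७},
  publisher = {पुणे : सुलेखा प्रकाशन}
}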

plk commented 4 years ago

Anything more to do on this?

niruvt commented 3 years ago

@plk Kindly have a look at this comment. (Note that you might need to install marathi.lbx to see the expected output.) As noted in the next comment, the issue goes away when Latin numerals are used. Is it possible to support Devanagari fully with Biber, so that any Devanagari numeral can be interpreted as an integer?

plk commented 3 years ago

@NiranjanTambe - just to be clear, the issue with series was that it didn't sort correctly?

niruvt commented 3 years ago

Nope, it is a different issue. I'll post it on another thread.