plk / biblatex

biblatex is a sophisticated bibliography system for LaTeX users. It has considerably more features than traditional bibtex and supports UTF-8

IAST Sanskrit Collation: Letters with diacritics are not sorted properly #765

Closed ppasedach closed 6 years ago

ppasedach commented 6 years ago

(Already noted here)

IAST-transliterated Sanskrit no longer sorts correctly. It appears that diacritics are stripped off somewhere, or that something else happens to them: a and ā, m and ṃ, h and ḥ, ś, ṣ and s, t and ṭ are mixed up in the example. I assume this happens to all diacritical combinations.

test_long.pdf test_long.tex.gz

moewew commented 6 years ago

Thank you for reporting this. Unfortunately I don't really know about the transliteration Biber stuff and @plk is a bit snowed under at the moment, so I can't promise a quick fix.

As far as I know Biber uses an external library to do the transliteration (Lingua::Translit), so this could be either a bug in the external library or a bug in how Biber handles the returned results for sorting.

I don't suppose you could check that the Lingua::Translit library returns the expected values?

Full text of the MWE

```latex
\documentclass{article}
\listfiles
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Latin Modern Roman}
\usepackage{fontspec}
\usepackage{biblatex}
\usepackage{filecontents}
\addbibresource{\jobname.bib}
\begin{filecontents*}{\jobname.bib}
@misc{aka, title = {aka}, }
@misc{aṃka, title = {aṃka}, }
@misc{aca, title = {aca}, }
@misc{aṃca, title = {aṃca}, }
@misc{aṭa, title = {aṭa}, }
@misc{aṃṭa, title = {aṃṭa}, }
@misc{ata, title = {ata}, }
@misc{aṃta, title = {aṃta}, }
@misc{apa, title = {apa}, }
@misc{aṃpa, title = {aṃpa}, }
@misc{aya, title = {aya}, }
@misc{aṃya, title = {aṃya}, }
@misc{ara, title = {ara}, }
@misc{aṃra, title = {aṃra}, }
@misc{ala, title = {ala}, }
@misc{aṃla, title = {aṃla}, }
@misc{ava, title = {ava}, }
@misc{aṃva, title = {aṃva}, }
@misc{aśa, title = {aśa}, }
@misc{aṃśa, title = {aṃśa}, }
@misc{aṣa, title = {aṣa}, }
@misc{aṃṣa, title = {aṃṣa}, }
@misc{asa, title = {asa}, }
@misc{aṃsa, title = {aṃsa}, }
@misc{aha, title = {aha}, }
@misc{aṃha, title = {aṃha}, }
@misc{aḥka, title = {aḥka}, }
@misc{aḥca, title = {aḥca}, }
@misc{aḥṭa, title = {aḥṭa}, }
@misc{aḥta, title = {aḥta}, }
@misc{aḥpa, title = {aḥpa}, }
@misc{aḥya, title = {aḥya}, }
@misc{aḥra, title = {aḥra}, }
@misc{aḥla, title = {aḥla}, }
@misc{aḥva, title = {aḥva}, }
@misc{aḥśa, title = {aḥśa}, }
@misc{aḥṣa, title = {aḥṣa}, }
@misc{aḥsa, title = {aḥsa}, }
@misc{aḥha, title = {aḥha}, }
@misc{Agnipurāṇa, title = {Agnipurāṇa}, }
@misc{Agniveśyagṛhyasūtra, title = {Agniveśyagṛhyasūtra}, }
@misc{Atharvavedapariśiṣṭa, title = {Atharvavedapariśiṣṭa}, }
@misc{Abhayapaddhati, title = {Abhayapaddhati}, }
@misc{Amoghapāśakalparāja, title = {Amoghapāśakalparāja}, }
@misc{Arthaśāstra, title = {Arthaśāstra}, }
@misc{Alaṃkārakārikā, title = {Alaṃkārakārikā}, }
@misc{Īśānaśivagurudevapaddhati, title = {Īśānaśivagurudevapaddhati}, }
@misc{Ṛgvidhāna, title = {Ṛgvidhāna}, }
@misc{Kalyāṇakāmadhenu, title = {Kalyāṇakāmadhenu}, }
@misc{Kiraṇatantra, title = {Kiraṇatantra}, }
@misc{Kubjikāmatatantra, title = {Kubjikāmatatantra}, }
@misc{Kuṭṭanīmata, title = {Kuṭṭanīmata}, }
@misc{Kṛṣṇayamāritantrapañjikā, title = {Kṛṣṇayamāritantrapañjikā}, }
@misc{Guhyasamājatantra, title = {Guhyasamājatantra}, }
@misc{Guhyasamājamaṇḍalavidhi, title = {Guhyasamājamaṇḍalavidhi}, }
@misc{Guhyasiddhi, title = {Guhyasiddhi}, }
@misc{Caṇḍamahāroṣaṇatantra, title = {Caṇḍamahāroṣaṇatantra}, }
@misc{Caṇḍamahāroṣaṇatantrapañjikā, title = {Caṇḍamahāroṣaṇatantrapañjikā Padmāvatī}, }
@misc{Chandaḥsaṃgraha, title = {Chandaḥsaṃgraha}, }
@misc{Chandaḥsāra, title = {Chandaḥsāra}, }
@misc{Jayākhyasaṃhitā, title = {Jayākhyasaṃhitā}, }
@misc{Jñānaratnāvalī, title = {Jñānaratnāvalī}, }
@misc{Jyotiḥsāra, title = {Jyotiḥsāra}, }
@misc{Tattvaratnāvalī, title = {Tattvaratnāvalī}, }
@misc{Tantrasadbhāva, title = {Tantrasadbhāva}, }
@misc{Tantrāloka, title = {Tantrāloka}, }
@misc{Divyāvadāna, title = {Divyāvadāna}, }
@misc{Derge, title = {Derge}, }
@misc{Nityādisaṅgrahābhidhānapaddhati, title = {Nityādisaṅgrahābhidhānapaddhati}, }
@misc{Niśvāsatattvasaṃhitā, title = {Niśvāsatattvasaṃhitā}, }
@misc{Niśvāsakārikā, title = {Niśvāsakārikā}, }
@misc{Parākhyatantra, title = {Parākhyatantra}, }
@misc{Pārameśvaratantra, title = {Pārameśvaratantra}, }
@misc{Pūrva-Kāmika, title = {Pūrva-Kāmika}, }
@misc{Pratiṣṭhālakṣaṇasārasamuccaya, title = {Pratiṣṭhālakṣaṇasārasamuccaya}, }
@misc{Brahmayāmalatantra, title = {Brahmayāmalatantra}, }
@misc{Bhairavapadmāvatīkalpa , title = {Bhairavapadmāvatīkalpa }, }
@misc{Mañjuśriyamūlakalpa, title = {Mañjuśriyamūlakalpa}, }
@misc{Mataṅgapārameśvarāgama, title = {Mataṅgapārameśvarāgama}, }
@misc{Mālinīvijayottaratantra, title = {Mālinīvijayottaratantra}, }
@misc{Muktāvalī, title = {Muktāvalī}, }
@misc{Mṛgendratantra, title = {Mṛgendratantra}, }
@misc{Bṛhatsaṃhitā, title = {Bṛhatsaṃhitā}, }
@misc{Rauravasūtrasaṅgraha, title = {Rauravasūtrasaṅgraha}, }
@misc{Laghutantraṭīkā, title = {Laghutantraṭīkā}, }
@misc{Laghuśaṃvaratantra, title = {Laghuśaṃvaratantra}, }
@misc{Vajrāvalī, title = {Vajrāvalī}, }
@misc{Vimalaprabhā, title = {Vimalaprabhā}, }
@misc{Vīṇāśikhatantra, title = {Vīṇāśikhatantra}, }
@misc{Śāradātilaka, title = {Śāradātilaka}, }
@misc{Śivatattvaratnākara, title = {Śivatattvaratnākara}, }
@misc{Sampuṭatantraprakaraṇārthanirṇaya, title = {Sampuṭatantraprakaraṇārthanirṇaya}, }
@misc{Sampuṭodbhavatantra, title = {Sampuṭodbhavatantra}, }
@misc{Sarvajñānottaratantra, title = {Sarvajñānottaratantra}, }
@misc{Sarvajñānottaravṛtti, title = {Sarvajñānottaravṛtti}, }
@misc{Sarvatathāgatatattvasaṅgraha, title = {Sarvatathāgatatattvasaṅgraha}, }
@misc{Sarvatathāgatādhiṣṭhānasattvāvalokanabuddhakṣetrasaṃdarśanavyūha, title = {Sarvatathāgatādhiṣṭhānasattvāvalokanabuddhakṣetrasaṃdarśanavyūha}, }
@misc{Sādhanamālā, title = {Sādhanamālā}, }
@misc{Sārdhatriśatikālottara, title = {Sārdhatriśatikālottara}, }
@misc{Siddhayogeśvarīmata, title = {Siddhayogeśvarīmata}, }
@misc{Siddhaikavīratantra, title = {Siddhaikavīratantra}, }
@misc{Saurasaṃhitā, title = {Saurasaṃhitā}, }
@misc{Svacchandatantra, title = {Svacchandatantra}, }
@misc{Svāyambhuvapāñcarātra, title = {Svāyambhuvapāñcarātra}, }
@misc{Svāyambhuvasūtrasaṅgraha, title = {Svāyambhuvasūtrasaṅgraha}, }
@misc{Harṣacarita, title = {Harṣacarita}, }
@misc{Hevajratantra, title = {Hevajratantra}, }
\end{filecontents*}

\DeclareSortTranslit{
  \translit[title]{iast}{devanagari}
}
\begin{document}
\nocite{*}
\printbibliography
\end{document}
```

ppasedach commented 6 years ago

Yes, in the link referenced above I have helped @plk create the IAST to Devanāgarī module for Lingua::Translit, and it has been already very useful for me also outside of biblatex for perl scripts converting e-texts. I will check if they still work as expected, or if some bug has crept in there since I last used them.

Is there a debugging possibility to have biber write the full transliterated strings to a file for inspection? I have seen only the sortinit fields in the bbl file containing some Devanāgarī, which I'm afraid is not enough to understand what happened.

Well, I have now inserted another item into the test bibliography, Ānanda, and as I feared it got mixed up with the "A" entries. The sortinit field is {अ̄}, which when copied into gedit looks like a short a with a bar over it. I suspect that the combination of a and ¯ was treated as two separate characters: first the a gets transliterated to the proper Devanāgarī short a, and then the diacritical mark is attached to that, which makes no sense in Devanāgarī. The sortinithash field is the same as for the regular short a.

I will now dig out my transliteration perl script and test it with a current version of Lingua::Translit, and let you know about the results.

ppasedach commented 6 years ago

I have tested my scripts using Lingua::Translit, their output so far seems correct. I have also updated the module, which appears to not have changed anything. The Sanskrit collation with biblatex is still broken.

moewew commented 6 years ago

Thank you for checking that. If it's not a Lingua::Translit problem, we will have to wait for PLK. You can run Biber with the --trace option and obtain a huge .blg file that may or may not contain a bit more info on what happens to sorting (I'm not sure).

plk commented 6 years ago

Hmm, I will check on this. This must be something to do with macro decoding changes. If you run biber with the --trace flag and search the .blg file for "Keys before sort", you will see the transliterated titles and see if they look right.

ppasedach commented 6 years ago

Yes, with --trace one can see what happens. The diacritical combinations are messed up: the base characters are transliterated into Devanāgarī, and the marks are then attached to their Devanāgarī equivalents. For example, Agnipurāṇa is transformed into अग्निपुर̄न्̣अ. My browser, or the font it uses, refuses to display these nonsensical combinations, so here is an image of how it looks: selection_107. Even if you don't read Devanāgarī you can recognise the bar above the "ra" and the dot above the "n", to which for some reason the virāma is added, followed at the end by the independent vowel a. It should actually be selection_108.
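Since fonts and browsers may hide or mangle such combinations, one way to spot them without relying on rendering is to look for Latin combining marks inside the transliterated string. The following is only a minimal Python sketch: the helper name `stray_latin_marks` is made up for illustration, and limiting the check to the Combining Diacritical Marks block (U+0300 to U+036F) is an assumption that happens to fit IAST input.

```python
import unicodedata

def stray_latin_marks(text):
    """Return (index, code point, Unicode name) for every character of text
    that lies in the Combining Diacritical Marks block (U+0300-U+036F)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if 0x0300 <= ord(ch) <= 0x036F
    ]

# The broken output quoted above, with the Latin marks written as escapes.
broken = "अग्निपुर\u0304न्\u0323अ"
# The expected transliteration of Agnipurāṇa.
expected = "अग्निपुराण"

for hit in stray_latin_marks(broken):
    print(hit)                       # one line per stray mark
print(stray_latin_marks(expected))   # empty list: no Latin marks left over
```

For the broken string this reports a COMBINING MACRON and a COMBINING DOT BELOW, the two marks described in the comment above.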

plk commented 6 years ago

This test file gives me the wrong output according to the above. Can you verify?

#!/opt/local/bin/perl -CS
use v5.24;
use Lingua::Translit;
use utf8;
my $t = new Lingua::Translit('IAST Devanagari');
say $t->translit('Agnipurāṇa');

अग्निपुर̄न्̣अ

moewew commented 6 years ago

I get the same output, but that could be an input issue. According to https://w3c.github.io/xml-entities/unicode-names.html the code snippet uses combining accents: U+0061 LATIN SMALL LETTER A with U+0304 COMBINING MACRON, and U+006e LATIN SMALL LETTER N with U+0323 COMBINING DOT BELOW. If one uses the precomposed characters U+0101 LATIN SMALL LETTER A WITH MACRON and U+1e47 LATIN SMALL LETTER N WITH DOT BELOW instead, one gets the expected output (if I understand correctly).

#!/opt/local/bin/perl -CS
use v5.24;
use Lingua::Translit;
use utf8;
my $t = new Lingua::Translit('IAST Devanagari');
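# decomposed input: a + U+0304, n + U+0323
say $t->translit("Agnipura\x{0304}n\x{0323}a");
# precomposed input: U+0101, U+1E47
say $t->translit("Agnipur\x{0101}\x{1e47}a");

plk commented 6 years ago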

Ah, ok, then it's a Unicode normalisation issue, looking into it.
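To make the normalisation point concrete, here is a small Python sketch using the standard `unicodedata` module. It only illustrates the difference between the decomposed and precomposed spellings of Agnipurāṇa and how NFC maps one onto the other. It does not show how Biber handles normalisation internally.

```python
import unicodedata

# Decomposed: a + COMBINING MACRON (U+0304), n + COMBINING DOT BELOW (U+0323)
decomposed = "Agnipura\u0304n\u0323a"
# Precomposed: ā (U+0101) and ṇ (U+1E47) as single code points
precomposed = "Agnipur\u0101\u1e47a"

# The two strings render identically but are different code point sequences.
print(decomposed == precomposed)           # False
print(len(decomposed), len(precomposed))   # 12 10

# NFC composes base letter + mark into the precomposed character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
# NFD goes the other way.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```

A transliteration table keyed on the precomposed characters would presumably not match the decomposed spelling, which is consistent with the broken output shown earlier in the thread.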

plk commented 6 years ago

Please try the biber 2.12 dev version from SF. For some reason calls to Lingua::Translit were not respected as an NFC boundary. I suspect this was due to another change to the macro encoding structure a while ago.

moewew commented 6 years ago

I can confirm that I now get a different order than before, but whether or not that is right is a question for @ppasedach.

duvud.pdf

ppasedach commented 6 years ago

This PDF already looks much better at a quick glance, but I am still a bit surprised that 1-26 are sorted before everything else, and 107 at the very end. This might follow another (Hindi?) sorting convention in its treatment of ṃ and ḥ, and in considering the ligature jñ a letter in its own right. I still have to look at it more carefully.

On Thu, Jun 28, 2018 at 4:00 PM, moewew notifications@github.com wrote:

I can confirm that I now get a different order than before, but whether or not that is right is a question for @ppasedach https://github.com/ppasedach duvud.pdf https://github.com/plk/biblatex/files/2145594/duvud.pdf


moewew commented 6 years ago

According to --trace the transliterations used are

[1260] Biber.pm:3976> DEBUG - aṃka => mm,,अंक,अंक,,0
[1260] Biber.pm:3976> DEBUG - aṃca => mm,,अंच,अंच,,0
[1260] Biber.pm:3976> DEBUG - aṃṭa => mm,,अंट,अंट,,0
[1260] Biber.pm:3976> DEBUG - aṃta => mm,,अंत,अंत,,0
[1260] Biber.pm:3976> DEBUG - aṃpa => mm,,अंप,अंप,,0
[1260] Biber.pm:3976> DEBUG - aṃya => mm,,अंय,अंय,,0
[1261] Biber.pm:3976> DEBUG - aṃra => mm,,अंर,अंर,,0
[1261] Biber.pm:3976> DEBUG - aṃla => mm,,अंल,अंल,,0
[1261] Biber.pm:3976> DEBUG - aṃva => mm,,अंव,अंव,,0
[1261] Biber.pm:3976> DEBUG - aṃśa => mm,,अंश,अंश,,0
[1261] Biber.pm:3976> DEBUG - aṃṣa => mm,,अंष,अंष,,0
[1261] Biber.pm:3976> DEBUG - aṃsa => mm,,अंस,अंस,,0
[1262] Biber.pm:3976> DEBUG - aṃha => mm,,अंह,अंह,,0
[1262] Biber.pm:3976> DEBUG - aḥka => mm,,अःक,अःक,,0
[1262] Biber.pm:3976> DEBUG - aḥca => mm,,अःच,अःच,,0
[1263] Biber.pm:3976> DEBUG - aḥṭa => mm,,अःट,अःट,,0
[1263] Biber.pm:3976> DEBUG - aḥta => mm,,अःत,अःत,,0
[1263] Biber.pm:3976> DEBUG - aḥpa => mm,,अःप,अःप,,0
[1263] Biber.pm:3976> DEBUG - aḥya => mm,,अःय,अःय,,0
[1263] Biber.pm:3976> DEBUG - aḥra => mm,,अःर,अःर,,0
[1263] Biber.pm:3976> DEBUG - aḥla => mm,,अःल,अःल,,0
[1263] Biber.pm:3976> DEBUG - aḥva => mm,,अःव,अःव,,0
[1263] Biber.pm:3976> DEBUG - aḥśa => mm,,अःश,अःश,,0
[1264] Biber.pm:3976> DEBUG - aḥṣa => mm,,अःष,अःष,,0
[1264] Biber.pm:3976> DEBUG - aḥsa => mm,,अःस,अःस,,0
[1264] Biber.pm:3976> DEBUG - aḥha => mm,,अःह,अःह,,0
[1265] Biber.pm:3976> DEBUG - aka => mm,,अक,अक,,0
[1265] Biber.pm:3976> DEBUG - Agnipurāṇa => mm,,अग्निपुराण,अग्निपुराण,,0
[1265] Biber.pm:3976> DEBUG - Agniveśyagṛhyasūtra => mm,,अग्निवेश्यगृह्यसूत्र,अग्निवेश्यगृह्यसूत्र,,0
[1265] Biber.pm:3976> DEBUG - aca => mm,,अच,अच,,0
[1265] Biber.pm:3976> DEBUG - aṭa => mm,,अट,अट,,0
[1266] Biber.pm:3976> DEBUG - ata => mm,,अत,अत,,0
[1266] Biber.pm:3976> DEBUG - Atharvavedapariśiṣṭa => mm,,अथर्ववेदपरिशिष्ट,अथर्ववेदपरिशिष्ट,,0
[1266] Biber.pm:3976> DEBUG - apa => mm,,अप,अप,,0
[1266] Biber.pm:3976> DEBUG - Abhayapaddhati => mm,,अभयपद्धति,अभयपद्धति,,0
[1266] Biber.pm:3976> DEBUG - Amoghapāśakalparāja => mm,,अमोघपाशकल्पराज,अमोघपाशकल्पराज,,0
[1266] Biber.pm:3976> DEBUG - aya => mm,,अय,अय,,0
[1266] Biber.pm:3976> DEBUG - ara => mm,,अर,अर,,0
[1266] Biber.pm:3976> DEBUG - Arthaśāstra => mm,,अर्थशास्त्र,अर्थशास्त्र,,0
[1266] Biber.pm:3976> DEBUG - ala => mm,,अल,अल,,0
[1266] Biber.pm:3976> DEBUG - Alaṃkārakārikā => mm,,अलंकारकारिका,अलंकारकारिका,,0
[1266] Biber.pm:3976> DEBUG - ava => mm,,अव,अव,,0
[1266] Biber.pm:3976> DEBUG - aśa => mm,,अश,अश,,0
[1266] Biber.pm:3976> DEBUG - aṣa => mm,,अष,अष,,0
[1267] Biber.pm:3976> DEBUG - asa => mm,,अस,अस,,0
[1267] Biber.pm:3976> DEBUG - aha => mm,,अह,अह,,0
[1267] Biber.pm:3976> DEBUG - Īśānaśivagurudevapaddhati => mm,,ईशानशिवगुरुदेवपद्धति,ईशानशिवगुरुदेवपद्धति,,0
[1267] Biber.pm:3976> DEBUG - Ṛgvidhāna => mm,,ऋग्विधान,ऋग्विधान,,0
[1267] Biber.pm:3976> DEBUG - Kalyāṇakāmadhenu => mm,,कल्याणकामधेनु,कल्याणकामधेनु,,0
[1267] Biber.pm:3976> DEBUG - Kiraṇatantra => mm,,किरणतन्त्र,किरणतन्त्र,,0
[1267] Biber.pm:3976> DEBUG - Kuṭṭanīmata => mm,,कुट्टनीमत,कुट्टनीमत,,0
[1267] Biber.pm:3976> DEBUG - Kubjikāmatatantra => mm,,कुब्जिकामततन्त्र,कुब्जिकामततन्त्र,,0
[1267] Biber.pm:3976> DEBUG - Kṛṣṇayamāritantrapañjikā => mm,,कृष्णयमारितन्त्रपञ्जिका,कृष्णयमारितन्त्रपञ्जिका,,0
[1267] Biber.pm:3976> DEBUG - Guhyasamājatantra => mm,,गुह्यसमाजतन्त्र,गुह्यसमाजतन्त्र,,0
[1267] Biber.pm:3976> DEBUG - Guhyasamājamaṇḍalavidhi => mm,,गुह्यसमाजमण्डलविधि,गुह्यसमाजमण्डलविधि,,0
[1267] Biber.pm:3976> DEBUG - Guhyasiddhi => mm,,गुह्यसिद्धि,गुह्यसिद्धि,,0
[1267] Biber.pm:3976> DEBUG - Caṇḍamahāroṣaṇatantra => mm,,चण्डमहारोषणतन्त्र,चण्डमहारोषणतन्त्र,,0
[1267] Biber.pm:3976> DEBUG - Caṇḍamahāroṣaṇatantrapañjikā => mm,,चण्डमहारोषणतन्त्रपञ्जिका पद्मावती,चण्डमहारोषणतन्त्रपञ्जिका पद्मावती,,0
[1268] Biber.pm:3976> DEBUG - Chandaḥsaṃgraha => mm,,छन्दःसंग्रह,छन्दःसंग्रह,,0
[1268] Biber.pm:3976> DEBUG - Chandaḥsāra => mm,,छन्दःसार,छन्दःसार,,0
[1268] Biber.pm:3976> DEBUG - Jayākhyasaṃhitā => mm,,जयाख्यसंहिता,जयाख्यसंहिता,,0
[1268] Biber.pm:3976> DEBUG - Jyotiḥsāra => mm,,ज्योतिःसार,ज्योतिःसार,,0
[1268] Biber.pm:3976> DEBUG - Tattvaratnāvalī => mm,,तत्त्वरत्नावली,तत्त्वरत्नावली,,0
[1268] Biber.pm:3976> DEBUG - Tantrasadbhāva => mm,,तन्त्रसद्भाव,तन्त्रसद्भाव,,0
[1268] Biber.pm:3976> DEBUG - Tantrāloka => mm,,तन्त्रालोक,तन्त्रालोक,,0
[1268] Biber.pm:3976> DEBUG - Divyāvadāna => mm,,दिव्यावदान,दिव्यावदान,,0
[1268] Biber.pm:3976> DEBUG - Derge => mm,,देर्गे,देर्गे,,0
[1268] Biber.pm:3976> DEBUG - Nityādisaṅgrahābhidhānapaddhati => mm,,नित्यादिसङ्ग्रहाभिधानपद्धति,नित्यादिसङ्ग्रहाभिधानपद्धति,,0
[1268] Biber.pm:3976> DEBUG - Niśvāsakārikā => mm,,निश्वासकारिका,निश्वासकारिका,,0
[1268] Biber.pm:3976> DEBUG - Niśvāsatattvasaṃhitā => mm,,निश्वासतत्त्वसंहिता,निश्वासतत्त्वसंहिता,,0
[1268] Biber.pm:3976> DEBUG - Parākhyatantra => mm,,पराख्यतन्त्र,पराख्यतन्त्र,,0
[1269] Biber.pm:3976> DEBUG - Pārameśvaratantra => mm,,पारमेश्वरतन्त्र,पारमेश्वरतन्त्र,,0
[1269] Biber.pm:3976> DEBUG - Pūrva-Kāmika => mm,,पूर्व-कामिक,पूर्व-कामिक,,0
[1269] Biber.pm:3976> DEBUG - Pratiṣṭhālakṣaṇasārasamuccaya => mm,,प्रतिष्ठालक्षणसारसमुच्चय,प्रतिष्ठालक्षणसारसमुच्चय,,0
[1269] Biber.pm:3976> DEBUG - Bṛhatsaṃhitā => mm,,बृहत्संहिता,बृहत्संहिता,,0
[1269] Biber.pm:3976> DEBUG - Brahmayāmalatantra => mm,,ब्रह्मयामलतन्त्र,ब्रह्मयामलतन्त्र,,0
[1269] Biber.pm:3976> DEBUG - Bhairavapadmāvatīkalpa => mm,,भैरवपद्मावतीकल्प,भैरवपद्मावतीकल्प,,0
[1269] Biber.pm:3976> DEBUG - Mañjuśriyamūlakalpa => mm,,मञ्जुश्रियमूलकल्प,मञ्जुश्रियमूलकल्प,,0
[1269] Biber.pm:3976> DEBUG - Mataṅgapārameśvarāgama => mm,,मतङ्गपारमेश्वरागम,मतङ्गपारमेश्वरागम,,0
[1269] Biber.pm:3976> DEBUG - Mālinīvijayottaratantra => mm,,मालिनीविजयोत्तरतन्त्र,मालिनीविजयोत्तरतन्त्र,,0
[1269] Biber.pm:3976> DEBUG - Muktāvalī => mm,,मुक्तावली,मुक्तावली,,0
[1269] Biber.pm:3976> DEBUG - Mṛgendratantra => mm,,मृगेन्द्रतन्त्र,मृगेन्द्रतन्त्र,,0
[1269] Biber.pm:3976> DEBUG - Rauravasūtrasaṅgraha => mm,,रौरवसूत्रसङ्ग्रह,रौरवसूत्रसङ्ग्रह,,0
[1270] Biber.pm:3976> DEBUG - Laghutantraṭīkā => mm,,लघुतन्त्रटीका,लघुतन्त्रटीका,,0
[1270] Biber.pm:3976> DEBUG - Laghuśaṃvaratantra => mm,,लघुशंवरतन्त्र,लघुशंवरतन्त्र,,0
[1270] Biber.pm:3976> DEBUG - Vajrāvalī => mm,,वज्रावली,वज्रावली,,0
[1270] Biber.pm:3976> DEBUG - Vimalaprabhā => mm,,विमलप्रभा,विमलप्रभा,,0
[1270] Biber.pm:3976> DEBUG - Vīṇāśikhatantra => mm,,वीणाशिखतन्त्र,वीणाशिखतन्त्र,,0
[1270] Biber.pm:3976> DEBUG - Śāradātilaka => mm,,शारदातिलक,शारदातिलक,,0
[1270] Biber.pm:3976> DEBUG - Śivatattvaratnākara => mm,,शिवतत्त्वरत्नाकर,शिवतत्त्वरत्नाकर,,0
[1270] Biber.pm:3976> DEBUG - Sampuṭatantraprakaraṇārthanirṇaya => mm,,सम्पुटतन्त्रप्रकरणार्थनिर्णय,सम्पुटतन्त्रप्रकरणार्थनिर्णय,,0
[1270] Biber.pm:3976> DEBUG - Sampuṭodbhavatantra => mm,,सम्पुटोद्भवतन्त्र,सम्पुटोद्भवतन्त्र,,0
[1270] Biber.pm:3976> DEBUG - Sarvatathāgatatattvasaṅgraha => mm,,सर्वतथागततत्त्वसङ्ग्रह,सर्वतथागततत्त्वसङ्ग्रह,,0
[1270] Biber.pm:3976> DEBUG - Sarvatathāgatādhiṣṭhānasattvāvalokanabuddhakṣetrasaṃdarśanavyūha => mm,,सर्वतथागताधिष्ठानसत्त्वावलोकनबुद्धक्षेत्रसंदर्शनव्यूह,सर्वतथागताधिष्ठानसत्त्वावलोकनबुद्धक्षेत्रसंदर्शनव्यूह,,0
[1270] Biber.pm:3976> DEBUG - Sarvajñānottaratantra => mm,,सर्वज्ञानोत्तरतन्त्र,सर्वज्ञानोत्तरतन्त्र,,0
[1271] Biber.pm:3976> DEBUG - Sarvajñānottaravṛtti => mm,,सर्वज्ञानोत्तरवृत्ति,सर्वज्ञानोत्तरवृत्ति,,0
[1271] Biber.pm:3976> DEBUG - Sādhanamālā => mm,,साधनमाला,साधनमाला,,0
[1271] Biber.pm:3976> DEBUG - Sārdhatriśatikālottara => mm,,सार्धत्रिशतिकालोत्तर,सार्धत्रिशतिकालोत्तर,,0
[1271] Biber.pm:3976> DEBUG - Siddhayogeśvarīmata => mm,,सिद्धयोगेश्वरीमत,सिद्धयोगेश्वरीमत,,0
[1271] Biber.pm:3976> DEBUG - Siddhaikavīratantra => mm,,सिद्धैकवीरतन्त्र,सिद्धैकवीरतन्त्र,,0
[1271] Biber.pm:3976> DEBUG - Saurasaṃhitā => mm,,सौरसंहिता,सौरसंहिता,,0
[1271] Biber.pm:3976> DEBUG - Svacchandatantra => mm,,स्वच्छन्दतन्त्र,स्वच्छन्दतन्त्र,,0
[1271] Biber.pm:3976> DEBUG - Svāyambhuvapāñcarātra => mm,,स्वायम्भुवपाञ्चरात्र,स्वायम्भुवपाञ्चरात्र,,0
[1271] Biber.pm:3976> DEBUG - Svāyambhuvasūtrasaṅgraha => mm,,स्वायम्भुवसूत्रसङ्ग्रह,स्वायम्भुवसूत्रसङ्ग्रह,,0
[1271] Biber.pm:3976> DEBUG - Harṣacarita => mm,,हर्षचरित,हर्षचरित,,0
[1271] Biber.pm:3976> DEBUG - Hevajratantra => mm,,हेवज्रतन्त्र,हेवज्रतन्त्र,,0
[1271] Biber.pm:3976> DEBUG - Jñānaratnāvalī => mm,,ज्ञानरत्नावली,ज्ञानरत्नावली,,0

do they look OK?
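For anyone who wants to inspect such trace output programmatically rather than by eye, a rough Python sketch follows. The regular expression is modelled only on the DEBUG lines quoted above; the function name `parse_trace_line` is hypothetical, the meaning of the individual comma-separated fields is not documented in this thread, and other Biber versions may format these lines differently.

```python
import re

# Pattern modelled on the quoted lines, e.g.
# [1265] Biber.pm:3976> DEBUG - aka => mm,,अक,अक,,0
TRACE_LINE = re.compile(
    r"^\[\d+\] Biber\.pm:\d+> DEBUG - (?P<key>.+?) => (?P<sortkey>.*)$"
)

def parse_trace_line(line):
    """Return (entry key, list of sort key fields), or None if the line
    does not look like one of the quoted sort key lines."""
    m = TRACE_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("key"), m.group("sortkey").split(",")

sample = "[1265] Biber.pm:3976> DEBUG - aka => mm,,अक,अक,,0"
print(parse_trace_line(sample))
print(parse_trace_line("some unrelated log line"))  # None
```

Applied to every line of the .blg file (opened with `encoding="utf-8"`), this gives the entries in the order Biber logged them, which can then be compared against an expected order.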

ppasedach commented 6 years ago

In this debugging output the IAST looks garbled: diacritics slide off their base letters onto the following ones. Compare the strings with those in the input file and you will see it. I haven't checked every entry, but in the few I looked at this seems to happen throughout. It does not seem to affect the Devanāgarī side of things, which on a cursory look seems o.k., apart from the sorting of ṃ, ḥ and jñ. For ṃ, I would expect aka to be sorted before aṃka etc., but ala and alaṃkāra° seem o.k. again. For jñ, I would expect the ligature to be treated as separate letters, not as a letter in its own right, probably coming in penultimate (?) position. I suspect that the ligature kṣ is likewise treated as one letter by the collation algorithm and sorted in last position, which at least for Sanskrit you would normally not want.

moewew commented 6 years ago

I think you can ignore the IAST. It is copied from the .blg file, and the garbling seems to be an artefact of how Biber writes Unicode to the .blg. In the .bbl the IAST is fine. Especially if the Devanāgarī is right, I think we can assume that the transliteration works now.

So the only issue left is sorting. I compared Biber's sorting below with various settings in http://anubhav-chattoraj.github.io/indic-tools/devanagari_sorter/

अंक
अंच
अंट
अंत
अंप
अंय
अंर
अंल
अंव
अंश
अंष
अंस
अंह
अःक
अःच
अःट
अःत
अःप
अःय
अःर
अःल
अःव
अःश
अःष
अःस
अःह
अक
अग्निपुराण
अग्निवेश्यगृह्यसूत्र
अच
अट
अत
अथर्ववेदपरिशिष्ट
अप
अभयपद्धति
अमोघपाशकल्पराज
अय
अर
अर्थशास्त्र
अल
अलंकारकारिका
अव
अश
अष
अस
अह
ईशानशिवगुरुदेवपद्धति
ऋग्विधान
कल्याणकामधेनु
किरणतन्त्र
कुट्टनीमत
कुब्जिकामततन्त्र
कृष्णयमारितन्त्रपञ्जिका
गुह्यसमाजतन्त्र
गुह्यसमाजमण्डलविधि
गुह्यसिद्धि
चण्डमहारोषणतन्त्र
चण्डमहारोषणतन्त्रपञ्जिका पद्मावती
छन्दःसंग्रह
छन्दःसार
जयाख्यसंहिता
ज्योतिःसार
तत्त्वरत्नावली
तन्त्रसद्भाव
तन्त्रालोक
दिव्यावदान
देर्गे
नित्यादिसङ्ग्रहाभिधानपद्धति
निश्वासकारिका
निश्वासतत्त्वसंहिता
पराख्यतन्त्र
पारमेश्वरतन्त्र
पूर्व-कामिक
प्रतिष्ठालक्षणसारसमुच्चय
बृहत्संहिता
ब्रह्मयामलतन्त्र
भैरवपद्मावतीकल्प
मञ्जुश्रियमूलकल्प
मतङ्गपारमेश्वरागम
मालिनीविजयोत्तरतन्त्र
मुक्तावली
मृगेन्द्रतन्त्र
रौरवसूत्रसङ्ग्रह
लघुतन्त्रटीका
लघुशंवरतन्त्र
वज्रावली
विमलप्रभा
वीणाशिखतन्त्र
शारदातिलक
शिवतत्त्वरत्नाकर
सम्पुटतन्त्रप्रकरणार्थनिर्णय
सम्पुटोद्भवतन्त्र
सर्वतथागततत्त्वसङ्ग्रह
सर्वतथागताधिष्ठानसत्त्वावलोकनबुद्धक्षेत्रसंदर्शनव्यूह
सर्वज्ञानोत्तरतन्त्र
सर्वज्ञानोत्तरवृत्ति
साधनमाला
सार्धत्रिशतिकालोत्तर
सिद्धयोगेश्वरीमत
सिद्धैकवीरतन्त्र
सौरसंहिता
स्वच्छन्दतन्त्र
स्वायम्भुवपाञ्चरात्र
स्वायम्भुवसूत्रसङ्ग्रह
हर्षचरित
हेवज्रतन्त्र
ज्ञानरत्नावली

I got consistently different results for ज्ञानरत्नावली/ Jñānaratnāvalī (Biber sorts it at the end, the quoted webpage at position 62 between जयाख्यसंहिता/ Jayākhyasaṃhitā and ज्योतिःसार/ Jyotiḥsāra) and सर्वज्ञानोत्तरतन्त्र/ Sarvajñānottaratantra सर्वज्ञानोत्तरवृत्ति/ Sarvajñānottaravṛtti (Biber sorts them after सर्वतथागततत्त्वसङ्ग्रह/ Sarvatathāgatatattvasaṅgraha and सर्वतथागताधिष्ठानसत्त्वावलोकनबुद्धक्षेत्रसंदर्शनव्यूह/ Sarvatathāgatādhiṣṭhānasattvāvalokanabuddhakṣetrasaṃdarśanavyūha the webpage before). So all of this seems to be only about

plk commented 6 years ago

Yes, don't worry too much about what is pasted here or what your text editor/terminal displays in the .blg unless you understand how it handles UTF-8 in terms of composed/decomposed form. What matters is the PDF output.

ppasedach commented 6 years ago

The website you linked gives the option to sort jñ as a separate letter (as well as kṣ and tr, for which I should add something to the example); activating it didn't make any difference. But this seems to be a sorting convention used by some people.

Here is a new, shorter example which confirms that kṣ is also sorted as a separate letter at the end, which, at least for Sanskrit, it should not be. tr is sorted in the proper place, so the problem has now boiled down to jñ and kṣ.

The sorting of ṃ and ḥ is o.k. as it is.

\documentclass{article}
\listfiles
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Latin Modern Roman}
\usepackage{fontspec}
\usepackage{biblatex}
\usepackage{filecontents}
\addbibresource{\jobname.bib}
\begin{filecontents*}{\jobname.bib}

@misc{kumāra,
title = {kumāra},
}

@misc{kṣetra,
title = {kṣetra},
}

@misc{kha,
title = {kha},
}

@misc{jīvita,
title = {jīvita},
}

@misc{jñāna,
title = {jñāna},
}

@misc{jvara,
title = {jvara},
}

@misc{tyāga,
title = {tyāga},
}

@misc{tridaśa,
title = {tridaśa},
}

@misc{tvid,
title = {tvid},
}

\end{filecontents*}

\DeclareSortTranslit{
  \translit[title]{iast}{devanagari}
}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

moewew commented 6 years ago

Do you have any source for the complete sorting rules that you would like to see applied?

If I understand correctly, Devanāgarī is a script, and scripts do not necessarily determine the sorting uniquely; language-specific rules have to be taken into account as well. Take for example the different sortings of Ö in Swedish and German.

See also Q16 What about collation of Indic language data? in http://unicode.org/faq/indic.html#16 and http://www.unicode.org/notes/tn1/ (https://www.unicode.org/notes/tn1/Wissink-IndicCollation.pdf), esp. p. 5

plk commented 6 years ago

As far as I can see, there are currently no alternative tailorings for sanskrit: https://metacpan.org/pod/Unicode::Collate::Locale#A-list-of-tailorable-locales

You might look at the references here to see which UCA sanskrit collation is being used and it is then possible to submit a request to the author of Unicode::Collate::Locale for alternative collations if they are available in the UCA.

moewew commented 6 years ago

With sortlocale=hi

\documentclass{article}
\listfiles
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Latin Modern Roman}
\usepackage{fontspec}
\usepackage[sortlocale=hi]{biblatex}
\usepackage{filecontents}
\addbibresource{\jobname.bib}
\begin{filecontents*}{\jobname.bib}

@misc{kumāra,
title = {kumāra},
}

@misc{kṣetra,
title = {kṣetra},
}

@misc{kha,
title = {kha},
}

@misc{jīvita,
title = {jīvita},
}

@misc{jñāna,
title = {jñāna},
}

@misc{jvara,
title = {jvara},
}

@misc{tyāga,
title = {tyāga},
}

@misc{tridaśa,
title = {tridaśa},
}

@misc{tvid,
title = {tvid},
}

\end{filecontents*}

\DeclareSortTranslit{
  \translit[title]{iast}{devanagari}
}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

gives caakkc.pdf

[1] kumāra. [2] kṣetra. [3] kha. [4] jīvita. [5] jñāna. [6] jvara. [7] tyāga. [8] tridaśa. [9] tvid.
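As a cross-check of this order, the nine titles can be sorted by plain code point, which treats the conjuncts kṣ and jñ as ordinary sequences of consonant + virāma + consonant. This is only an illustrative Python sketch: the Devanāgarī strings were typed by hand for this example, and code point order is not what Biber does (per the comments above, it relies on Unicode::Collate::Locale). For these particular strings, though, it happens to reproduce the order listed above.

```python
# IAST title -> Devanagari transliteration (hand-typed for this sketch)
titles = {
    "tvid": "त्विद्",
    "jñāna": "ज्ञान",
    "kha": "ख",
    "kṣetra": "क्षेत्र",
    "tridaśa": "त्रिदश",
    "kumāra": "कुमार",
    "jvara": "ज्वर",
    "tyāga": "त्याग",
    "jīvita": "जीवित",
}

# Python compares str values code point by code point, so this is a
# naive "binary" sort on the Devanagari form, not a Unicode collation.
by_codepoint = sorted(titles, key=titles.get)

for n, title in enumerate(by_codepoint, start=1):
    print(f"[{n}] {title} ({titles[title]})")
```

Here kṣetra sorts directly after kumāra and jñāna between jīvita and jvara, because क् and ज् followed by the second consonant compare as separate code points rather than as single letters.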

See also Q16 What about collation of Indic language data? in http://unicode.org/faq/indic.html#16 and http://www.unicode.org/notes/tn1/ (https://www.unicode.org/notes/tn1/Wissink-IndicCollation.pdf), esp. p. 5

moewew commented 6 years ago

@plk Would it make sense and be possible to enable the sortlocale option and possibly other options like sortcase and sortupper on a per-refcontext basis? What about the commands of §4.5.6 Sorting?

plk commented 6 years ago

Well, it already is because sorting is a refcontext argument and all of those things can be set as part of a sorting template ...

moewew commented 6 years ago

Oh yes, I hadn't seen the locale option to \DeclareSortingTemplate, sorry.

Can one do something similar for \DeclareSortExclusion and friends and \DeclareSortTranslit? The latter was actually why I'm asking. I guess it would make sense to have IAST-transliterated sources and other sources in different refcontext and I would only want to enable the conversion for the IAST-refcontext and not the normal context.

plk commented 6 years ago

Hmm, not trivial to do this. These options are inherently global as they are preamble only. Can you see any use-case for this? Such things seem very global ...

moewew commented 6 years ago

I can definitely see a use in restricting \DeclareSortTranslit. Suppose I have Indic and Latin references in the same document. I want my Indic references to follow IAST transliteration and then Sanskrit sorting, but naturally my Latin sources should follow the usual Latin sorting.

\documentclass{article}
\listfiles
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Latin Modern Roman}
\usepackage{fontspec}
\usepackage{biblatex}
\usepackage{filecontents}
\addbibresource{\jobname.bib}
\begin{filecontents*}{\jobname.bib}
@misc{kumāra,
title = {kumāra},
keywords = {indic},
}

@misc{kṣetra,
title = {kṣetra},
keywords = {indic},
}

@misc{kha,
title = {kha},
keywords = {indic},
}

@misc{jīvita,
title = {jīvita},
keywords = {indic},
}

@misc{jñāna,
title = {jñāna},
keywords = {indic},
}

@misc{jvara,
title = {jvara},
keywords = {indic},
}

@misc{tyāga,
title = {tyāga},
keywords = {indic},
}

@misc{tridaśa,
title = {tridaśa},
keywords = {indic},
}

@misc{tvid,
title = {tvid},
keywords = {indic},
}

@misc{aachen,
title = {Aachen},
}

@misc{augsburg,
title = {Augsburg},
}

@misc{arnhem,
title = {Arnhem},
}

@misc{avignon,
title = {Avignon},
}

@misc{aix-en-provence,
title = {Aix-en-Provence},
}

@misc{berlin,
title = {Berlin},
}

@misc{utrecht,
title = {Utrecht},
}

@misc{zeven,
title = {Zeven},
}
\end{filecontents*}

\DeclareSortTranslit{
  \translit[title]{iast}{devanagari}
}
\begin{document}
\nocite{*}
\printbibliography[keyword=indic]
\printbibliography[notkeyword=indic]
\end{document}

sorts my Latin sources in their nonsense Devanāgarī form.


From trace

[537] Biber.pm:3976> DEBUG - zeven => mm,,Zएवेन्,Zएवेन्,,0
[537] Biber.pm:3976> DEBUG - aachen => mm,,अअछेन्,अअछेन्,,0
[537] Biber.pm:3976> DEBUG - arnhem => mm,,अर्न्हेम्,अर्न्हेम्,,0
[537] Biber.pm:3976> DEBUG - avignon => mm,,अविग्नोन्,अविग्नोन्,,0
[537] Biber.pm:3976> DEBUG - utrecht => mm,,उत्रेछ्त्,उत्रेछ्त्,,0
[537] Biber.pm:3976> DEBUG - aix-en-provence => mm,,ऐx-एन्-प्रोवेन्चे,ऐx-एन्-प्रोवेन्चे,,0
[537] Biber.pm:3976> DEBUG - augsburg => mm,,औग्स्बुर्ग्,औग्स्बुर्ग्,,0

plk commented 6 years ago

Right, I see. I don't think a per-refcontext setting will fix this. What about an optional arg that makes transliteration apply only to entries with particular langids? I think this is probably the best solution.

moewew commented 6 years ago

Mhhh, yes, the example was a bit too sparse on that front. I could have started a new refcontext for the Latin bibliography and then it would work.

I feel that \DeclareSortTranslit (and \DeclareSortExclusion, \DeclareSortInclusion and \DeclarePresort) are intimately tied to sorting, and since that is essentially per-refcontext I thought it natural to have those settings per-refcontext (or per-sorting-template similar to locale ...) as well.

Per-langid sort translit would certainly solve this problem and I can't really think of a different setting where it would be inferior to per-refcontext translit. I still like the idea of per-refcontext, but since you will have to implement it I'll defer to your judgment.

plk commented 6 years ago

Please try biblatex dev 3.12 and biber dev 2.12. \translit now has a changed parameter sequence with an optional CSV list of langids to apply the \translit to. I think this is the best solution as transliteration applies to languages. It seems to fix your example in my tests and allows for transliterated and non-transliterated sorting in the same reference list.

moewew commented 6 years ago

The example works fine with 3.12/2.12 dev. Thank you very much.

```latex
\documentclass{article}
\listfiles
\usepackage{polyglossia}
\setdefaultlanguage{sanskrit}
\newfontfamily\sanskritfont{Latin Modern Roman}
\usepackage{fontspec}
\usepackage{biblatex}
\usepackage{filecontents}
\addbibresource{\jobname.bib}
\begin{filecontents*}{\jobname.bib}
@misc{kumāra,
title = {kumāra},
keywords = {indic},
langid = {hi},
}

@misc{kṣetra,
title = {kṣetra},
keywords = {indic},
langid = {hi},
}

@misc{kha,
title = {kha},
keywords = {indic},
langid = {hi},
}

@misc{jīvita,
title = {jīvita},
keywords = {indic},
langid = {hi},
}

@misc{jñāna,
title = {jñāna},
keywords = {indic},
langid = {hi},
}

@misc{jvara,
title = {jvara},
keywords = {indic},
langid = {hi},
}

@misc{tyāga,
title = {tyāga},
keywords = {indic},
langid = {hi},
}

@misc{tridaśa,
title = {tridaśa},
keywords = {indic},
langid = {hi},
}

@misc{tvid,
title = {tvid},
keywords = {indic},
langid = {hi},
}

@misc{aachen,
title = {Aachen},
}

@misc{augsburg,
title = {Augsburg},
}

@misc{arnhem,
title = {Arnhem},
}

@misc{avignon,
title = {Avignon},
}

@misc{aix-en-provence,
title = {Aix-en-Provence},
}

@misc{berlin,
title = {Berlin},
}

@misc{utrecht,
title = {Utrecht},
}

@misc{zeven,
title = {Zeven},
}
\end{filecontents*}

\DeclareSortTranslit{
  \translit[hindi]{*}{iast}{devanagari}
}
\begin{document}
\nocite{*}
\printbibliography[keyword=indic]
\printbibliography[notkeyword=indic]
\end{document}
```

moewew commented 6 years ago

You may want to change the example for \DeclareSortTranslit in the manual to use the new syntax.

plk commented 6 years ago

Technically, you are right about the other three global sorting macros. However, I would rather wait and see if anyone really needs this and can give a convincing example. I think it's unlikely that anyone needs to vary these things within a document, as that would mean that settings used in one part of the document explicitly did not work for other parts. These settings are fairly general and would usually apply to the whole document.

moewew commented 6 years ago

Fair enough.

Do you want me to fix up the example for \DeclareSortTranslit in the docs to use the new syntax or will you do that?

plk commented 6 years ago

It's done, just have to push it.

ppasedach commented 6 years ago

Some months back I installed the development version of biblatex into my ~/texmf/ tree. Now I want to make my project portable. Is it enough to keep biblatex.sty in the project directory and use biber 2.12, or do I need any other files from the development version as well?

moewew commented 6 years ago

@ppasedach The dev version is in flux, so I don't know which exact version you got. Assuming that everything works fine so far, you are probably good with only your version of biblatex.sty. The changes in other files are negligible for most intents and purposes at the moment, so even if you pulled the versions now, chances are that you would be OK with biblatex.sty and Biber only. But there is no guarantee, and a lot depends on which features you use.

All that said, I cannot recommend using the dev versions for production work. And I strongly recommend not disseminating the development versions to other people (I'm not sure if that is what you ultimately have in mind when you want to make your project portable).

ppasedach commented 6 years ago

The project is a book, being developed in a private repository, so no dissemination of biblatex's and biber's development versions to others apart from one more collaborator who hardly touches the LaTeX sources. The point of my question was just about being able to quickly move my book project to some other computer without needing to modify the TeX Live installation there. I am still using the development version from August, but could also update to a newer version, if that's advisable, but of course I also understand the point about better not using dev versions for production work. Or, has the bug fix been incorporated into the stable versions, or could that be done without much effort? Then of course I'd prefer to use the stable versions.

moewew commented 6 years ago

You should be fine with just biblatex.sty and Biber.

If things work for you on your current machines and on the other target machines as well, there is no need to get a newer development version. Of course that only holds if you can use your version of Biber and biblatex.sty on the other machines. If you are switching between Windows and Linux, for example, you need to make sure that your development versions of Biber are from the same snapshot (or at least compatible), which would probably mean that you would have to pull all involved binaries anew now; you would then also need the current dev biblatex.sty.
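
One way to confirm which biblatex.sty a machine actually picks up is a small test document. This is only a sketch, assuming the project-local biblatex.sty sits in the same directory as the file being compiled.

```latex
% Sketch only: a throwaway document to check the loaded biblatex version.
\listfiles                 % write a "*File List*" with dates/versions to the .log
\documentclass{article}
\usepackage{biblatex}      % a biblatex.sty in the current directory is normally found first
\begin{document}
Version check only.
\end{document}
```

After compiling, the file list at the end of the .log shows the date and version of the biblatex.sty that was loaded, and the load line earlier in the log shows the path it came from. For Biber, `biber --version` reports the version of the binary on each machine.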

There has not been an update to the release versions of either Biber or biblatex since February (v3.11/v2.11). That means that this bug fix is not available in the stable versions yet. Since this fix involves a new Biber version it can't be deployed as easily since that requires that the binaries be built (and not only for the three standard systems, but also for a few others), which usually takes some time and involves more people than just PLK. Given the recent developments I'd say we should push out a new version soon, but there are still a few things that would have to be taken care of before we can release it (no ETA for any of that at the moment, though).