Open BenjaminGalliot opened 3 months ago
I've been able to reproduce this. I needed to be on linux in a directory that is on an ext4 file system. It appears that when biber tries to open the .blg file, the name it uses is in NFD instead of NFC, even when the directory name is specified in the NFC form. This causes exactly the error message shown when the directory name contains an accented character.
The actual listing in the bug report of the error message ("Can't open résultats/test.blg") is in NFC, presumably because of something done by a pasting operation in a web browser.
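To make the NFC/NFD difference concrete, here is a minimal sketch in Python (not the Perl that biber itself uses). It relies only on the standard `unicodedata` module; the directory name is the one from the report, and the variable names are illustrative.

```python
import unicodedata

# The directory name from the report, written with an explicit escape
# so that the normalization form of this source file does not matter.
name_nfc = unicodedata.normalize("NFC", "r\u00e9sultats")
name_nfd = unicodedata.normalize("NFD", "r\u00e9sultats")

# Both render identically, but they are different strings.
print(name_nfc == name_nfd)           # False
print(len(name_nfc), len(name_nfd))   # 9 10

# The byte sequences handed to the OS differ as well.
print(name_nfc.encode("utf-8"))       # b'r\xc3\xa9sultats'
print(name_nfd.encode("utf-8"))       # b're\xcc\x81sultats'
```

On a normalization-sensitive file system such as ext4, these two byte sequences name different directory entries, which is consistent with the "Can't open" failure described above.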
On macOS with APFS (which is normalization-insensitive but normalization-preserving), when the directory name contains no accented character but the base name of the .tex file does, the name of the .blg file is in NFD. In contrast, the .bbl filename is in NFC. This assumes that the name of the .tex file is in NFC.
The version of biber is 2.20 (in TeXLive 2024).
On combinations of OS and file system (e.g., macOS and APFS) that are insensitive to Unicode normalization of filenames, latexmk invoked as in the bug report does not raise an error.
Looks like I forgot to NFC the filename. biber is all NFD internally and it should NFC everything on output, but it looks like this was missed. Can you try the biber 2.21 DEV version from SF?
I tried but I'm not familiar enough with Perl to be able to generate the executable to test from the sources! Sorry ! :sweat_smile:
I tried the 2.21.beta version, and it worked, provided that the file and directory names on linux were all NFC.
But on a Unicode-normalization-sensitive system, it now fails if the names aren't NFC. That's unlikely to be the case for most users in Western Europe, since when typing in characters, typical keyboard layouts give pre-composed characters, i.e., NFC. So they will create files with NFC names. At least if the files are created within the linux
However, on macOS, suppose I have a file or directory whose name is NFC. Then I rename the file in the Finder, without even touching the non-ASCII characters. After the rename, the name is NFD! I've seen complaints about that on the web. (Korean users seem to be particularly bothered.) Command line commands (mv, etc) don't have this problem.
Luckily, at least by default, the macOS and its file systems are Unicode-normalization insensitive, so this issue doesn't seem to be too big a deal for our purposes. But transferring the files to linux could cause all kinds of interesting anomalies! It might be useful to have a little script to rename all files and directories to have a particular normalization. Perhaps one already exists.
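A minimal sketch of such a renaming script, in Python, might look as follows. The function name `normalize_tree` and its parameters are hypothetical, not an existing tool. It has not been tested on macOS, and how a rename between normalization forms behaves on a normalization-insensitive file system is an assumption to verify before relying on it.

```python
import os
import unicodedata

def normalize_tree(root, form="NFC", dry_run=True):
    """Rename every file and directory below root so that its name is in
    the given Unicode normalization form.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed; the pairs are only reported.
    """
    renames = []
    # Walk bottom-up so that the contents of a directory are handled
    # before the directory itself is renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        present = set(dirnames) | set(filenames)
        for name in sorted(present):
            target = unicodedata.normalize(form, name)
            if target == name:
                continue  # already in the requested form
            if target in present:
                # Both forms exist side by side (possible on a
                # normalization-sensitive file system); do not overwrite.
                print(f"skipping {name!r}: {target!r} already exists")
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, target)
            renames.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return renames

# Example (hypothetical path): preview first, then rename.
# normalize_tree("project", "NFC", dry_run=True)
# normalize_tree("project", "NFC", dry_run=False)
```

The dry-run default is deliberate: on a normalization-sensitive system the rename changes what other programs must type to reach the file, so it is worth inspecting the list first.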
Pdflatex, at least on TeXLive 2024, preserves Unicode normalization from the .tex filename to the names of generated files, and the same applies to latexmk. I haven't tried this with xelatex and lualatex, but I would conjecture they have the same behavior.
Would it not be better for biber to preserve the normalization of what's on the command line, since that matches better the behavior of the other programs involved? (With latexmk I went through an initial phase of thinking the internal use of NFD would be a good idea; there are recommendations that that is the "correct" thing to do. But that led to a minefield of other complications, so I abandoned that.) What problems would that changed behavior lead to?
John
Well, you have to use NFD internally because there are lots of tricky things that have to be done with independent combining chars etc. I can, however, have a look at preserving filenames from the form of the .bcf file.
Please try 2.21 from SF again.
On 4/13/24 2:19 PM, plk wrote:
> Well, you have to use NFD internally because there are lots of tricky things that have to be done with independent combining chars etc. I can however, have a look at preserving filenames from the form of the .tex file.
For the textual content of things like the author fields in .bib files, I agree that the internal use of NFD is suitable. That's because for ordinary text, characters that differ by normalization are intended to be equivalent.
But for filenames, things are entirely different. The combinations of Windows with NTFS and FAT32, and linux with ext4 (and IIRC FAT32) are all normalization sensitive. E.g., in these cases it's perfectly possible to have two different files whose names are identical except for Unicode normalization.
So preservation of Unicode normalization of filenames is compulsory, as far as I can see. There are effectively two different worlds of strings: Those for ordinary text and those for filenames.
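The "two worlds" idea can be sketched in Python as below. This illustrates the policy argued for here, not how biber is implemented; `text_key` and `sibling_path` are hypothetical helper names.

```python
import os
import unicodedata

def text_key(s):
    """Comparison key for ordinary text (e.g. an author field):
    strings that differ only by normalization compare equal."""
    return unicodedata.normalize("NFD", s)

def sibling_path(path, new_ext):
    """Derive a related filename (e.g. .blg from .bcf) from the path
    exactly as given, without normalizing any part of it."""
    stem, _old_ext = os.path.splitext(path)
    return stem + new_ext

bcf_nfc = "r\u00e9sultats/test.bcf"
bcf_nfd = unicodedata.normalize("NFD", bcf_nfc)

# As text, the two spellings are equivalent ...
print(text_key(bcf_nfc) == text_key(bcf_nfd))    # True

# ... but as filenames each keeps its own form in the derived name.
blg_from_nfc = sibling_path(bcf_nfc, ".blg")
blg_from_nfd = sibling_path(bcf_nfd, ".blg")
print(blg_from_nfc == blg_from_nfd)              # False
```

With this split, a program can work in NFD internally for bibliographic text while still opening the exact path the user supplied.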
Of course, if you are typing filenames in Windows and linux, you are probably going to get only NFC filenames, at least with standard keyboard layouts for many Western European languages.
But it's easily possible to get NFD filenames if you generate the files on macOS and transfer by a normalization-preserving method (e.g., in a zip file from macOS to unix). That's because GUIs in macOS coerce filenames to NFD. That doesn't matter much on macOS, since by default it is insensitive to the Unicode normalization of filenames. But once the files are on linux or Windows, there are complications.
John
On 4/14/24 10:27 AM, plk wrote:
> Please try 2.21 from SF again.
Sorry, but it doesn't work. I see at least the following problems
I can reproduce this kind of situation in a Perl script if I do
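use utf8;
use Encode qw(encode);
my $orig = 'NFCé';                     # character string (source file is UTF-8)
my $enc1 = encode( 'UTF-8', $orig );   # correct UTF-8 byte string
my $enc2 = encode( 'UTF-8', $enc1 );   # bytes encoded a second time (double encoding)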
The string $enc1 has the correct UTF-8 encoding of the original string. The string $enc2 is doubly encoded: its content is the UTF-8 encoding of 'NFCÃ©', i.e., of the bytes of $enc1 treated as Latin-1 characters.
The same error occurs in the .blg file for the strings giving the name of the .bbl file. I've attached a zip file containing an example.
If the OS is linux, and the bcf file is in a directory named résultats, this version of biber, just like 2.20, still tries to write a .blg file whose name is the NFD version of what it should be writing. That gives a fatal error, since there is no directory whose name is the NFD version of 'résultats'.
John
[Attachment: biber-issue.zip]
I will have a look - I suspect that the log4perl module is doing some normalisation which isn't obvious, as it's that which creates the .blg.
Can you please try 2.21 dev again from SF?
Hello,
It seems that the latest version of biber has problems with some Unicode characters in the path (outdir of latexmk). Strangely, not all Unicode characters have this problem, and John Collins was unable to reproduce this behavior on his system.
I'm on Manjaro Linux, with the latest version of TeX Live 2024 (updated yesterday). Neither the 2023 version nor the 2024 version from the very beginning of the year had this problem; it appeared when I updated everything yesterday.