plk / biber

Backend processor for BibLaTeX
Artistic License 2.0

« Wide character in die at -e line 624 » with some Unicode characters in outdir’s path #474

Open BenjaminGalliot opened 8 months ago

BenjaminGalliot commented 8 months ago

Hello,

It seems that the latest version of biber has problems with some Unicode characters in the path (outdir of latexmk).

Strangely, not all Unicode characters have this problem, and John Collins was unable to reproduce this behavior on his system.

I'm on Linux Manjaro, with the latest version of Texlive 2024 (updated yesterday). The 2023 version, and the 2024 version at the very beginning of the year did not have this problem, which appeared when I updated everything yesterday.

Rc files read:
  NONE
Latexmk: This is Latexmk, John Collins, 31 Jan. 2024. Version 4.83.
Latexmk: making output directory 'resultats'
Latexmk: Doing main (small) clean up for 'test.tex'
No existing .aux file, so I'll make a simple one, and require run of *latex.
Force everything to be remade.
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Category 'other':
  Rerun of 'lualatex' forced or previously required:
    Reason or flag: 'go_mode'

------------
Run number 1 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="resultats"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file 'resultats/test.log'
Latexmk: Examining 'resultats/test.fls'
Latexmk: Examining 'resultats/test.log'
Latexmk: Missing bbl file 'resultats/test.bbl' in following:
 No file test.bbl.
Latexmk: References changed.
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'biber resultats/test'...
Rule 'biber resultats/test':  Reasons for rerun
Category 'other':
  Rerun of 'biber resultats/test' forced or previously required:
    Reason or flag: 'Initial set up of rule'

------------
Run number 1 of rule 'biber resultats/test'
------------
------------
Running 'biber  "resultats/test.bcf"'
------------
INFO - This is Biber 2.20
INFO - Logfile is 'resultats/test.blg'
INFO - Reading 'resultats/test.bcf'
INFO - Found 1 citekeys in bib section 0
INFO - Processing section 0
INFO - Looking for bibtex file 'bibliographie.bib' for section 0
INFO - LaTeX decoding ...
INFO - Found BibTeX data source 'bibliographie.bib'
INFO - Overriding locale 'en-US' defaults 'normalization = NFD' with 'normalization = prenormalized'
INFO - Overriding locale 'en-US' defaults 'variable = shifted' with 'variable = non-ignorable'
INFO - Sorting list 'nty/global//global/global/global' of type 'entry' with template 'nty' and locale 'en-US'
INFO - No sort tailoring available for locale 'en-US'
INFO - Writing 'resultats/test.bbl' with encoding 'UTF-8'
INFO - Output to resultats/test.bbl
Latexmk: Found biber source file(s) [bibliographie.bib, resultats/test.bcf]
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Changed files or newly in use/created:
  resultats/test.aux
  resultats/test.bbl
  resultats/test.out

------------
Run number 2 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="resultats"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file 'resultats/test.log'
Latexmk: Examining 'resultats/test.fls'
Latexmk: Examining 'resultats/test.log'
Latexmk: Found input bbl file 'resultats/test.bbl'
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Changed files or newly in use/created:
  resultats/test.aux
  resultats/test.run.xml

------------
Run number 3 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="resultats"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file 'resultats/test.log'
Latexmk: Examining 'resultats/test.fls'
Latexmk: Examining 'resultats/test.log'
Latexmk: Found input bbl file 'resultats/test.bbl'
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Changed files or newly in use/created:
  resultats/test.run.xml

------------
Run number 4 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="resultats"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file 'resultats/test.log'
Latexmk: Examining 'resultats/test.fls'
Latexmk: Examining 'resultats/test.log'
Latexmk: Found input bbl file 'resultats/test.bbl'
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: All targets (resultats/test.pdf) are up-to-date

$ latexmk -lualatex -outdir=résultats -synctex=1 -interaction=batchmode 'test.tex' -gg
Rc files read:
  NONE
Latexmk: This is Latexmk, John Collins, 31 Jan. 2024. Version 4.83.
Latexmk: making output directory 'résultats'
Latexmk: Doing main (small) clean up for 'test.tex'
No existing .aux file, so I'll make a simple one, and require run of *latex.
Force everything to be remade.
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Category 'other':
  Rerun of 'lualatex' forced or previously required:
    Reason or flag: 'go_mode'

------------
Run number 1 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="résultats"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file 'résultats/test.log'
Latexmk: Examining 'résultats/test.fls'
Latexmk: Examining 'résultats/test.log'
Latexmk: Missing bbl file 'résultats/test.bbl' in following:
 No file test.bbl.
Latexmk: References changed.
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'biber résultats/test'...
Rule 'biber résultats/test':  Reasons for rerun
Category 'other':
  Rerun of 'biber résultats/test' forced or previously required:
    Reason or flag: 'Initial set up of rule'

------------
Run number 1 of rule 'biber résultats/test'
------------
------------
Running 'biber  "résultats/test.bcf"'
------------
Wide character in die at -e line 624.
Can't open résultats/test.blg (No such file or directory) at /tmp/par-62656e6a616d696e/cache-8e80c9c14f39e44498a1091586b807a0d52ef04a/inc/lib/Log/Log4perl/Appender/File.pm line 151.
Latexmk: Error return from 'biber résultats/test'
I will add to its source list, anything cached from analysis of bcf file.
Latexmk: Summary of warnings from last run of *latex:
  Latex failed to resolve 1 citation(s)
Latexmk: ====Undefined refs and citations with line #s in .tex file:
  Citation 'jacques21grammar' on page 1 undefined on input line 13
Latexmk: Errors, so I did not complete making targets
Collected error summary (may duplicate other messages):
  biber résultats/test: Could not open biber log file for 'résultats/test'

Latexmk: Sometimes, the -f option can be used to get latexmk
  to try to force complete processing.
  But normally, you will need to correct the file(s) that caused the
  error, and then rerun latexmk.
  In some cases, it is best to clean out generated files before rerunning
  latexmk after you've corrected the files.

$ latexmk -lualatex -outdir=啊啊啊 -synctex=1 -interaction=batchmode 'test.tex' -gg
Rc files read:
  NONE
Latexmk: This is Latexmk, John Collins, 31 Jan. 2024. Version 4.83.
Latexmk: making output directory '啊啊啊'
Latexmk: Doing main (small) clean up for 'test.tex'
No existing .aux file, so I'll make a simple one, and require run of *latex.
Force everything to be remade.
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Category 'other':
  Rerun of 'lualatex' forced or previously required:
    Reason or flag: 'go_mode'

------------
Run number 1 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="啊啊啊"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file '啊啊啊/test.log'
Latexmk: Examining '啊啊啊/test.fls'
Latexmk: Examining '啊啊啊/test.log'
Latexmk: Missing bbl file '啊啊啊/test.bbl' in following:
 No file test.bbl.
Latexmk: References changed.
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'biber 啊啊啊/test'...
Rule 'biber 啊啊啊/test':  Reasons for rerun
Category 'other':
  Rerun of 'biber 啊啊啊/test' forced or previously required:
    Reason or flag: 'Initial set up of rule'

------------
Run number 1 of rule 'biber 啊啊啊/test'
------------
------------
Running 'biber  "啊啊啊/test.bcf"'
------------
INFO - This is Biber 2.20
INFO - Logfile is '啊啊啊/test.blg'
Wide character in print at /tmp/par-62656e6a616d696e/cache-8e80c9c14f39e44498a1091586b807a0d52ef04a/inc/lib/Log/Log4perl/Appender/Screen.pm line 57.
INFO - Reading './啊啊啊/test.bcf'
INFO - Found 1 citekeys in bib section 0
INFO - Processing section 0
INFO - Looking for bibtex file 'bibliographie.bib' for section 0
INFO - LaTeX decoding ...
INFO - Found BibTeX data source 'bibliographie.bib'
INFO - Overriding locale 'en-US' defaults 'normalization = NFD' with 'normalization = prenormalized'
INFO - Overriding locale 'en-US' defaults 'variable = shifted' with 'variable = non-ignorable'
INFO - Sorting list 'nty/global//global/global/global' of type 'entry' with template 'nty' and locale 'en-US'
INFO - No sort tailoring available for locale 'en-US'
INFO - Writing '啊啊啊/test.bbl' with encoding 'UTF-8'
INFO - Output to 啊啊啊/test.bbl
Latexmk: Found biber source file(s) [./啊啊啊/test.bcf, bibliographie.bib]
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Changed files or newly in use/created:
  啊啊啊/test.aux
  啊啊啊/test.bbl
  啊啊啊/test.out

------------
Run number 2 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="啊啊啊"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file '啊啊啊/test.log'
Latexmk: Examining '啊啊啊/test.fls'
Latexmk: Examining '啊啊啊/test.log'
Latexmk: Found input bbl file '啊啊啊/test.bbl'
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Changed files or newly in use/created:
  啊啊啊/test.aux
  啊啊啊/test.run.xml

------------
Run number 3 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="啊啊啊"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file '啊啊啊/test.log'
Latexmk: Examining '啊啊啊/test.fls'
Latexmk: Examining '啊啊啊/test.log'
Latexmk: Found input bbl file '啊啊啊/test.bbl'
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: applying rule 'lualatex'...
Rule 'lualatex':  Reasons for rerun
Changed files or newly in use/created:
  啊啊啊/test.run.xml

------------
Run number 4 of rule 'lualatex'
------------
------------
Running 'lualatex  -synctex=1 -interaction=batchmode -recorder -output-directory="啊啊啊"  "test.tex"'
------------
This is LuaHBTeX, Version 1.18.0 (TeX Live 2024) 
 restricted system commands enabled.
SyncTeX written on test.synctex.gz.
Latexmk: Getting log file '啊啊啊/test.log'
Latexmk: Examining '啊啊啊/test.fls'
Latexmk: Examining '啊啊啊/test.log'
Latexmk: Found input bbl file '啊啊啊/test.bbl'
Latexmk: Log file says output to 'test.pdf'
Latexmk: Bibliography file(s) from .bcf file:
  bibliographie.bib
Latexmk: All targets (啊啊啊/test.pdf) are up-to-date

$ biber résultats/test.bcf
Wide character in die at -e line 624.
Can't open résultats/test.blg (No such file or directory) at /tmp/par-62656e6a616d696e/cache-8e80c9c14f39e44498a1091586b807a0d52ef04a/inc/lib/Log/Log4perl/Appender/File.pm line 151.

$ biber 啊啊啊/test.bcf
INFO - This is Biber 2.20
INFO - Logfile is '啊啊啊/test.blg'
Wide character in print at /tmp/par-62656e6a616d696e/cache-8e80c9c14f39e44498a1091586b807a0d52ef04a/inc/lib/Log/Log4perl/Appender/Screen.pm line 57.
INFO - Reading './啊啊啊/test.bcf'
INFO - Found 1 citekeys in bib section 0
INFO - Processing section 0
INFO - Looking for bibtex file 'bibliographie.bib' for section 0
INFO - LaTeX decoding ...
INFO - Found BibTeX data source 'bibliographie.bib'
INFO - Overriding locale 'en-US' defaults 'normalization = NFD' with 'normalization = prenormalized'
INFO - Overriding locale 'en-US' defaults 'variable = shifted' with 'variable = non-ignorable'
INFO - Sorting list 'nty/global//global/global/global' of type 'entry' with template 'nty' and locale 'en-US'
INFO - No sort tailoring available for locale 'en-US'
INFO - Writing '啊啊啊/test.bbl' with encoding 'UTF-8'
INFO - Output to 啊啊啊/test.bbl

jccollins commented 8 months ago

I've been able to reproduce this. I needed to be on linux in a directory that is on an ext4 file system. It appears that when biber tries to open the .blg file, the name it uses is in NFD instead of NFC, even when the directory name is specified in the NFC form. This causes exactly the error message shown when the directory name contains an accented character.
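
The difference between the two normalization forms, and a likely reason why a directory name such as 啊啊啊 does not trigger the failure, can be illustrated with a short Python sketch. It is only an illustration, not biber's Perl code, and the variable names are chosen for this example.

```python
import unicodedata

# 'résultats' as typed on a typical keyboard: é is the single code point U+00E9 (NFC).
nfc = unicodedata.normalize("NFC", "r\u00e9sultats")
# NFD splits é into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfd = unicodedata.normalize("NFD", nfc)

# The two strings look identical on screen but are different byte sequences,
# so a normalization-sensitive file system treats them as different names.
print(nfc.encode("utf-8"))
print(nfd.encode("utf-8"))

# CJK characters such as 啊 have no canonical decomposition, so NFC and NFD
# coincide and the directory name is the same in either form.
cjk = "啊啊啊"
cjk_unchanged = unicodedata.normalize("NFD", cjk) == cjk
print(cjk_unchanged)
```

On a file system that compares names byte for byte, opening the NFD spelling of a directory that exists only under its NFC spelling fails with "No such file or directory".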

The actual listing in the bug report of the error message ("Can't open résultats/test.blg") is in NFC, presumably because of something done by a pasting operation in a web browser.

On macOS and APFS (which is normalization insensitive, but normalization preserving), when the directory name does not contain an accented character, but the base name of the .tex file does contain an accented character, then the name of the .blg file is in NFD. In contrast, the .bbl filename is in NFC. This is given that the name of the .tex is in NFC.

The version of biber is 2.20 (in TeXLive 2024).

On combinations of OS and file system (e.g., macOS and APFS) that are insensitive to Unicode normalization of filenames, latexmk invoked as in the bug report does not raise an error.

plk commented 8 months ago

Looks like I forgot to NFC the filename. biber is all NFD internally and it should NFC everything on output but it looks like this was missed. Can you try biber 2.21 DEV version from SF?

BenjaminGalliot commented 8 months ago

I tried, but I'm not familiar enough with Perl to be able to generate the executable to test from the sources! Sorry! :sweat_smile:

jccollins commented 8 months ago

I tried the 2.21.beta version, and it worked, provided that the file and directory names on linux were all NFC.

But on a Unicode-normalization-sensitive system, it now fails if the names aren't NFC. That's unlikely to be the case for most users in Western Europe, since when typing in characters, typical keyboard layouts give pre-composed characters, i.e., NFC. So they will create files with NFC names. At least if the files are created within the linux

However, on macOS, suppose I have a file or directory whose name is NFC. Then I rename the file in the Finder, without even touching the non-ASCII characters. After the rename, the name is NFD! I've seen complaints about that on the web. (Korean users seem to be particularly bothered.) Command line commands (mv, etc) don't have this problem.

Luckily, at least by default, the macOS and its file systems are Unicode-normalization insensitive, so this issue doesn't seem to be too big a deal for our purposes. But transferring the files to linux could cause all kinds of interesting anomalies! It might be useful to have a little script to rename all files and directories to have a particular normalization. Perhaps one already exists.

Pdflatex, at least on TeXLive 2024, preserves Unicode normalization from the .tex filename to the names of generated files, and the same applies to latexmk. I haven't tried this with xelatex and lualatex, but I would conjecture they have the same behavior.

Would it not be better for biber to preserve the normalization of what's on the command line, since that matches better the behavior of the other programs involved? (With latexmk I went through an initial phase of thinking the internal use of NFD would be a good idea; there are recommendations that that is the "correct" thing to do. But that led to a minefield of other complications, so I abandoned that.) What problems would that changed behavior lead to?

John

plk commented 7 months ago

Well, you have to use NFD internally because there are lots of tricky things that have to be done with independent combining chars etc. I can, however, have a look at preserving filenames from the form of the .bcf file.

plk commented 7 months ago

Please try 2.21 from SF again.

jccollins commented 7 months ago

On 4/13/24 2:19 PM, plk wrote:

> Well, you have to use NFD internally because there lots of tricky things that have to be done with independent combining chars etc. I can however, have a look at preserving filenames from the form of the .tex file.

For the textual content of things like the author fields in .bib files, I agree that the internal use of NFD is suitable. That's because for ordinary text, characters that differ by normalization are intended to be equivalent.

But for filenames, things are entirely different. The combinations of Windows with NTFS and FAT32, and linux with ext4 (and IIRC FAT32) are all normalization sensitive. E.g., in these cases it's perfectly possible to have two different files whose names are identical except for Unicode normalization.

So preservation of Unicode normalization of filenames is compulsory, as far as I can see. There are effectively two different worlds of strings: Those for ordinary text and those for filenames.

Of course, if you are typing filenames in Windows and linux, you are probably going to get only NFC filenames, at least with standard keyboard layouts for many Western European languages.

But it's easily possible to get NFD filenames if you generate the files on macOS and transfer by a normalization-preserving method (e.g., in a zip file from macOS to unix). That's because GUIs in macOS coerce filenames to NFD. That doesn't matter much on macOS, since by default it is insensitive to the Unicode normalization of filenames. But once the files are on linux or Windows, there are complications.

John

jccollins commented 7 months ago

On 4/14/24 10:27 AM, plk wrote:

> Please try 2.21 from SF again.

Sorry, but it doesn't work. I see at least the following problems

  1. When I run this version of biber with a bcf file named NFC-café.bcf (with NFC coding), the bbl file has the name NFC-cafÃ©.bbl. I conjecture that what has happened is that Perl's encode subroutine was applied to a string that was already UTF-8 encoded.

I can reproduce this kind of situation in a Perl script if I do

      use utf8;
      use Encode qw(encode);
      my $orig = 'NFCé';
      my $enc1 = encode( 'UTF-8', $orig );   # correct UTF-8 bytes
      my $enc2 = encode( 'UTF-8', $enc1 );   # encodes the already encoded bytes again

The string $enc2 has content that is the UTF-8 encoding of 'NFCÃ©'. The string $enc1 has the correct UTF-8 encoding of the original string.

  2. The same error occurs in the blg file, in the strings that give the name of the .bbl file. I've attached a zip file containing an example.

  3. If the OS is linux, and the bcf file is in a directory named résultats, this version of biber, just like 2.20, still tries to write a .blg file whose name is the NFD version of what it should be writing. That gives a fatal error, since there is no directory whose name is the NFD version of 'résultats'.
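
The double-encoding conjecture in the first item can also be reproduced outside Perl. The Python sketch below is an analogue only, assuming that the second encode call treats each byte of the already encoded string as a Latin-1 character; it is not taken from biber, and the variable names simply mirror the Perl snippet.

```python
orig = "NFC\u00e9"                 # 'NFCé' as a decoded (character) string
enc1 = orig.encode("utf-8")        # correct UTF-8 bytes

# Encoding a second time: every byte of enc1 is taken to be a Latin-1
# character and is then encoded to UTF-8 again.
enc2 = enc1.decode("latin-1").encode("utf-8")

# Reading enc2 back as UTF-8 shows the characters a user would see.
seen = enc2.decode("utf-8")
print(enc1, enc2, seen)
```

The two bytes of the encoded é each become a separate character, which gives the familiar two-character garbling in the resulting file name.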

John

[Attachment: biber-issue.zip]

plk commented 7 months ago

I will have a look - I suspect that the log4perl module is doing some normalisation which isn't obvious as it's that which creates the .blg.

plk commented 7 months ago

Can you please try 2.21 dev again from SF?

jccollins commented 2 months ago

Sorry for my delay in replying, and dropping the ball on this. I hadn't checked this site, which I should have done, and didn't see the Apr 27 message until Benjamin recently pointed it out.

I've just tried the current 2.21 beta (as of 19 Sep 2024), and find the following:

  1. If the name (including directory component) of the bcf file is pure NFC, then the files from this version of biber have the correct names.
  2. The same if the name is pure NFD.
  3. But if the name of the bcf file has mixed Unicode normalization, then biber coerces the names of the output files (bbl and blg) to pure NFD. This creates a variety of problems, of which I'll give a realistic example below. I think this is incorrect behavior on Linux and Windows, since their standard file systems are defined to be Unicode-normalization sensitive.
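
The three cases above (pure NFC, pure NFD, mixed) can be told apart mechanically. The following Python sketch shows one way to classify a path; `normalization_form` is a hypothetical helper written for this illustration and is not part of biber or latexmk.

```python
import unicodedata

def normalization_form(path: str) -> str:
    """Classify a path as 'both', 'NFC', 'NFD' or 'mixed' (hypothetical helper)."""
    is_nfc = unicodedata.normalize("NFC", path) == path
    is_nfd = unicodedata.normalize("NFD", path) == path
    if is_nfc and is_nfd:
        return "both"    # e.g. pure ASCII, or characters without decompositions
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"       # some parts composed, others decomposed

# Example: an NFC directory combined with an NFD file name.
nfc_dir = unicodedata.normalize("NFC", "r\u00e9sultats")
nfd_file = unicodedata.normalize("NFD", "caf\u00e9.bcf")
print(normalization_form(nfc_dir + "/" + nfd_file))
```

A path classified as "mixed" cannot be reproduced by applying NFC or NFD to the whole string, which is why coercing it to either form changes at least one path component.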

In the simplest cases on Linux and Windows, files and directories have NFC names, in which case there is no problem.

However the situation with mixed normalization can arise in practice in a cross-OS situation:

  1. GUI programs on macOS (e.g., TeXShop) create files and directories with NFD names; that appears to be enforced by the OS. If these are transferred to Linux and Windows, it is often true that the normalization is preserved, and hence NFD names of .tex files do occur in reality on Linux and Windows systems.
  2. When such a file is compiled by *latex on Linux (or Windows) and an output directory is specified with a name like 'résultats', that name would typically be NFC. So the combination of directory and file name on the command line will have mixed normalization.
  3. When the current 2.21 beta version of biber is used on the resulting bcf file, it gives an error that it cannot write the blg file. This is because it is trying to create a file in a directory with the name NFD('résultats'), which probably doesn't exist.

I suspect it would be useful to have a little utility to rename files to be pure NFC (or pure NFD).
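
A utility of that kind could look roughly like the Python sketch below. It is a minimal illustration under stated assumptions, not an existing tool: `normalize_tree` is a hypothetical name, the tree is assumed to be writable, and a name is left untouched whenever the target spelling already exists.

```python
import os
import unicodedata

def normalize_tree(top: str, form: str = "NFC") -> list:
    """Rename every entry below `top` to the given normalization form.

    Returns the (old, new) path pairs that were renamed. Hypothetical helper.
    """
    renamed = []
    # Walk bottom-up, so that the contents of a directory are handled before
    # the directory itself is renamed and the paths from os.walk stay valid.
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        for name in dirnames + filenames:
            target = unicodedata.normalize(form, name)
            if target == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, target)
            if os.path.exists(new):
                # Never overwrite: on a normalization-sensitive file system
                # both spellings may exist as distinct files.
                continue
            os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

Calling `normalize_tree("project")` would convert names to NFC, and `normalize_tree("project", "NFD")` to NFD. The top-level directory itself is not renamed.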

plk commented 2 months ago

I put in a fallback for 2.21 dev which selects an appropriate default for mixed form, depending on OS (NFD for mac, NFC for everything else).

jccollins commented 2 months ago

I've tried the new version. It still doesn't solve the problem. Even so, this version is an improvement on the 2.20 release, which doesn't work when the name of the bcf file contains an accented character coded as NFC.

First, on macOS, there's actually no need to set a normalization form for file names, since the file systems are normalization insensitive.

On Linux (with its default file system ext4 at least), there are two use cases for the command line to biber:

  1. The name of the .bcf file is obtained directly or indirectly from a file click in a GUI program, or from tab completion at a terminal prompt. In that case, the normalization of the filename is exactly what it is on disk, and there will be errors if biber changes the normalization. The situation is no different than if biber were to force file names to be all upper-case or all lower-case.
  2. The user types the filename at the command line using normal keyboard methods for entering accented characters. Almost always, the typed name will be NFC. If that matches the on-disk name, there will not be a problem. But if they don't match, because of different normalization, then there will be a file not found error.

I've tried this on a file with the name NFC('NFC-résultats').'/'.NFD('NFD-café.bcf'), in Perl notation. This is a name that could well occur in practice (see my comment from yesterday). Here's the result:

INFO - This is Biber 2.21 (beta)
INFO - Logfile is 'NFC-résultats/NFD-café.blg'
Wide character in print at /usr/local/share/perl/5.38.2/Log/Log4perl/Appender/Screen.pm line 57.
ERROR - Cannot find control file 'NFC-résultats/NFD-café.bcf'! - Did latex run successfully on your .tex file before you ran biber?
INFO - ERRORS: 1

Both the .blg and .bbl files are written, but with the base name being NFC('NFD-café'). Because of the error, the .bbl file is incorrect (zero length). The line of error message has exactly the same mixed coding as on the command line. (Note that the github web interface coerces that to NFC in this comment.) The Perl error on the line above indicates that there is a coding problem in line 57 of Screen.pm. Presumably a decoded Unicode string was passed to print, but no coding system was specified for stdout, and the Unicode string contains one or more code points that are above 255. This conjecture matches the situation for the given filename, since the string contains a COMBINING ACUTE ACCENT, whose decimal code point is 769.
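
The arithmetic behind that conjecture can be checked with a short Python sketch. It is an analogue only: Perl emits a "Wide character" warning where Python raises an exception, and Latin-1 stands in here for an output stream with no declared encoding.

```python
accent = "\u0301"            # COMBINING ACUTE ACCENT
code_point = ord(accent)     # decimal value of the code point

# A one-byte-per-character output (Latin-1 here) cannot represent any
# code point above 255.
try:
    accent.encode("latin-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False

# With an explicit UTF-8 encoding the same character is written as two bytes.
utf8_bytes = accent.encode("utf-8")
print(code_point, fits_in_one_byte, utf8_bytes)
```

The same reasoning applies to any file name that contains a decomposed accent or a CJK character, both of which lie above code point 255.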

It's perhaps worth adding that there are some related issues with v. 2.19 of biber in all but the case with pure NFC filenames. In addition, 2.19 mangles the coding of some of its output to at least the screen and the .blg file. V. 2.21 (beta) appears not to do this, so it is an improvement. (I verified this just now by switching to TeX Live 2023.)

plk commented 2 months ago

I think it is likely impossible to fix this - when you read in a file and the path has mixed NFD/NFC, you can't write the path in mixed mode (without a great deal of really hacky messing about). Right now, biber checks the form of the input path and if it's mixed, it outputs the default form for the OS, as mentioned. If people have mixed forms on disk, they should deal with that - there are tools on Linux to convert dirs/files to NFC (convmv etc.). Do you have any reference for the form-neutrality of the default MacOS FS? Famously, it's NFD, which causes all sorts of issues with rsync and git.

jccollins commented 2 months ago

I don't understand what is meant by "you can't write the path in mixed mode". I know explicitly that in Perl if I create a file with mixed normalization by Perl's open function, the on-disk filename is exactly the one specified. This is at least true on:

  1. APFS on macOS,
  2. ext4 and FAT32 on Ubuntu,
  3. NTFS and (IIRC) FAT32 on Windows 11.

In the Ubuntu and Windows 11 cases, I can have different files whose names only differ by normalization.

(On macOS, with its older HFS+, the filename is coerced to NFD, independently of the string given to the open function.)

The default situation for APFS on macOS is that it is (a) insensitive to both normalization and case, and (b) preserves both. This is my experience. The only reasonably official documentation I was able to find quickly is https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html That document is an old one, which says it is "retired". But its statements about normalization-insensitivity match my current experience. That's for macOS. I remember reading about a different situation on iOS, but I've no experience, and I think it is irrelevant here.

jccollins commented 2 months ago

It's definitely possible to fix the problem. Existence proofs:

  1. All of the *tex programs in TeXLive that I've used preserve normalization between input and output files: E.g., the base name of the .log, .aux, etc files is exactly the same as that of the main .tex file (including normalization).
  2. Latexmk also has no problems. The base names of the files it generates are obtained by copying from the base name of the source file, without any changes. It has always done this, and exactly nothing needed to be changed to preserve Unicode normalization.
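
The approach described in the second item, copying the base name unchanged and replacing only the extension, can be sketched in Python as follows. `output_name` is a hypothetical helper for illustration, not latexmk's or biber's actual code.

```python
import os
import unicodedata

def output_name(bcf_path: str, ext: str) -> str:
    """Derive an output file name by swapping the extension only (hypothetical helper)."""
    root, _ = os.path.splitext(bcf_path)
    return root + ext        # no NFC/NFD call anywhere on the path

# Mixed normalization: NFC directory name, NFD base name.
bcf = (unicodedata.normalize("NFC", "NFC-r\u00e9sultats") + "/"
       + unicodedata.normalize("NFD", "NFD-caf\u00e9.bcf"))
blg = output_name(bcf, ".blg")

# The directory part is still NFC and the base name is still NFD.
print(os.path.dirname(blg) == os.path.dirname(bcf))
```

Because the string is never passed through a normalization function, the output name matches the input name code point for code point, whatever mixture of forms the input uses.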

Literally all the difficulties with Unicode and latexmk that I encountered were in dealing with the code page issue on Windows. That's only because of deficiencies in current versions of native Windows perl interpreters. (As you know with biber, the problems can be solved by setting the Windows system locale to use UTF-8.)

[It may be useful to copy the relevant part of latexmk's solution for the code page issue into biber (and to copy into latexmk the things biber does with the use of the wide, i.e., Unicode, interface to the Windows file system).]

jccollins commented 2 months ago

A further comment about macOS, APFS and normalization: If you do a Google search for APFS and normalization, most of the hits you get contain statements that are quite misleading, if not wrong. I suspect there was a change in the implementation of APFS after the first version was released, and the comments seen refer to the old version.

jccollins commented 2 months ago

I've been able to incorporate a version of latexmk's treatment of Windows code pages, so that biber now works on Windows when filenames contain non-ASCII characters, independently of the setting of the System locale/code page. It continues to work on Linux (if file and directory names are NFC) and on macOS.

I need to clean up my code before I submit it.

plk commented 2 months ago

Would you be interested in helping develop/support biber since you know Perl? It's currently just me and I have to think about the future of support for it at some point ...

plk commented 2 months ago

So, we should basically not touch any normalisation for filenames, just for file contents. Now, getting back into this, I've removed all the messing about from the filename (not file content) normalisation which should be better but I think that leaves us with a problem with the .blg as before?

jccollins commented 2 months ago

I'll merge my changes into the version you've made. That should solve a lot of problems.

In addition to handling the Windows code page issue, I found the following: The calls to the file system for opening files tended to use decoded strings rather than encoded byte strings. Generally that gives problems, which become particularly visible on Windows. So I corrected all the cases which I've found so far.
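
A minimal sketch of the distinction between decoded strings and encoded byte strings, assuming a file system that takes UTF-8 bytes (usual on Linux and macOS). The file name and the `$fs_encoding` variable are illustrative only and are not taken from biber's code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: the file system takes UTF-8 byte strings.
my $fs_encoding = 'UTF-8';

my $dir  = tempdir(CLEANUP => 1);                   # byte string from the OS
my $name = "r\x{e9}sum\x{e9}.txt";                  # decoded (character) string
my $path = "$dir/" . encode($fs_encoding, $name);   # bytes for the file system

open(my $fh, '>', $path) or die "Cannot open file: $!";
print {$fh} "ok\n";
close $fh;

# Names coming back from the file system are bytes too; decode before use as text.
opendir(my $dh, $dir) or die "Cannot open directory: $!";
my @found = map { decode($fs_encoding, $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;
print scalar(@found), " file(s) found\n";
```

Passing the decoded `$name` straight to `open` would leave the bytes sent to the file system dependent on Perl's internal representation of the string, which is the kind of problem described above.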

Perhaps you should hold off on further changes until I send mine to the repository. (My changes affect biber itself, Biber/Config.pm, and Biber/Output/base.pm.)

plk commented 2 months ago

@jccollins - I added you to biber as a Collaborator - you should get an invite.

jccollins commented 2 months ago

I pushed my updates to the repository.

In the testfiles directory I added a directory of files with non-trivial Unicode names. They are useful for tests.

In the bin directory there's an extra executable that I use for exercising my code for handling codepage issues (by an added module CodePage.pm).

Could you check how things are working: This is my first time for uploads to a github repository.

jccollins commented 2 months ago

My new version worked for most of my tests. It failed only on linux when given the hard case of mixed-normalization filenames. (Probably the same issue would arise on Windows when the UTF-8 system locale is used.)

I've tracked the problem down to Utils.pm, where some normalization of filenames is done. That file may have some other issues.

plk commented 2 months ago

I fixed a test issue and rearranged some things a bit - there's no need to put anything in comments about in-progress etc. in the files, we can just use github comments for that. The standard regression test suite passes now. I changed the version back to "2.21" as some install scripts rely on the format. I think you're right that we will have to look at the Utils.pm subs which mess about with filename normalisation.

jccollins commented 2 months ago

I've updated Utils.pm. Things now work in all the tests I've done. What I did:

  1. Removed NFC and NFD applied to filenames, except in file_exist_check.
  2. In file_exist_check, which tests for variants, I added a test for existence of the file name with the given form, as well as its NFC and NFD variants.
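
The variant test in item 2 can be sketched as follows. This is not biber's `file_exist_check`: the helper name `find_existing_variant` is hypothetical, and a UTF-8 file system is assumed.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);
use File::Temp qw(tempdir);

# Hypothetical helper: try the name as given, then its NFC and NFD forms.
sub find_existing_variant {
    my ($name) = @_;                               # decoded character string
    my %seen;
    for my $candidate ($name, NFC($name), NFD($name)) {
        next if $seen{$candidate}++;               # skip forms already tried
        my $bytes = encode('UTF-8', $candidate);   # assumes a UTF-8 file system
        return $candidate if -e $bytes;
    }
    return;                                        # no variant exists
}

# Demonstration: the file on disk has an NFD name, the request is NFC.
my $dir     = tempdir(CLEANUP => 1);
my $on_disk = NFD("caf\x{e9}.bib");
open(my $fh, '>', "$dir/" . encode('UTF-8', $on_disk)) or die "Cannot create file: $!";
close $fh;

my $found = find_existing_variant("$dir/" . NFC("caf\x{e9}.bib"));
print defined $found ? "found a variant\n" : "not found\n";
```

On a normalization-sensitive file system (typical on Linux) the NFC lookup fails and the NFD form is the one that is found.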

I preferred to leave in some comments about difficulties remaining. I find it a lot easier when I am working on a file, to have such remarks in the file.

I also added a zip file of the Unicode-tests directory. That's better than the contents of the directory, whose normalization tends not to survive the round trip to and from github. A .tar file also won't work because on macOS it doesn't always preserve normalization (as I found by trying it).

Perhaps we should remove the Unicode-tests directory, since the normalization of the filenames tends not to survive in the various kinds of processing.

plk commented 2 months ago

Feel free to remove the unicode files if there is a ZIP for testing - I pulled the Utils.pm changes and the regression test suite passes without incident.

jccollins commented 2 months ago

Done.

jccollins commented 2 months ago

I've made some minor changes in CodePage.pm and Config.pm:

  1. Previously, I'd defined a function to decode @ARGV. But it is only used once, and now the function seems to be an unnecessary abstraction. I've removed the function's definition from CodePage.pm, and replaced its use in Config.pm by a more transparent decoding of @ARGV. This doesn't affect the functioning of biber.
  2. In Config.pm, I've applied a decoding from system CS to the name of the log file when that is obtained from the command line.

As far as I can tell, things are now working correctly for files specified on the command line to biber. There are the following restrictions:

  1. On Windows, there is the usual restriction for all perl scripts that the command line can only contain characters that are in the system CP (e.g., CP1252 for US and many Western European countries).
  2. On Linux, the filenames must be Unicode encoded in UTF-8. This is the usual situation nowadays. (But Linux is emphatic in saying that filenames are in principle strings of bytes with only NULL and '/' excluded.)

The remaining problems I know concern the names of .bib files specified in the .bcf file. The problematic situation is when the filename or glob pattern supplied to \addbibresource in .tex file contains accented characters. I've got a good diagnosis, and will report in a day or so.

BenjaminGalliot commented 2 months ago

The problematic situation is when the filename or glob pattern supplied to \addbibresource in .tex file contains accented characters

Good job for noticing this, which could easily happen to me.

jccollins commented 2 months ago

The problematic situation is when the filename or glob pattern supplied to \addbibresource in .tex file contains accented characters

Good job for noticing this, which could easily happen to me.

What I've diagnosed is that when biber reads the .bcf file it coerces the filenames and patterns for the datasources to NFD, instead of the NFC that is normal for Western European languages on Windows and Linux. This will need to be corrected, of course. Philip knows the relevant part of the code (I think it's in Biber/Input/file/bibtex.pm and Biber/Input/file/biblatexml.pl).

Then what happens is that *latex writes the intended name correctly to the .bcf file, but biber in effect misreads it. So biber ends up looking for a differently named file on disk than the one the user intended. (On both Windows and Linux, it is unfortunately perfectly possible to have two or more distinct files with names that differ only in the normalization form for accented characters, so that the names are visually identical!)
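
To illustrate why such names are distinct to the file system, the following sketch prints the UTF-8 bytes of one visually identical name in NFC and in NFD. The file name is an arbitrary example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

my $nfc = NFC("caf\x{e9}.bib");   # e-acute as one code point, U+00E9
my $nfd = NFD($nfc);              # "e" followed by combining acute, U+0301

for my $name ($nfc, $nfd) {
    my $bytes = encode('UTF-8', $name);
    my $hex   = join ' ', map { sprintf '%02x', ord } split //, $bytes;
    printf "%d characters, bytes: %s\n", length($name), $hex;
}
```

The two byte sequences differ (`c3 a9` versus `65 cc 81` for the accented letter), so a file system that compares names byte by byte treats them as two different files.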

My general approach in this and similar cases is, if possible, to choose a glob pattern that doesn't contain any of the problematic characters, but which is chosen to uniquely identify the file. The current development version of biber treats that correctly.

jccollins commented 1 month ago

I've modified the subroutine glob_data_file in Utils.pm so that when it does a glob, it does a glob on the NFC and NFD variants of the glob pattern, as well as the original pattern. (But for each of the variants, it omits the NFC glob if the original pattern is already NFC, and similarly for NFD.) This matches the behavior of the file_exist_check.
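
A rough sketch of globbing on the original, NFC and NFD forms of a pattern. It is not the actual `glob_data_file` code: the helper name `glob_all_forms` is hypothetical, and a UTF-8 file system is assumed.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Unicode::Normalize qw(NFC NFD);
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Hypothetical helper: glob the pattern as given, then its NFC and NFD forms.
sub glob_all_forms {
    my ($pattern) = @_;                     # decoded character string
    my (%tried, %hit);
    for my $form ($pattern, NFC($pattern), NFD($pattern)) {
        next if $tried{$form}++;            # omit a form identical to one already tried
        for my $match (bsd_glob(encode('UTF-8', $form))) {
            next unless -e $match;          # bsd_glob can return a wildcard-free pattern unchanged
            $hit{$match} = 1;
        }
    }
    return map { decode('UTF-8', $_) } sort keys %hit;
}

# Demonstration: NFD name on disk, NFC pattern from the user.
my $dir     = tempdir(CLEANUP => 1);
my $on_disk = "cafe\x{301}-refs.bib";
open(my $fh, '>', "$dir/" . encode('UTF-8', $on_disk)) or die "Cannot create file: $!";
close $fh;

my @files = glob_all_forms("$dir/caf\x{e9}*.bib");
print scalar(@files), " match(es)\n";
```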

This change appears to cover the most important cases when the argument to \addbibresource contains accented characters. These are when that string and the filenames are each in a definite NF, but that NF differs between the on-disk filename and the name in the .tex file.

I've also added a directory containing the .bib and .tex files that I used for tests, together with a Perl script to do the tests.

The subroutine glob_data_file still contains the statements for printing diagnostic messages about the strings involved. They make it easy to see that the filename argument to the subroutine is coerced to NFD, but that files with names of the opposite NF are now found.

jccollins commented 1 month ago

I've looked more into the problem about NFD being applied to the file/pattern names that were given to \addbibresource in the .tex file. My conjecture about where this was happening was entirely wrong. It's happening in Biber.pm itself, in the subroutine parse_ctrlfile.

What happens is that parse_ctrlfile slurps in the contents of the .bcf file, immediately converts it to NFD, and then parses the result. See lines 421--423:

my $buf = slurp_switchr($ctrl_file_path)->$*;
$buf = NFD($buf);# Unicode NFD boundary
my $bcfxml = ....

For actual text fields, this is appropriate, I imagine, but not for filename fields. At this point I don't see a nice way of fixing things to preserve normalization of filenames while converting everything else to NFD. (Of course, there are ugly hacky ways.)

In the big picture, it may be unimportant to do better than what's in the code at the moment. It handles the simplest cases that are likely to arise in practice, like the example that started this bug report. In any case, if a user has trouble with non-ASCII filenames, there is always the standard advice that for maximum portability, one should restrict file names to the ASCII characters a-z, 0-9 and -, and leave Unicode stuff to the contents of .tex and .bib files.

plk commented 1 month ago

We could simply get the filenames and save them before the NFD call?

jccollins commented 1 month ago

I've seen a better possibility. This is simply to read the .bcf file line by line, preserve the datasource lines as is, and apply NFD only to the other lines. Finally the lines are concatenated to get a single string of XML code.
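
The line-by-line idea might look like the sketch below. The helper name is hypothetical, the test for a `<bcf:datasource` tag is an assumed simplification of how datasource lines are recognized, and the `bcf:note` element in the sample data is invented for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: NFD for every line except datasource lines.
sub nfd_except_datasources {
    my ($fh) = @_;
    my $out = '';
    while (my $line = <$fh>) {
        $out .= ($line =~ /<bcf:datasource\b/) ? $line : NFD($line);
    }
    return $out;
}

# Sample input held in memory as UTF-8 bytes, read through a decoding layer.
my $bcf_bytes = encode('UTF-8',
      "<bcf:datasource type=\"file\">caf\x{e9}.bib</bcf:datasource>\n"
    . "<bcf:note>caf\x{e9}</bcf:note>\n");
open(my $fh, '<:encoding(UTF-8)', \$bcf_bytes) or die "Cannot open in-memory file: $!";
my $xml = nfd_except_datasources($fh);
close $fh;

print length($xml), " characters after processing\n";
```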

This completely avoids the use of File::Slurper, but, of course, at the expense of slower read time. To know whether the slower read time matters, I measured the processing times for reading and applying NFD to the 150kB .bcf file for a 600+ page book of mine. The difference between using Slurper and the use of Perl's usual line-by-line reading methods is in the millisec range (on a modern Macbook Air). That is dwarfed by the 2+sec total time for biber.

I'll work on a fix.

plk commented 1 month ago

@jccollins - can we take out any raw "say" statements in CodePage.pm as these confuse any log parsers people are using - all output needs to go via Log4Perl, e.g. $logger->debug("...");

plk commented 1 month ago

Have a look at the new line in Biber.pm - I think we can do this in a one-liner.

jccollins commented 1 month ago

Your solution as a one-liner is nice, except that the regex for splitting the contents of the bcf file into lines needs correction. The original regex was /\R\z/, which matches a line break followed by end of string. But it doesn't match a line break in the middle of the string.

So I changed that to /\R/, which results in the desired splitting into lines. But then joining the modified lines back into a string doesn't give a string terminating with \n. That's unimportant for the uses made of the string. But I think it's nicer to have the resulting string match what was in the original string. That's solved by giving split(... a negative third argument, which overrides the default behavior of split which is to discard trailing null strings.
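
The behavior of the three `split` variants discussed here can be checked with a small standalone script. The sample string is made up, and joining with `"\n"` is an assumption about how the lines are put back together.

```perl
use strict;
use warnings;

my $buf = "line1\nline2\r\nline3\n";

# \R\z matches only a line break at the very end, so nothing is split off.
my @whole = split /\R\z/, $buf;

# \R matches every line break; split drops the trailing empty field by default.
my @lines = split /\R/, $buf;

# A negative limit keeps the trailing empty field.
my @keep = split /\R/, $buf, -1;

printf "%d, %d and %d pieces\n", scalar(@whole), scalar(@lines), scalar(@keep);

my $without_final_newline = join "\n", @lines;
my $with_final_newline    = join "\n", @keep;
```

With the negative limit, the rejoined string ends in a newline again, as the original did.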

As far as I can see, the result is correct. (I've done a few stress tests on Linux.)

jccollins commented 1 month ago

@plk About the raw say (and warn and print) statements in CodePage.pm:

The problem is that CodePage needs to do its thing with code pages etc very early, before anything else needs to use the results. This includes knowledge of the encoding of filenames in interactions with the file system, and the setting of the console to use UTF-8 (i.e., CP 65001).

I see a lot of initialization associated with the use of Log4Perl. So CodePage will run earlier than this initialization, and may therefore be unable to use Log4Perl directly. Correct me if I've misunderstood.

On the other hand, the messages that CodePage writes are mostly informational. I've found them helpful so that I know what's happening and when. But now I'm happy that the code is working as intended, I think the actual writing of the messages by CodePage is no longer important. However, they can be important for debugging.

So I propose that CodePage should simply save the messages in arrays: One for informational messages and one for warnings. Then after Config.pm has fully initialized the use of Log4Perl, it can deal with the saved messages from CodePage. Perhaps the informational messages need to be written only if the debug flag is on, or if CodePage gives warnings about something not working.
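
A sketch of this proposal, with invented names throughout: `My::CodePageSketch` stands in for CodePage.pm, and the two code references stand in for whatever logger methods would be used once the logger has been initialized.

```perl
use strict;
use warnings;

{
    package My::CodePageSketch;    # hypothetical stand-in for CodePage.pm
    my (@info, @warnings);
    sub note           { push @info, @_; return }
    sub caution        { push @warnings, @_; return }
    sub saved_info     { return @info }
    sub saved_warnings { return @warnings }
}

# Called once the logger exists. Informational messages are passed on only
# when debugging is on or when there is at least one warning.
sub flush_saved_messages {
    my ($log_info, $log_warn, $debug) = @_;
    my @warnings = My::CodePageSketch::saved_warnings();
    if ($debug || @warnings) {
        $log_info->($_) for My::CodePageSketch::saved_info();
    }
    $log_warn->($_) for @warnings;
    return;
}

# Early, before any logger is available: just record what happened.
My::CodePageSketch::note('console code page set to 65001');

# Later: hand the saved messages to the logging layer.
my @log;
flush_saved_messages(
    sub { push @log, "INFO: $_[0]" },
    sub { push @log, "WARN: $_[0]" },
    1,    # debug flag on
);
print "$_\n" for @log;
```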

In the rest of the biber code, I see several ways of sending messages via Log4Perl. What would be appropriate here?

plk commented 1 month ago

Should be fine about the \R, I opted for the \n because it doesn't need preserving in the .bcf. I have pushed a change to CodePage.pm and moved its load to Biber.pm. Should be fine as Log4Perl loads very early in the _initopts routine of Biber.pm. CodePage.pm now sends Log4Perl trace level messages as the codepage stuff is relatively esoteric for most users. Seems fine in my tests and all the code stuff appears in the .blg as expected when biber is called with --trace.

jccollins commented 1 month ago

Sorry, this is a long comment, but I think I need to explain some details.

Unfortunately, the new version doesn't work correctly on Windows. It reports Log4perl: Seems like no initialization happened. Forgot to call init()?. (There are also messages that are sent to the screen by statements that use Perl's print and warn and that haven't been converted to use the Logger.)

The problem is simply that CodePage.pm's initialization is done before the biber script starts its execution phase, while the initialization of Log4perl is done later, when the biber script does

my $biber = Biber->new($opts->%*);

and the new routine of Biber invokes Config.pm's _initopts, which initializes the logger.

The initialization code in the present (and previous) versions of CodePage.pm is executed as soon as it has been parsed, i.e., during Perl's Compilation Phase. That is because CodePage.pm is brought in by a use statement (from one of the modules that the biber script itself invokes by use), and the initialization code is not inside a subroutine definition.
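
The ordering can be seen in a small standalone script. A `BEGIN` block is used here in place of a real module, since `use Module` is equivalent to `BEGIN { require Module; Module->import(...) }`, so a module's top-level code runs at that same point. The `${^GLOBAL_PHASE}` variable needs Perl 5.14 or later.

```perl
use strict;
use warnings;

our @order;

# First in the file, but executed only when the run phase starts.
push @order, 'run-time statement, phase ' . ${^GLOBAL_PHASE};

# Later in the file, yet executed first: BEGIN blocks run as soon as they
# have been parsed, while the rest of the file is still being compiled.
BEGIN { push @order, 'BEGIN block, phase ' . ${^GLOBAL_PHASE} }

print "$_\n" for @order;
```

Any initialization done this way therefore happens before the script's first ordinary statement, including statements that set up a logger.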

What the initialization code does is:

  1. Detect the current code pages (CPs) used by Windows, and save the values. The saved values of the two console pages are used to restore the console CP settings when biber exits.
  2. It sets the console CPs to 65001, i.e., UTF-8, so that biber's output to the console behaves as it does on Linux and macOS, instead of giving garbage for non-ASCII characters.
  3. There is a third saved CP, the system CP, whose value is essential to know how the byte strings in @ARGV are encoded, and to know what encoding is used for the functions that interact with the file system (e.g., open, glob).
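
The three steps can be sketched with functions from the standard Win32 module. This is an illustration under assumptions, not the code in CodePage.pm; the variable names are invented, and on any system other than Windows the script does nothing.

```perl
use strict;
use warnings;

my %saved;        # original console code pages, restored at exit
my $system_cp;    # system (ANSI) code page, e.g. 1252

if ($^O eq 'MSWin32') {
    require Win32;

    # Step 1: detect and save the current code pages.
    %saved = (
        in  => Win32::GetConsoleCP(),
        out => Win32::GetConsoleOutputCP(),
    );

    # Step 3: the system code page governs @ARGV and file system calls.
    $system_cp = Win32::GetACP();

    # Step 2: switch the console to UTF-8.
    Win32::SetConsoleCP(65001);
    Win32::SetConsoleOutputCP(65001);
}

# Put the console back the way it was when the program ends.
END {
    if (%saved) {
        Win32::SetConsoleCP($saved{in});
        Win32::SetConsoleOutputCP($saved{out});
    }
}

print defined $system_cp
    ? "system code page: cp$system_cp\n"
    : "not on Windows, nothing to do\n";
```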

I see a way of arranging delayed execution of the initialization code, in a way which would have a number of advantages. But there's still a difficulty with biber's start up code, which will need significant changes.

The relevant steps in the current code are:

  1. While @ARGV is still not yet decoded, the options are parsed. Among these are ones for setting the log filename and the output filename to non-default values. There are some other similar cases, but I haven't looked for all of them. The results for the logfile and the like are byte strings suitable for the file system.
  2. Creation and initialization of a Biber object, which entails the next steps.
  3. Decoding of the part of @ARGV that is left after option processing. This requires knowledge of the system coding system (CS).
  4. The first element of @ARGV is the name of the .bcf file. It is now a proper Perl Unicode string, i.e., the result of a decode operation.
  5. It is worked out what the name of the log file is.
  6. This is passed as an encoded byte string to Log4perl. This step again needs knowledge of the system CP.

I think it would be simpler not to decode @ARGV at all. Then until it is worked out what the names of the files are, the relevant strings are encoded byte strings. They are the correct strings for interaction with the file system (and for initialization involving Log4perl!). It is only after that step that we need to decode them, to match the convention used in other parts of biber. Only then do we need to know the system CS. So the initialization of CodePage.pm should be delayed until this point, and then it would be done by an explicit call to an initialization routine.

I think it would be nice to avoid having CodePage.pm explicitly dependent on Log4perl. For example, I've found it a convenient module to use in simple test scripts. The important thing here is that when an initialization subroutine is used, the subroutine can have optional argument(s) to specify what is to be done with diagnostic and warning messages. I have in mind something like references to subroutines to be used instead of the usual print/say and warn. So something like a reference to a Log4perl method could be used here. This seems to be a nicer way to set the destination for messages from CodePage.pm. Instead of having it hard coded into CodePage.pm, it is set in the same routine as initializes the logger.
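
One way such an initialization routine could look is sketched below. All names (`My::CodePageInit`, `init`, `system_encoding`, the `info`, `warn` and `quiet` options) are hypothetical, and the returned encoding is a placeholder.

```perl
use strict;
use warnings;

{
    package My::CodePageInit;      # hypothetical, not the real CodePage.pm
    my %sink = (
        info => sub { print @_, "\n" },    # default: plain print
        warn => sub { warn  @_, "\n" },    # default: plain warn
    );
    my $initialized = 0;

    # Optional arguments: code refs for messages, and a flag to stay quiet.
    sub init {
        my (%opt) = @_;
        for my $kind (qw(info warn)) {
            $sink{$kind} = $opt{$kind} if ref($opt{$kind}) eq 'CODE';
        }
        $initialized = 1;
        $sink{info}->('code page initialization done (placeholder)')
            unless $opt{quiet};
        return;
    }

    # Falls back to default initialization if init() was never called.
    sub system_encoding {
        init(quiet => 1) unless $initialized;
        return 'UTF-8';                    # placeholder value
    }
}

# The caller decides where messages go; a logger method could be wrapped
# in the same way as this collecting closure.
my @trace;
My::CodePageInit::init(info => sub { push @trace, @_ });
print My::CodePageInit::system_encoding(), "\n";
```

Because the destination is passed in by the caller, the module itself needs no knowledge of any particular logging package.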

Another optional argument could indicate whether informational messages about code pages are to be given at all.

One complication is that it is possible to inadvertently call routines like decode_CS_system in CodePage.pm before initialization has been done. I think it would be fairly straightforward to arrange that in such cases, initialization gets done (but with default settings).

jccollins commented 1 month ago

I've another idea for CodePage that I'm going to try out.

jccollins commented 1 month ago

I've changed CodePage.pm so that it no longer directly uses Log4perl, and is thus usable by scripts that don't use Log4perl. But it continues to stay silent, i.e., it does not write to the screen the information about its actions, which is appropriate when it is working normally. But it saves the information so that it can be accessed by the program/caller, to be used as appropriate.

The relevant information is now put into the tracing information in the logger by other modules.

I've corrected a bug in CodePage.pm that when there was an error in later parts of perl's compilation phase, the original Windows code page would not be restored after program termination.

As far as I can tell, the new version is working properly in my stress tests for non-ASCII file names on all of Windows, Linux and macOS.

With my inexperience with git, I got into trouble with your earlier commits today. I think I've corrected the problems. But you should double check.

plk commented 1 month ago

Looks fine - all regression tests pass.

jccollins commented 1 month ago

I've made a small change in Utils.pm. Instead of using the setting of the winunicode option to determine whether or not wide calls to the Windows file system are used, I've replaced that by use of the result from CodePage.pm about the system CP. I've added a utility function in CodePage.pm to make that simple and self-documenting.

This avoids the possibility that the user's use or non-use of the --winunicode option doesn't reflect the actual system configuration.

Does this sound right?

plk commented 1 month ago

Yes, I suspect you are right about this - the CodePage solution is a more general solution than the --winunicode option. Once you are happy with that, I can change the documentation to say this is automatic now.

jccollins commented 1 month ago

I'm happy with how it's working, so you can update the documentation.

The only question I have is what should happen with the --winunicode/-W option. I see 3 possibilities:

  1. Drop the option completely.
  2. Keep it for the sake of backward compatibility, but just give a warning if it's used, and a further warning if CodePage detects a non-UTF-8 system locale, i.e., when the user's view given by the option doesn't match reality. (It's more correct to say "a locale that doesn't support full Unicode" than a "non-UTF-8 locale".)
  3. Repurpose it to mean that the user requires biber only to run if it's on a UTF-8 system (i.e., linux, macOS or Windows with a UTF-8 system locale). That would be useful if a user has files whose names don't fit into the user's standard system CP, but wants to give them on the command line to biber. E.g., on my Windows system, the system CP is 1252, for Latin1. So on the command line for biber (or other perl script), I cannot use the filename Renormalization-Перенормовка.bcf, for example. The use of a UTF-8 system locale is still in beta, so a user might reasonably not want to use it by default, but might want a warning about it not being in use.

jccollins commented 1 month ago

By the way, there's a standard perl module Encode::Locale that finds the locale settings, but, as far as I can tell, more comprehensively than CodePage. As far as I know Encode::Locale is part of a standard perl installation. It may be worth changing CodePage to use Encode::Locale as much as possible. I.e., CodePage's purpose is now to provide extra functionality, like setting the Windows CP for console output to UTF-8, and providing convenient utility subroutines for use in biber.

Encode::Locale finds the Windows code pages the same way as CodePage, but it is more general, since it also deals with non-UTF-8 locales on Unix systems. Of course, on linux and macOS, anything but a UTF-8 locale is surely unusual nowadays, unlike Windows.
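
A short sketch of what using Encode::Locale might look like. It checks at run time whether the module is present instead of assuming it; the variable names in the script are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $have_encode_locale = eval { require Encode::Locale; 1 } ? 1 : 0;
my @args;

if ($have_encode_locale) {
    no warnings 'once';
    # The module determines these encoding names when it is loaded.
    print "locale encoding:      $Encode::Locale::ENCODING_LOCALE\n";
    print "file system encoding: $Encode::Locale::ENCODING_LOCALE_FS\n";

    # "locale" is an encoding alias registered by Encode::Locale.
    # Each argument is copied first so that @ARGV itself stays untouched.
    @args = map { my $bytes = $_; decode(locale => $bytes) } @ARGV;
}
else {
    print "Encode::Locale is not installed\n";
}
```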

jccollins commented 1 month ago

But it's maybe best to leave well alone.