ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License
1.63k stars 404 forks source link

dune-project parser does not work if dune-project has a UTF-8 byte-order mark #9396

Open jonahbeckford opened 11 months ago

jonahbeckford commented 11 months ago

Edits

  1. Narrowed the issue to just supporting UTF-8 with BOM (ie. supporting the UTF-8 spec). Previously the original problem report was UTF-16 + BOM but as mentioned by @Alizter OCaml does not support UTF-16.

Expected Behavior

In PowerShell on Windows:

echo "(lang dune 3.12)" > dune-project
dune build

I would expect the project to build.

Actual Behavior

File "dune-project", line 1, characters 0-34:
1 | ��(lang dune 3.12)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error: Invalid first line, expected: (lang <lang> <version>)

The reason? Windows conventionally has a byte-order mark in its Unicode files. The built-in PowerShell (or 5.x or less) inserts the BOMs; newer Command Prompts do not.

> with-dkml file dune-project
dune-project: Unicode text, UTF-16, little-endian text, with CRLF line terminators

> with-dkml od -xc dune-project
0000000    feff    0028    006c    0061    006e    0067    0020    0064
        377 376   (  \0   l  \0   a  \0   n  \0   g  \0      \0   d  \0
0000020    0075    006e    0065    0020    0033    002e    0031    0032
          u  \0   n  \0   e  \0      \0   3  \0   .  \0   1  \0   2  \0
0000040    0029    000d    000a
          )  \0  \r  \0  \n  \0
0000046

Specifications

nojb commented 11 months ago
> with-dkml file dune-project
dune-project: Unicode text, UTF-16, little-endian text, with CRLF line terminators

If this is accurate, then the BOM is only the start; I don't think Dune is able to decode UTF-16 text at the moment.

rgrinberg commented 11 months ago

I imagine this is an issue for regular dune files as well. In any case, it seems reasonable for us to handle it.

jonahbeckford commented 11 months ago

Okay, there are two separate problems, both exposed on Windows. The first is UTF-16 encoding is not yet supported. The second is that BOM is not supported, even in UTF-8.

For example:

> with-dkml file dune-project
dune-project: Unicode text, UTF-8 (with BOM) text, with no line terminators

> with-dkml od -xc dune-project
0000000    bbef    28bf    616c    676e    6420    6e75    2065    2e33
        357 273 277   (   l   a   n   g       d   u   n   e       3   .
0000020    3231    0029
          1   2   )
0000023

>  dune build
File "dune-project", line 1, characters 0-19:
1 | (lang dune 3.12)
    ^^^^^^^^^^^^^^^^^^^
Error: Invalid first line, expected: (lang <lang> <version>)
nojb commented 11 months ago

We will need an UTF decoder. The best one is in the standard library, but only on 4.14 and later.

Alizter commented 11 months ago

Does OCaml even compile with utf16 .ml files? Most other compilers I know, gcc, rustc don't accept any other encoding than utf8. Only ones I am aware of are Microsoft compilers and Java. It doesn't seem like a bug but more like a feature request to accept dune files in utf16 encoding.

Also from a cross platform perspective, if you want the dune files to work on other platforms, anything other than utf8 seems like a foot gun.

I've read that powershell allows you to configure the default piping output from utf16 to utf8. Apparently that is also the default on newer versions.

I'm tempted to say that echo is not the correct tool in this case, and that you should be passing PS arguments fixing the encoding. If this is for scripts, then this is probably fine, but for users, I wouldn't recommend writing Dune files this way and simply using an editor would be an improvement.

In the mean time, this is a good opportunity to improve the error message and say that we only accept utf8. Also updating the docs is a good idea.

jonahbeckford commented 11 months ago

Does OCaml even compile with utf16 .ml files?

I never even considered that. After testing UTF-16 ... no, it does not. But neither does it work with BOM-encoded UTF-8 ... which is valid UTF-8.

Anyway, Dune should not need to support encodings that OCaml does not support. Sadly, I can't find a specification for what encoding OCaml supports! I vaguely remember some thread that OCaml doesn't support UTF-8 source files ... I think it was only ISO 8859-1 or some Latin variant.

Flow chart:

Aside: C has no spec for source code encoding, so gcc expecting UTF-8 makes sense. rustc was a better example. But more popular languages (Javascript, C#, Python and as you mentioned Java) do support other source code encodings.

nojb commented 11 months ago

Sadly, I can't find a specification for what encoding OCaml supports! I vaguely remember some thread that OCaml doesn't support UTF-8 source files ... I think it was only ISO 8859-1 or some Latin variant.

OCaml has no spec for source code encoding: source file contents are interpreted as raw bytes. There is however an ongoing discussion to require UTF-8: see https://github.com/ocaml/ocaml/pull/1802 and the links mentioned there.

jonahbeckford commented 11 months ago

Okay ... will have you all on the Dune team decide whether UTF-8 (including optional BOM) is what Dune supports.

rgrinberg commented 11 months ago

We'll support whatever OCaml does. I agree with your statement here:

Anyway, Dune should not need to support encodings that OCaml does not support