"malformed character stream"

Drup commented 9 years ago

On cairo2.0.4.6 :

foobar/cairo2.0.4.6/_build/src/cairo/index.xml line 1071 column 1300:
malformed character stream

Weirdly, it's only a warning when processing cairo and turns into a crash when processing lilis (which uses cairo2).

dsheets commented 9 years ago

This is probably an issue for lpw25/doc-ock-xml. What's in the file near that location? Which declaration from Cairo?

Drup commented 9 years ago

It corresponds to this: https://github.com/Chris00/ocaml-cairo/blob/master/src/cairo.mli#L2548-L2558

Note the é.

Drup commented 9 years ago

Also note that the file is encoded in latin1 ....

dsheets commented 9 years ago

I think the right approach here is to interpose on every string leaving cmi and cmti files in https://github.com/lpw25/doc-ock-lib/blob/master/src/docOckCmi.ml and https://github.com/lpw25/doc-ock-lib/blob/master/src/docOckCmti.ml. The interposed function will ensure the strings are UTF-8 encoded.

The cmi/cmti reading functions would accept a new optional parameter of which encoding to read. The default value would be UTF-8.

The interposed normalization would do the following:

Attempt to decode the string from provided encoding. If this succeeds, return UTF-8 encoded string of same.
If decoding fails, raise a new CharsetDecodeError exception.

Users of the library may then define their own heuristics for decoding strings. I think a sensible approach is to attempt UTF-8, then attempt ISO 8859-1 (latin-1), then fail.

I suggest we adopt uutf for this task. I'm sure @dbuenzli has an opinion on this topic, judging from the Mantis discussions regarding valid OCaml character sets.

dbuenzli commented 9 years ago

I'm not sure I understand at which level all this is happening but in general things like:

Users of the library may then define their own heuristics for decoding strings.

are rather annoying (everybody has to solve the problem and solves it differently). If would rather suggest that if the optional argument is unspecified you do the heuristics you mention. And if it specified you decode according to the provided encoding and the library should always return UTF-8 encoded strings.

It's quite clear that the dev team agrees that we need to move out of latin1-encoded sources and replacing it with something different than UTF-8 seems out of the question so I don't see any value in letting the user define their own heuristics for decoding strings.

dsheets commented 9 years ago

I agree with making the UTF-8 then latin-1 then fail heuristic the default.

Does this entire plan sound sensible to you, @lpw25?

dsheets commented 9 years ago

Subtler plan:

If the string is not in a comment, attempt latin-1. If success, return UTF-8. If failure, fail parse with CharsetDecodeError.

If the string is in a comment, use previously described UTF-8 then latin-1 then fail heuristic.

dsheets commented 9 years ago

Only the behavior of the comment parsing is modifiable by user parameter, the non-comment parsing is fixed at latin-1 -> UTF-8 with fail.

dbuenzli commented 9 years ago

Is there any concrete advantage to the subtler plan ? I see only added complexity both at the implementation and documentation level.

Drup commented 9 years ago

Don't forget to consider https://github.com/whitequark/ocaml-m17n in your plan. Regardless of the fact that it will be popular or not (I doubt it), a future proof approach should handle it.

lpw25 commented 9 years ago

I'm happy to do either the first plan or the subtler plan. Although, since we don't deal with strings, only identifiers and comments, which are kind of specified as being latin-1 (for now), I think a reasonable solution would be to always assume latin-1. We could add a deprecation warning to encourage people to avoid such things.

If/when OCaml supports UTF-8 explicitly, we can change the assumed encoding.

dbuenzli commented 9 years ago

which are kind of specified as being latin-1 (for now)

Strictly speaking they are officially deprecated since 4.01.0.

dsheets commented 9 years ago

I think people use UTF-8 in comments, today, because OCaml doesn't care but I may be mistaken.

The concrete advantage of the subtler plan is that identifiers use only the valid character set as defined by the compiler and comments are a separate matter as they are textual documentation. This separation makes preventing mixed encodings easier (see below).

I have thought of two related issues with the default-fallback first plan:

The fallback heuristic would be per string and so would not be replicable by the user who can only pass a single encoding value, not a function.
The fallback heuristic would be per string and so mixed latin-1 + UTF-8 encodings would be admissable (and I don't think we should encourage that).

If we permit the user to supply a UTF-8 coercion function, 1 disappears but 2 becomes worse.

If we go with the original try-an-encoding-then-fail design, 2 is impossible but using OCaml-standard latin-1 in identifiers but UTF-8 in comments is also impossible. If we force identifiers to use latin-1 and use the single-encoding-attempt-per-call design, we only admit latin-1 identifiers (some legacy code bases use them) and consistent UTF-8 or latin-1 comments (or the Japanese or Russian non-Unicode encodings in future).

I think we should warn/error on mixed encodings. I'd rather not even permit latin-1 identifiers and UTF-8 comments.

Consistent plan:

User parameter for encoding, default is UTF-8.
Identifiers and comments are checked together.
If encoding fails, raise.
Provide fallback helper that implements try .. with heuristic around lower (current) function.

This way, only consistent UTF-8 or consistent latin-1 files are accepted and users don't have to write their own heuristic. UTF-8 will be the default and latin-1 will be the fallback.

lpw25 commented 9 years ago

I think people use UTF-8 in comments, today, because OCaml doesn't care but I may be mistaken.

Yeah I guess they probably do.

I think that identifiers should always be treated according to OCaml's rules (so for now latin-1).

For comments, how about we avoid heuristics and instead provide a "@encoding" tag that can go in the first comment of the module -- to specify the encoding used in docstrings in the file. Possibly we could assume latin-1 but give a warning if the encoding is not given explicitly.

(For ocaml-m17n we could support a "@identifier-encoding" tag that ocaml-m17n could add automatically to treat identifiers with the wrong encoding).

dbuenzli commented 9 years ago

Yeah I guess they probably do.

Sure I do that all the time, if only to be able to write my name correctly.

For comments, how about we avoid heuristics and instead provide a "@encoding" tag that can go in the first comment of the module -- to specify the encoding used in docstrings in the file. Possibly we could assume latin-1 but give a warning if the encoding is not given explicitly.

Please don't do that. There's no need to complicate all of this. I personally don't want to go over all my modules to add that directive. Especially since UTF-8 is the way forward and you can "reliably" distinguish between UTF-8 and latin1 by first trying UTF-8 and then fallback to latin1. There's little chance that OCaml will mandate any other encodings in the future and people already naturally UTF-8 encode their files to be able to write UTF-8 string literals.

Personally I wouldn't bother about treating identifiers and comments differently and implement the non subtle proposal. There's no way you should get mixed encoding cmt files since most editors won't allow you to edit that anyways.

Keep these things as simple as possible (and enforce an appropriate standard along the way).

dbuenzli commented 9 years ago

I think that identifiers should always be treated according to OCaml's rules (so for now latin-1).

And in the non-subtle implementation if you really want to enforce that you can simply check that identifiers are US-ASCII only in UTF-8 mode (but not doing it would support ocaml-m17n out of the box).

Drup commented 9 years ago

Yeah I guess they probably do.

I know I do, and my math heavy code would be much harder to understand without it.

ocaml-doc / doc-ock

"malformed character stream" #43