Question: Unicode support

talyz / fromElisp

An Emacs Lisp reader in Nix.

MIT License

Question: Unicode support #1

Closed terlar closed 4 years ago

terlar commented 4 years ago

I see that you mention that Nix doesn’t have Unicode support, but as I understand it, it could just pass those bytes through to wherever. What kind of issues will be caused by Unicode characters within the parsed files?

If this is not the case, perhaps it is a valid use case and it can be raised here: https://github.com/NixOS/nix/issues/770

talyz commented 4 years ago

The issue is that the regexes don't match Unicode characters, at least not the more specific ones like matchCharacter (https://github.com/talyz/fromElisp/blob/master/default.nix#L53), which should be able to match a single Unicode character but can't. Unicode should be fine in places where the match is generic enough that all bytes of a character are matched, though.

terlar commented 4 years ago

I see. I ask because my config does have Unicode chars in a few places. As I understood it, you couldn't use this if any Unicode chars were present. I wonder if I could still use this, but I guess I just have to try it.

terlar commented 4 years ago

Okay, so I did some tests and got it to work. The only issue is when you use Unicode characters together with the char specifier ?, e.g. ?λ. It worked when I either used the numeric representation of the char instead or wrapped the char in a string.
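For reference, here is a rough Python sketch of the two workarounds for a given character. The helper name `elisp_char_workarounds` is made up for illustration and is not part of fromElisp; it only computes the spellings and does no escaping for characters like " or \.

```python
def elisp_char_workarounds(char: str) -> dict:
    """Return alternative spellings for an Emacs Lisp character literal.

    Hypothetical helper for illustration only; not part of fromElisp.
    """
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return {
        # The form that was reported not to parse for non-ASCII characters.
        "literal": "?" + char,
        # The code point as a decimal integer; in Emacs Lisp a character
        # literal evaluates to this same integer.
        "numeric": str(ord(char)),
        # The character wrapped in a string (a different Lisp type).
        "string": '"' + char + '"',
        # Number of bytes the character occupies when encoded as UTF-8.
        "utf8_bytes": len(char.encode("utf-8")),
    }


print(elisp_char_workarounds("λ"))
```

The numeric form is a drop-in replacement, since ?λ and 955 are the same value in Emacs Lisp. The string form changes the type, so it only works where the consuming code accepts a string.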

But perhaps it is possible to fix the parser to work with ?X unicode chars, or would that make things too tricky?

talyz commented 4 years ago

I don't think it's possible to make it work, no. The relevant part of the regex matches "any character but ], [, \, (, or )" and then looks for a delimiter that is not part of the token. Strings and comments work fine since they match everything until they hit a delimiter - " for strings and \n for comments.
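To illustrate the difference, here is a small Python sketch that mimics byte-wise matching by running the regexes over UTF-8 bytes. It assumes the matching works one byte at a time, as described above. The patterns `CHAR_LITERAL` and `STRING_LITERAL` are simplified stand-ins written for this example, not the actual regexes from default.nix.

```python
import re

# Simplified stand-in: "?" followed by exactly one unit that is not
# ], [, \, ( or ), followed by a delimiter or the end of input.
CHAR_LITERAL = rb"\?[^\]\[\\()](?:[\s()\[\]]|$)"

# Simplified stand-in: a quote, anything up to the next quote, a quote.
STRING_LITERAL = rb'"[^"]*"'


def matches_char_literal(source: str) -> bool:
    """Match on UTF-8 bytes, so each regex unit is a single byte."""
    return re.match(CHAR_LITERAL, source.encode("utf-8")) is not None


def matches_string_literal(source: str) -> bool:
    """The generic pattern consumes every byte until the closing quote."""
    return re.match(STRING_LITERAL, source.encode("utf-8")) is not None


print(matches_char_literal("?a "))    # True: "a" is one byte
print(matches_char_literal("?λ "))    # False: "λ" is two bytes in UTF-8
print(matches_string_literal('"λ"'))  # True: both bytes are consumed
```

In the ?λ case the character class consumes only the first byte of λ, so the second byte sits where the delimiter is expected and the match fails. The string pattern has no such single-unit constraint, which is why Unicode inside strings and comments goes through.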

terlar commented 4 years ago

Okay, makes sense. I guess that is a fine enough trade-off. Thank you for the explanation!