Wrong regex for characters

MrEbbinghaus commented 2 years ago

The current regex for characters doesn't match all Clojure characters.

Character { "\\" (std.asciiLetter | std.digit | "@")+ }
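To see the gap concretely, the token can be approximated with a JavaScript regex. This is only a sketch: it assumes `std.asciiLetter` covers `a-z`/`A-Z` and `std.digit` covers `0-9`, and the name `currentCharacter` and the sample literals are made up for illustration.

```javascript
// Rough JavaScript approximation of the current grammar token (an assumption,
// not the grammar itself): a backslash followed by ASCII letters, digits, or "@".
const currentCharacter = /^\\[A-Za-z0-9@]+$/;

// Sample Clojure character literals; String.raw keeps the backslashes literal.
const samples = [
  String.raw`\a`,
  String.raw`\newline`,
  String.raw`\(`,
  String.raw`\;`,
  String.raw`\ä`,
];

// Collect the literals that the approximation does not match as a whole.
const missed = samples.filter((s) => !currentCharacter.test(s));
console.log(missed);
```

Under these assumptions, the punctuation and non-ASCII literals end up in `missed`, while `\a` and `\newline` are matched.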

After the \, Clojure allows:

- any single non-whitespace character from the Unicode BMP (Basic Multilingual Plane)
- exactly these keywords: newline, space, tab, formfeed, backspace, return
- Unicode escapes
  - as octals from \o0 to \o377
  - as hex from \u0000 to \uFFFF

Here you can find the code for the clojure.tools.reader implementation: https://github.com/clojure/tools.reader/blob/6bc1352113f7154b6e47b7941ab55f0c5e90517b/src/main/cljs/cljs/tools/reader.cljs#L140-L181

The (JavaScript) Regex for this is:

/\\(o[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}|newline|space|tab|formfeed|backspace|return|\S)/
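A quick way to sanity-check the regex is to run it against a few literals. The sketch below adds `^` and `$` anchors, which are not part of the regex above, so that each sample has to match as a whole; the names `clojureCharacter`, `accepted`, and `rejected` and the sample values are illustrative only.

```javascript
// The proposed regex, with ^ and $ added (an assumption for this check only)
// so that a sample counts as matched only if the entire string is one literal.
const clojureCharacter =
  /^\\(o[0-3]?[0-7]{1,2}|u[0-9a-fA-F]{4}|newline|space|tab|formfeed|backspace|return|\S)$/;

// Literals that should be accepted; String.raw keeps the backslashes literal.
const accepted = [
  String.raw`\a`,
  String.raw`\(`,
  String.raw`\newline`,
  String.raw`\o377`,
  String.raw`\u00E4`,
];

// Strings that should be rejected: octal out of range, a keyword with a
// trailing letter, and a backslash followed by a space.
const rejected = [String.raw`\o400`, String.raw`\newlines`, "\\ "];

const allAccepted = accepted.every((s) => clojureCharacter.test(s));
const noneRejectedMatch = rejected.every((s) => !clojureCharacter.test(s));
console.log({ allAccepted, noneRejectedMatch });
```

Without the anchors, the regex would still find a match inside `\o400` (the prefix `\o40`), so how the surrounding tokenizer delimits a character literal matters as much as the pattern itself.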

I don't think this fixes https://github.com/nextjournal/clojure-mode/issues/9, and you should definitely look into this issue.

mk commented 2 years ago

Hey, thanks for looking into this!

I think we'd need to fix this in the lezer grammar around https://github.com/lezer-parser/clojure/blob/172cf311376271a95986978e7041cb7dbd3fdd57/src/clojure.grammar#L114.

Feel free to open a PR against that if you'd like to look into this. Otherwise we'll look into it, but I think it will take us a while. Thanks again.

MrEbbinghaus commented 2 years ago

What about https://github.com/nextjournal/clojure-mode/blob/master/src/nextjournal/clojure_mode/clojure.grammar ?

mk commented 2 years ago

That’s a copy from lezer-clojure, only used as a dev affordance, but it would be best to fix it upstream, including tests, and then update it here in a second step.

MrEbbinghaus commented 2 years ago

@mk I opened a PR for this one: https://github.com/lezer-parser/clojure/pull/17

It also fixes https://github.com/nextjournal/clojure-mode/issues/9

nextjournal / clojure-mode

Wrong regex for characters #22