skvadrik / re2c

Lexer generator for C, C++, Go and Rust.
https://re2c.org

Add script to regen unicode file #425

Closed ccleve closed 1 year ago

ccleve commented 1 year ago

Addresses #235, #423

This seems to work. The format of the new file is a little different: the character classes are sorted, and single-character ranges are replaced by just the single character. For example,

[\u1234-\u1234] -> [\u1234]
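The formatting change described above can be illustrated with a minimal Python sketch. This is not the script from this PR: the helper names `category_ranges`, `fmt_cp` and `format_class` are hypothetical, and the `\uXXXX` / `\UXXXXXXXX` escape widths are an assumption based on the example above. It relies only on the standard `unicodedata` module.

```python
import sys
import unicodedata

def category_ranges(category):
    """Return merged (lo, hi) code point ranges for one general category."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) != category:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [(lo, hi) for lo, hi in ranges]

def fmt_cp(cp):
    # Assumed escape style: 4 hex digits in the BMP, 8 above it.
    return f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"

def format_class(ranges):
    """Format ranges as a sorted character class, collapsing lo == hi."""
    parts = []
    for lo, hi in sorted(ranges):
        parts.append(fmt_cp(lo) if lo == hi else f"{fmt_cp(lo)}-{fmt_cp(hi)}")
    return "[" + "".join(parts) + "]"

# Zl (line separator) contains a single code point, U+2028.
print("Zl =", format_class(category_ranges("Zl")) + ";")
```

Note that `unicodedata` reflects the Unicode version bundled with the Python interpreter in use, so the output depends on which Python runs the script.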

skvadrik commented 1 year ago

I have regenerated unicode_categories.re and the tests with the old Haskell script: https://github.com/skvadrik/re2c/commit/e3ec259703ab47008b1e4996317ad23b37131921, and I can already see that the nontrivial changes to the character ranges are the same as in the Python script. What remains is to sort and fix single-character classes. I'll try to do that ASAP, but I'm traveling for the next few days, so it may have to wait a bit.

ccleve commented 1 year ago

I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

skvadrik commented 1 year ago

> I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

It can be very confusing. GitHub auto-closed the PR because you pushed a commit whose message says "Merge pull request https://github.com/ccleve/re2c/pull/1 from skvadrik/master" (this is a GitHub feature, not a Git one). I wonder if it can be configured in the settings, to stop GitHub from being "smart" and closing PRs / bugs based on keywords.

So what you need to do now to get nice linear history without merge commits is:

And after I merge my remaining work (sort + fix dupes), you will need to rebase your work on top of mine:

ccleve commented 1 year ago

Thanks, but I'm not seeing the same commits after the git rebase command, and I can't take the time to figure this out.

I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

On another matter: it seems that it's not easy to get Python to spit out other Unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse UnicodeData.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.

skvadrik commented 1 year ago

> I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

Ok, I'll see what I can do. Thanks for the script!

> On another matter: it seems that it's not easy to get Python to spit out other Unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse UnicodeData.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.

Right, a Java program has the same problem as a Haskell script: it may be nontrivial to run (depending on the developer's environment).
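For reference, the Unicode Character Database files that carry properties such as Script and Word Break (Scripts.txt, WordBreakProperty.txt) use a simple line format: a code point or `lo..hi` range, a semicolon, the property value, and an optional `#` comment. The sketch below shows one way such lines could be parsed in plain Python. The function name `parse_ucd_property_lines` is hypothetical, the sample is a hand-abbreviated excerpt for illustration only, and it is untested against a full UCD release.

```python
from collections import defaultdict

def parse_ucd_property_lines(lines):
    """Map each property value to a sorted list of (lo, hi) code point ranges."""
    table = defaultdict(list)
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue                           # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        cp_field, value = fields[0], fields[1]
        lo, _, hi = cp_field.partition("..")   # single code point has no ".."
        table[value].append((int(lo, 16), int(hi or lo, 16)))
    return {value: sorted(ranges) for value, ranges in table.items()}

# Abbreviated sample in the Scripts.txt line format (illustration only).
sample = """\
# sample data
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
""".splitlines()

scripts = parse_ucd_property_lines(sample)
for name, ranges in scripts.items():
    print(name, [(hex(lo), hex(hi)) for lo, hi in ranges])
```

The resulting ranges have the same `(lo, hi)` shape as a general-category table, so they could feed the same formatting step that produces the character classes.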