Add script to regen unicode file

ccleve commented 1 year ago

Addresses #235, #423

This seems to work. The format of the new file is a little different: the char classes are sorted, and single-char ranges are replaced by just a single char. For example,

[\u1234-\u1234] -> [\u1234]
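
The two formatting changes described above (sorted classes, single-char ranges collapsed) can be sketched roughly as follows. This is only an illustration, not the script from this PR: the function names `escape` and `format_char_class`, the lowercase hex digits, and the `\U` form for code points above U+FFFF are all assumptions.

```python
def escape(cp):
    """Render one code point as an escape (hypothetical helper, assumed format)."""
    # Assumption: 4-digit \u for the BMP, 8-digit \U above it, lowercase hex.
    return f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"


def format_char_class(ranges):
    """Render inclusive (lo, hi) code point pairs as one character class."""
    parts = []
    for lo, hi in sorted(ranges):  # sorted output, as described above
        if lo == hi:
            parts.append(escape(lo))  # single-char range becomes a single char
        else:
            parts.append(f"{escape(lo)}-{escape(hi)}")
    return "[" + "".join(parts) + "]"


print(format_char_class([(0x1234, 0x1234)]))
print(format_char_class([(0x61, 0x7A), (0x41, 0x5A)]))
```

The first call collapses the degenerate range, and the second shows the ranges coming out in ascending order regardless of input order.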

skvadrik commented 1 year ago

I have regenerated unicode_categories.re and the tests with the old Haskell script: https://github.com/skvadrik/re2c/commit/e3ec259703ab47008b1e4996317ad23b37131921, and I can already see that the nontrivial changes to the character ranges are the same as in the Python script. What remains is to sort the classes and fix single-character classes. I'll try to do that ASAP, but I'm traveling for the next few days, so it may have to wait a bit.

ccleve commented 1 year ago

I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

skvadrik commented 1 year ago

> I'm really struggling to figure out how to rebase and get rid of the extra commit. Apparently something I did closed this PR. Git is such a walking disaster...

It can be very confusing. GitHub auto-closed the PR because you pushed a commit saying "Merge pull request https://github.com/ccleve/re2c/pull/1 from skvadrik/master" (it's a GitHub feature, not a git one). I wonder if it can be configured in settings (to stop GitHub from being "smart" and closing PRs / bugs based on keywords).

So what you need to do now to get nice linear history without merge commits is:

git rebase -i HEAD~3
In the editor, you will see the three latest commits: 1) https://github.com/skvadrik/re2c/pull/425/commits/0a8aa3b084e288c27b129175a40ae67956aa5ca7, 2) Merge pull request https://github.com/ccleve/re2c/pull/1 from skvadrik/master and 3) Merge branch 'master' of https://github.com/ccleve/re2c.
You need to squash commits 2 and 3 into 1.
To do that, replace the word "pick" in front of commits 2 and 3 with "fixup" (it is the same as "squash", but it does not try to merge commit messages and keeps only the first one).
Save and exit the editor (git should say "successfully rebased...").
Then git push -f (force-push, as you are rewriting history, which is not allowed by default as it is a destructive operation).

And after I merge my remaining work (sort + fix dupes), you will need to rebase your work on top of mine:

git pull --rebase skvadrik master

ccleve commented 1 year ago

Thanks, but I'm not seeing the same commits after the git rebase command, and I can't take the time to figure this out.

I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

On another matter: it seems that it's not easy to get Python to spit out other unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse unicodedata.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.
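
For context on the limitation mentioned above: Python's standard `unicodedata` module exposes the general category via `unicodedata.category()`, but it has no accessor for properties such as Script or Word_Break. A minimal sketch of what the standard library can do, grouping code points into ranges per general category, might look like this. The function name `ranges_by_category` and its `limit` parameter are hypothetical, and the results depend on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`).

```python
import sys
import unicodedata
from collections import defaultdict


def ranges_by_category(limit=sys.maxunicode + 1):
    """Map each general category (e.g. 'Lu') to inclusive (lo, hi) ranges.

    Hypothetical helper: scans code points 0..limit-1 in ascending order.
    """
    result = defaultdict(list)
    for cp in range(limit):
        cat = unicodedata.category(chr(cp))
        runs = result[cat]
        if runs and runs[-1][1] == cp - 1:
            # Code point is adjacent to the last run: extend it.
            runs[-1] = (runs[-1][0], cp)
        else:
            # Otherwise start a new single-char run.
            runs.append((cp, cp))
    return dict(result)


ascii_ranges = ranges_by_category(0x80)
print(ascii_ranges["Lu"])  # uppercase letters within ASCII
```

Other properties would still need an external data source, such as the Unicode Character Database files or ICU, as the comment above suggests.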

skvadrik commented 1 year ago

> I recommend just closing or deleting this PR, and then copying and pasting my script into your repo. Life's too short to deal with git nonsense.

Ok, I'll see what I can do. Thanks for the script!

> On another matter: it seems that it's not easy to get Python to spit out other unicode character properties, like Word Break or Script. I haven't found a good module that can do them, and I don't want to parse unicodedata.txt myself. I ended up writing code in Java to generate these files because Java and ICU4J support is really, really good. I'm happy to contribute Java code to generate the files, although it will be a lot harder for users to use Java than Python.

Right, a Java script has the same problem as a Haskell script: it may be nontrivial to run (depending on the developer's environment).
