Switch generated unicode tables and functions to icu4x

raskad commented 1 year ago

With the latest release of icu4x, the project includes data in their crates. This makes it very easy to just use icu4x crates for unicode properties and folds. Specifically I have replaced most of the unicode property code and tables with icu_properties and most of the unicode folding with icu_casemap. The upsides are pretty clear; we do not have to maintain all of the tables / code generation. Furthermore this enables users of regress that also use the icu4x crates to deduplicate the data tables if they can use the same version of icu_*.

To make this work, I had to change some of the folding logic and add some probably inefficient parts / allocations in some places. While doing that, I was not really happy with the current input handling, as it makes folding either complicated or inefficient. This is one of the reasons why I left this as a draft.

To test that everything keeps working as expected I ran the full 262 test suite via boa. There are a few new tests passing and no new failures.

@ridiculousfish What are your thoughts on this change? I would like to go ahead with this and potentially follow this up with some more fixes for unicode properties and if I can find a way some refactors to folding and input handling.

BurntSushi commented 1 year ago

but I am happy to be out of the business of parsing ICU files

Note that you should be able to achieve this end with the ucd-generate tool. It's what the regex crate uses. It indeed moves all case folding to regex-compile time. No case intensive matching is performed at search time. It doesn't solve the deduplication problem though.

ridiculousfish commented 1 year ago

Nice, maybe I can teach ucd-generate to emit our fold table. It uses some bitpacking techniques I learned from JSC to exploit patterns in Unicode character tables; this reduces the size to 1.6KB, compared to ~12KB for a naïve table.

raskad commented 1 year ago

Nice, maybe I can teach ucd-generate to emit our fold table. It uses some bitpacking techniques I learned from JSC to exploit patterns in Unicode character tables; this reduces the size to 1.6KB, compared to ~12KB for a naïve table.

I'm currently working on making the unfolding work with icu4x, but the way it currently works is a bit hacky. Sinde the fold table is not that big, maybe we can keep it for now. I will keep you updated. I'm also testing out if it makes sense to move the unfolding into the regular regex creation. That way we would always do it in the compiled regex and we could get rid of all the folding related branching at match time.

ridiculousfish commented 1 year ago

Ugh, our case folding has a serious bug :/ I need to fix it.

raskad commented 1 year ago

I pushed a version that implements the unfolding.

Ugh, our case folding has a serious bug :/ I need to fix it.

Which bug do you mean? I have not noticed any folding tests in the 262 suite failing, but I have to take a look again anyways.

ridiculousfish commented 1 year ago

So regress has a compact representation for Unicode's simple case folding. I made a mistake in the representation causing some characters to fold to the wrong value - if you want to see the gristly details check out 56233d6bd8c8795d4c8f01329ead0c0f1bc804e0 . I wrongly believed that the main test suite (which is substantial) would have caught this. I've since greatly improved the unit tests in this area.

ucd-generate looks ideal. My plan is to attempt to upstream our compact fold table representation into ucd-generate and then leverage that.

BurntSushi commented 1 year ago

ucd-generate looks ideal. My plan is to attempt to upstream our compact fold table representation into ucd-generate and then leverage that.

Perfect! I'll look into using it for regex-syntax as well. :-)

I wrongly believed that the main test suite (which is substantial) would have caught this.

It is amazing how many times I have wrongly believed this too.

raskad commented 11 months ago

Closing this in favor of using the ucd-* crates like mentioned in the discussion.

ridiculousfish / regress

Switch generated unicode tables and functions to icu4x #70