Closed raskad closed 11 months ago
but I am happy to be out of the business of parsing ICU files
Note that you should be able to achieve this end with the ucd-generate
tool. It's what the regex
crate uses. It indeed moves all case folding to regex-compile time. No case intensive matching is performed at search time. It doesn't solve the deduplication problem though.
Nice, maybe I can teach ucd-generate
to emit our fold table. It uses some bitpacking techniques I learned from JSC to exploit patterns in Unicode character tables; this reduces the size to 1.6KB, compared to ~12KB for a naïve table.
Nice, maybe I can teach
ucd-generate
to emit our fold table. It uses some bitpacking techniques I learned from JSC to exploit patterns in Unicode character tables; this reduces the size to 1.6KB, compared to ~12KB for a naïve table.
I'm currently working on making the unfolding work with icu4x
, but the way it currently works is a bit hacky. Sinde the fold table is not that big, maybe we can keep it for now. I will keep you updated.
I'm also testing out if it makes sense to move the unfolding into the regular regex creation. That way we would always do it in the compiled regex and we could get rid of all the folding related branching at match time.
Ugh, our case folding has a serious bug :/ I need to fix it.
I pushed a version that implements the unfolding.
Ugh, our case folding has a serious bug :/ I need to fix it.
Which bug do you mean? I have not noticed any folding tests in the 262 suite failing, but I have to take a look again anyways.
So regress has a compact representation for Unicode's simple case folding. I made a mistake in the representation causing some characters to fold to the wrong value - if you want to see the gristly details check out 56233d6bd8c8795d4c8f01329ead0c0f1bc804e0 . I wrongly believed that the main test suite (which is substantial) would have caught this. I've since greatly improved the unit tests in this area.
ucd-generate
looks ideal. My plan is to attempt to upstream our compact fold table representation into ucd-generate
and then leverage that.
ucd-generate
looks ideal. My plan is to attempt to upstream our compact fold table representation intoucd-generate
and then leverage that.
Perfect! I'll look into using it for regex-syntax
as well. :-)
I wrongly believed that the main test suite (which is substantial) would have caught this.
It is amazing how many times I have wrongly believed this too.
Closing this in favor of using the ucd-*
crates like mentioned in the discussion.
With the latest release of icu4x, the project includes data in their crates. This makes it very easy to just use icu4x crates for unicode properties and folds. Specifically I have replaced most of the unicode property code and tables with
icu_properties
and most of the unicode folding withicu_casemap
. The upsides are pretty clear; we do not have to maintain all of the tables / code generation. Furthermore this enables users ofregress
that also use theicu4x
crates to deduplicate the data tables if they can use the same version oficu_*
.To make this work, I had to change some of the folding logic and add some probably inefficient parts / allocations in some places. While doing that, I was not really happy with the current input handling, as it makes folding either complicated or inefficient. This is one of the reasons why I left this as a draft.
To test that everything keeps working as expected I ran the full 262 test suite via
boa
. There are a few new tests passing and no new failures.@ridiculousfish What are your thoughts on this change? I would like to go ahead with this and potentially follow this up with some more fixes for unicode properties and if I can find a way some refactors to folding and input handling.