feature request: Potential removal of the Regex crate

adriandelgado commented 3 months ago

Feature Request

While it is true that Rust has one of the fastest Regex libraries available, nothing beats pure string manipulation. In other languages like JavaScript and Python this is not feasible because manipulating strings directly would be too slow.

While reading #106 it occurred to me that this crate would be even faster, quicker to compile, and would support all of the features that the current regex feature flag supports without bloating the resulting binaries if we translate every regex usage to pure Rust code.

In Rust, there's only two disadvantages of doing this:

The initial development time-cost
The difference in maintainability compared to the other libraries you maintain (they would still be using regexes)

Number one its easy to solve because I would volunteer the work to do it. But number two depends on your judgement.

I think it is worth it to reduce the dependency count, the compilation times, and to increase the runtime performance. I also think that the current regexes do not change very often because they've had time to mature after all these years. That means is not that much of a maintainability burden in my opinion.

What do you think? Is this something you would be interested in pursuing?

adriandelgado commented 3 months ago

Vanilla Rust is enough but I also propose the usage of the winnow crate to produce clearer code.

jmcnamara commented 3 months ago

While reading #106 it occurred to me that this crate would be even faster, quicker to compile, and would support all of the features that the current regex feature flag supports without bloating the resulting binaries if we translate every regex usage to pure Rust code.

This thought has crossed my mind. The C version of the library (libxlsxwriter) doesn't use regexes and instead uses reasonably simple hand rolled parsing code. So it is feasible.

However, just to take a step back. What savings in compile time and binary size would a change like this give us? Is there a way to estimate that without mocking out the regex code?

Also, what about a middle ground of using regex-lite from #106 and where I add workarounds for the Unicode limitations?

And finally, regex or regex-lite could still be used as a dev-dependency for the test code, right?

adriandelgado commented 3 months ago

There are some improvements just by changing to regex-lite as you can read in their motivation. TLDR: regex-lite compiles around 60% faster and its size is around 80% lighter (471KB less). I can't thing of a way to measure the impact of this proposed change to this crate specifically without implementing it for real.

Right now, we are using the default features of the regex crate: "std", "perf", "unicode", "regex-syntax/default". The above benchmark only used the "std" feature.

I would expect that the compilation times and binary size reductions would be even greater if we use hand rolled code. Also, regex-lite is not as fast as regex, while hand rolled code would be. Because of those points I would not recommend the middle ground. The standard library has a lot of care when dealing with Unicode stuff so I would not worry about those limitations if we were to go ahead with my proposal.

About dev dependencies, we can do as we please haha. There's no impact to user facing code. I recommend keeping the regex crate in that case to avoid any edge cases.

adriandelgado commented 3 months ago

Check out these benchmarks also: https://github.com/BurntSushi/rebar regex-lite is the second fastest to compile but has the slowest runtime performance, way slower than regex.

jmcnamara commented 3 months ago

Also, regex-lite is not as fast as regex

That isn't a major concern since the regexes aren't used on the fast path. However overall I think you are right that if we are going to replace it with something then it would be best to remove it altogether.

Check out these benchmarks also: https://github.com/BurntSushi/rebar

Lol. Some work companions wrote/work on one of the better libraries in that benchmark.

I can't think of a way to measure the impact of this proposed change to this crate specifically without implementing it for real.

Probably it is just worth making the changes and seeing what the resulting size/time difference is.

Some of the regexes are simple and can be replaced with String/str starts_with(), contains() or matches() but some will be trickier.

Do you want me to kick it off on a branch and start with some of the lower hanging regexes or do you just want to jump in?

adriandelgado commented 3 months ago

I prefer If you start with the low hanging ones. I plan to start working on the others this thursday because I have some deadlines before that.

jmcnamara commented 3 months ago

@adriandelgado I've created a branch called no_regex and I've started to work through the files/modules that use regexes one by one. I'll add update as I go:

6aee84e: url.rs

If you want to review as I go then please do.

jmcnamara commented 3 months ago

Here is the replacement for the xmlwriter:

0adfbab

I'm starting to regret this now. :-)

Update: also for the lib tests:

1da4b0d

jmcnamara commented 3 months ago

I've converted the utility.rs file as well. It is starting to take shape:

8caaeea

I had to comment out 2 tests for worksheet name quoting that contain emoji characters. These aren't very important since the default will be to quote the worksheet names that contain the emoji characters. While this isn't strictly correct it isn't an error in Excel.

To work around this would require a match for all the emoji Unicode characters:

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AEmoji%3DYes%3A%5D&esc=on&g=&i=

This is 1,424 code points to match against. Some (many) could be condensed into ranges (the above tool does that with "Abbreviate"):

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AEmoji%3DYes%3A%5D&abb=on&esc=on&g=&i=

@adriandelgado can you think of an efficient way of matching that number of characters or ranges? It doesn't need to be super efficient because it is used in a function that won't be called often (or at all). Maybe a match statement will be sufficient.

A harder problem will be escaping the Excel "future" functions:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/formula.rs#L995

These could be on a fast path when writing a lot of formulas so I need to take care to make it efficient.

jmcnamara commented 3 months ago

I have converted chart.rs as well:

2185a21

jmcnamara commented 3 months ago

2 more components completed:

739142f: datetime
cc582a3: header images

I am reminded about how when I was young I wanted to build a lego helicopter but I only had horizontal rotation pieces for wheels and I didn't have a vertical rotation piece I could use for the rotors. So instead I came up with ways of using the horizontal piece vertically. That is what this exercise is starting to feel like.

Anyway, I'm almost there. I'll finish off the last regex replacement in the formula.rs file and then we will see if this refactoring was worth it in terms of compilation time/size.

adriandelgado commented 3 months ago

Sorry I haven't answered, I'm going to check out the changes thoroughly in a couple of days. In the mean time you can check out how the regex crate matches emojis and other similar stuff (they just use a table) https://github.com/rust-lang/regex/blob/ab88aa5c6824ebe7c4b4c72fe5191681783b3a68/regex-syntax/src/unicode_tables/property_bool.rs#L4419

Also, to match a lot of fixed strings very efficiently the regex crate uses Aho-Corasick. rust_xlsxwriter already has this crate as a dependency due to using regex but it is lighter weight (it just matches fixed strings). You can also try to use:

if matches!(haystack, "some_string_1" | "some_string_2" | "some_string_3" | ... | "some_last_string") {
    // ...
}

Edit: never mind, the matches! suggestion would be somewhat inefficient.

adriandelgado commented 3 months ago

I'm writing some code suggestions in some of the commits in the no_regex branch. In the case you accept the changes suggested, I would like to add them myself in a PR.

adriandelgado commented 3 months ago

I just took a look at formula.rs and the aho-corasick crate is definitively the way to go. It is very performant and it is also what regex uses underneath each time you build a regex like r"str1|str2|...|strn".

jmcnamara commented 3 months ago

I'm writing some code suggestions in some of the commits in the no_regex branch. In the case you accept the changes suggested, I would like to add them myself in a PR.

Sounds good. I hope to get my side of the changes done by Thursday/Friday. You can jump in then.

jmcnamara commented 3 months ago

I just took a look at formula.rs and the aho-corasick crate is definitively the way to go. It is very performant and it is also what regex uses underneath each time you build a regex like r"str1|str2|...|strn".

Agreed. I had a look at it and it is more or less perfect. I'll add it as a test although I may just go ahead and write a formula parser anyway. There is another area where I would need that in the future so it should be worth the effort.

jmcnamara commented 3 months ago

I've pushed the last piece of the refactoring for the formula.rs module:

f092a05

Note, this needs some refactoring and some optimization which I will work on later, so watch out for a force push.

Anyway, the no_regex branch is now regex clean. The build times look more or less the same:

cargo clean

sleep 2

time cargo build

# v0.74.0:
real    0m8.531s
user    0m13.689s
sys     0m2.363s

# no_regex branch:
real    0m8.070s
user    0m13.685s
sys     0m2.330s

The hello_world exe size is halved (although this is really just a delta which should be the same any sized app):

	v0.74.0	no_regex
Debug	9.2M	4.2M
Release	3.4M	1.6M

@adriandelgado or @dodomorandi could you maybe test as well to see if you get similar results.

Also, @adriandelgado could you check if I got the OnceLock initialization right. It works but if don't know if it should or could be global: https://github.com/jmcnamara/rust_xlsxwriter/blob/no_regex/src/formula.rs#L1048

adriandelgado commented 3 months ago

Maybe the build time didn't change significantly because cargo was already compiling the regex crate parallel to some other crate. The executable size reduction is significant though. Also the usage of pure string manipulation opens the door for more optimizations in the future.

About static variables: They are always "global" but not always accesible. FUTURE_FUNCTIONS is only possible to be referenced from the future_functions fn but it is stored in the binary as any other global. Just the ability to be referenced/named is scoped. The way you are using it is fine.

About OnceLock usage: you can use .get(function) right after calling get_or_init. There's no need to unwrap later because get_or_init always returns a valid reference. The closure inside get_or_init is guarantied to be called only once.

In formula.rs line 973: Careful with formula.chars().enumerate(). You most likely want to use char_indices. An unicode char could span several bytes in a utf-8 string.

Also, It seems like you didn't need to use aho-corasick. Good stuff! :+1:

jmcnamara commented 3 months ago

There's no need to unwrap later because get_or_init always returns a valid reference.

Got it, thanks.

In formula.rs line 973: Careful with formula.chars().enumerate(). You most likely want to use char_indices. An unicode char could span several bytes in a utf-8 string.

Good catch. That was a bug. I suppose that I could use formula.as_bytes() and iterate over it as well. I don't know if you can still use the nice Char::is_alphanumic() functions then though.

Also, It seems like you didn't need to use aho-corasick. Good stuff!

Yes, there were a few edge case that meant that parsing was better than raw match/replace. I had avoided doing this previously (in the other language versions too) but overall it is a better solution (if I got it right).

I'm made those changes and some others and forces pushed to main:

41f763c

I also need to make some doc changes since some of the formula APIs are no longer necessary now.

jmcnamara commented 3 months ago

I fixed the emoji match issue like this:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/utility.rs#L551

The big match may be inefficient (I don't know how the compiler handles cases like this) but it is in a function that is rarely called so it doesn't need optimization.

There may be emoji edge cases that I am missing but if there are then the comparison will fail in safe mode.

So, in short it is good enough.

That is the last of the work on this so I will merge it back to main. @adriandelgado I am missing one of your suggested optimizations and the other one I probably won't use. If you want to submit a PR for that please do.

I will leave this on main for about a week while I work on another feature and then I will publish it.

Thanks for the input to date.

dodomorandi commented 3 months ago

Wow, looks like a very nice work indeed! Thank you all for improving the crate!

@jmcnamara I confirm that I see comparable improvements in build size (which is impressive to be honest).

I was briefly looking at the changes, and maybe I have a suggestion, but keep in mind that it is just a theoretical thing -- it probably does not matter at all if it is not anything relevant when profiling and benchmarking. Said that: the FUTURE_FUNCTIONS probably could gain some performance using a perfect hash function, and maybe it could also help is_emoji (but I am less sure in that case, and as you already said it is probably used rarely). And another little thing that caught my eye is is_escaped, that could iterate through url one single time instead of multiple times. As I said, these cases could be irrelevant to a real benchmark, so maybe it makes more sense to check if it makes sense to try alternative approaches.

On the other hand, if you are able to see some parts of the code that occupy a considerable amount of space in a flamegraph (maybe using the examples, don't know), it could be worth to focus on these parts. In any case, nice work indeed! :heart:

jmcnamara commented 3 months ago

@dodomorandi Thanks for the feedback.

And another little thing that caught my eye is is_escaped, that could iterate through url one single time instead of multiple times.

@adriandelgado pointed that out too with a suggested fix: https://github.com/jmcnamara/rust_xlsxwriter/commit/6aee84e63ba2b368b14bb53c0172e2d6f9b410e6#comments

I've merged that upstream with Adrian as the author.

Also, I've merged everything to main.

jmcnamara commented 2 months ago

I've released this in rust_xlsxwriter v0.75.0. Thanks for the input.

jmcnamara / rust_xlsxwriter

feature request: Potential removal of the Regex crate #108

Feature Request