japanoise / emsys

ersatz-emacs text editor
MIT License

Better regex library #30

Open japanoise opened 1 month ago

japanoise commented 1 month ago

tiny-regex-c just barely does what we want it to, and is not really actively maintained upstream (iirc the maintainer answers issues but doesn't merge PRs or do active development). Basically the regex support here is "well, it's good enough to do very basic tasks", not nearly as good as real Emacs, and not extensible because I cba to hop into a hostile new codebase to maintain a regex library (regexes are useful, but I have no interest in learning how they work).

It'd be better if we could either:

I believe vim and other text editors use forks of a dogeared old regex.h that was going around usenet at the time; we may be able to work based on that, if I can find it again.

nicholascarroll commented 1 month ago

Both options are good. Making (or finding) one that exactly mimics Emacs would be ideal, as long as it truly matches, while PCRE is also a strong choice.

Another way to look at this is to match how grep works. There is POSIX grep, GNU grep and PCRE-style grep (using the -P option).

Emacs

Implementing a similar regex engine from scratch, mimicking Emacs' behavior and features, would be pretty cool. And maybe someone has already done it and shared it?

POSIX Grep

Older systems use POSIX Basic Regular Expressions (BRE). Just like termios.h, regex.h is part of the POSIX.1-2001 standard, which also covers the more modern Extended Regular Expressions (ERE): "Both BREs and EREs are supported by the Regular Expression Matching interface in the System Interfaces volume of IEEE Std 1003.1-2001" ~ The Open Group Base Specifications. OSX and BSD also use POSIX grep.
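
To make the BRE/ERE distinction concrete, here is a minimal sketch using the POSIX regex.h interface. The helper name `match_with_flags` is made up for illustration; only `regcomp`, `regexec`, `regfree`, `REG_EXTENDED` and `REG_NOSUB` come from the standard.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in `text`,
 * 0 if it does not, and -1 if the pattern fails to compile.
 * Pass cflags = 0 for BRE or cflags = REG_EXTENDED for ERE. */
static int match_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;

    /* REG_NOSUB: we only want a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this, `ab+c` compiled as ERE matches `abbc`, while compiled as BRE it only matches the literal text `ab+c`, since `+` is an ordinary character in BRE. Grouping and intervals are written `\(ab\)\{2\}` in BRE and `(ab){2}` in ERE.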

PCRE Grep

There is a program called pcregrep. But you can use PCRE in normal grep by using the -P option. So PCRE is not default grep. And it would be great to stick to the idea of no extra libraries and stay compatible with older POSIX systems.

GNU Grep

GNU grep does not use the regex.h provided by the system. Instead, it uses the GNU C Library (glibc) regex library, which provides additional features and extensions beyond the standard POSIX regular expressions. You can set POSIXLY_CORRECT and GNU grep will use POSIX. But the two are very similar anyway; see this REGEX comparison chart, which goes into detail about the differences between GNU ERE and POSIX ERE.

My Vote

In the end, my vote goes to POSIX Extended Regex (ERE). I am a fan of POSIX 2001 compliance because I actually intend to use this toy for real work on legacy servers. I have already got CFLAGS+=-std=c99 -D_POSIX_C_SOURCE=200112L enabled in my windowScroll branch (I only needed to use a custom stringdup instead of strdup).
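
For context on the stringdup remark: strdup is an XSI extension in POSIX.1-2001, so with -std=c99 and _POSIX_C_SOURCE=200112L the glibc headers most likely do not declare it. The code in the windowScroll branch is not shown in this thread, so the following is only a guess at what such a replacement could look like, using nothing beyond standard C.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup() built only on ISO C functions.
 * Returns a malloc'd copy of `s`, or NULL if allocation fails.
 * The caller owns the result and must free() it. */
char *stringdup(const char *s)
{
    size_t len = strlen(s) + 1;   /* include the terminating NUL */
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}
```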

So then it is simply:

#include <regex.h>

regcomp(&regex, pattern, REG_EXTENDED)
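
Fleshed out a little, the call above might be used as follows. This is a sketch rather than emsys code: `find_first` is a hypothetical helper name, and the error handling is just one reasonable choice. The regex.h calls themselves (`regcomp`, `regexec`, `regerror`, `regfree`) are standard POSIX.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: find the first ERE match of `pattern` in `text`.
 * Returns 0 and stores the match's byte offsets in *start and *end,
 * 1 if there is no match, or -1 if the pattern does not compile. */
static int find_first(const char *pattern, const char *text,
                      size_t *start, size_t *end)
{
    regex_t regex;
    regmatch_t m[1];

    int rc = regcomp(&regex, pattern, REG_EXTENDED);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &regex, msg, sizeof msg);  /* human-readable reason */
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }

    rc = regexec(&regex, text, 1, m, 0);
    regfree(&regex);
    if (rc != 0)
        return 1;

    *start = (size_t)m[0].rm_so;  /* offset of first matched byte */
    *end = (size_t)m[0].rm_eo;    /* offset one past the last matched byte */
    return 0;
}
```

The offsets are in bytes, not characters, which matters for the UTF-8 question in this issue.
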

nicholascarroll commented 1 month ago

I just tried it out on MSYS2 / mingw64. Seems to work fine. I got regex.h from:

$ pacman -S mingw-w64-x86_64-libsystre

regcomp(&regex, pattern, REG_EXTENDED)

nicholascarroll commented 1 month ago

Flaw in my thinking here: I assumed that a POSIX 2001 system's regex.h would work with your (very well implemented) bundled UTF-8 without a crazy amount of frigging around. It would not, so the regex library would need to be bundled.

Thinking more about it, GNU regex would theoretically be the best choice cos it supports most of the same constructs as Emacs regex. And you could have a config.def.h option to enable POSIXLY_CORRECT. The only problem is that your project would become GPL.

GNU grep version 2.6 introduced UTF-8 support in 2010 (took their time :-o). It's in file regex_internal.h|c. It conditionally includes localcharset.h which is part of the GNU charset library.

Seems like a lot of work.

I would be inclined to go the other way: plan to forsake POSIX 2001 for POSIX 2008 some time in the future, replace the bundled/custom UTF-8 management with system UTF-8 (locale, wchar, wctype), and then be able to use the system regex.h. emsys would then have less source code, fewer dependencies on projects that might not get maintained, and a significantly smaller custom code footprint, making it more familiar and simpler for people wanting to hack on it.
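
As a rough illustration of what "system UTF-8" means here, the sketch below counts characters in a multibyte string with the standard `mbrtowc` instead of custom UTF-8 decoding. `count_chars` is a hypothetical name and nothing here is taken from emsys. The result depends on the process locale, so the program would need to call `setlocale(LC_CTYPE, "")` at startup and be run under a UTF-8 locale.

```c
#include <locale.h>   /* setlocale(), which the caller runs once at startup */
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: number of characters in the multibyte string `s`
 * according to the current LC_CTYPE locale, or (size_t)-1 if `s` contains
 * an invalid or truncated sequence. */
static size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t n = 0;
    size_t len = strlen(s);

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (len > 0) {
        size_t k = mbrtowc(NULL, s, len, &st);
        if (k == (size_t)-1 || k == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (k == 0)
            break;                      /* reached the terminating NUL */
        s += k;
        len -= k;
        n++;
    }
    return n;
}
```

The system regex.h follows the same locale, which is why moving to locale-based UTF-8 would let it handle multibyte characters without a bundled library.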

nicholascarroll commented 1 month ago

Not only does re.h have very few commands but it is also drastically slower than regex. I have done some wall clock time benchmarking of re.h versus the regex.h (ldd (Ubuntu GLIBC 2.35-0ubuntu3.8) 2.35).

Pattern: \bword\b
re.h time: 7.587977 seconds
regex.h time: 1.227302 seconds
re.h is 518.26% slower than regex.h

Pattern: ^line
re.h time: 0.000871 seconds
regex.h time: 0.038799 seconds
re.h is 97.76% faster than regex.h

Pattern: [A-Z][a-z]+
re.h time: 16.740947 seconds
regex.h time: 0.720991 seconds
re.h is 2221.94% slower than regex.h

Pattern: \b\w{6,}\b
re.h time: 7.742236 seconds
regex.h time: 0.082319 seconds
re.h is 9305.16% slower than regex.h

Pattern: func\w+(
re.h time: 8.228889 seconds
regex.h time: 1.619219 seconds
re.h is 408.20% slower than regex.h

Pattern: \b[A-Z][A-Z0-9_]*\b
re.h time: 7.806865 seconds
regex.h time: 0.719874 seconds
re.h is 984.48% slower than regex.h

Pattern: \b(int|float|char)\b
re.h time: 7.588203 seconds
regex.h time: 0.045561 seconds
re.h is 16555.04% slower than regex.h

Pattern: //.*$
re.h time: 7.351127 seconds
regex.h time: 0.719812 seconds
re.h is 921.26% slower than regex.h

Pattern: /*[\s\S]?\/
re.h time: 7.353695 seconds
regex.h time: 0.723316 seconds
re.h is 916.66% slower than regex.h
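
For anyone reading the percentages: they appear to be computed as below. The function names are made up for this note, and the formulas are inferred from the numbers rather than taken from the benchmark script, which is not posted here.

```c
/* Inferred formula: how much longer `slow` took than `fast`,
 * expressed as a percentage of `fast`. */
static double percent_slower(double slow, double fast)
{
    return (slow - fast) / fast * 100.0;
}

/* Inferred formula: how much time `fast` saved compared with `slow`,
 * expressed as a percentage of `slow`. */
static double percent_faster(double fast, double slow)
{
    return (slow - fast) / slow * 100.0;
}
```

Note that the two figures use different baselines, so they are not directly comparable: "97.76% faster" for ^line means re.h took about 1/44 of the regex.h time, while "518.26% slower" for \bword\b means re.h took a bit over 6 times as long.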

On the other hand, I never really expected much from a lightweight little editor. I mean, I would just use grep/sed/awk.

So this issue is just for incremental regex search and replace right?

japanoise commented 1 month ago

So this issue is just for incremental regex search and replace right?

For now, yeah. I'm not really planning on using it for much else than basic interactive usage.