iter-tools / regex

A streaming regex evaluation engine
MIT License
11 stars 1 forks source link

@iter-tools/regex

@iter-tools/regex is a fully-featured regex engine, scripted in javascript. The engine's implementation is non-backtracking, which makes it ideal for matching against streaming inputs of any kind. It is expected to be used most commonly in the building of streaming parsers, especially in conjunction with @iter-tools/parserate (coming soon!).

Not everyone needs a streaming regex engine. If you are matching a static regex against string data, it is very likely that you should be using the native regex implementation. However if you are working on data that is fundamentally a stream and this engine may save you from having to load all the data into a string first. If perf is your only reason to use this engine, make sure to do some tests to see that you are actually gaining perf. The engine is still quite slow!

Performance

The non-backtracking design also means the engine is not vulnerable to the phenomenon known as catastrophic backtracking, which can make some not-uncommon naively written patterns have essentially infinite time cost to evaluate. This makes the engine more suitable for use with user-supplied patterns, especially when combined with tools like glob syntaxes which can offer users some of the power of regex (and compile to regexe) but without the steep learning curve of regex syntax.

While the engine is not vulnerable to catastrophic backtracking, it can still be attacked or misued. Bad patterns will tend to cause the engine's match state to balloon in size, consuming lots of memory.

In terms of raw performance, this library is still extremely slow -- 50x - 80x slower than native regex for normal patterns, and currently up to 2000x slower for certain patterns that do not contain any branches. Current work is on closing this perf gap, and there is reason to think it can be narrowed significantly.

API

test(pattern, input)
exec(pattern, input)
execGlobal(pattern, input)

Note that this API is exported as three separate submodules, each with a slightly different purpose!

The modules are:

test

import { test } from '@iter-tools/regex';
import { test as testAsync } from '@iter-tools/regex/async';
import { test as testChunked } from '@iter-tools/regex/async/chunked';

const didMatch = test(pattern, input);
const didMatch = await testAsync(pattern, input);
const didMatch = await testChunked(pattern, input);

didMatch will be true if pattern matches at some location in input.

exec

import { exec } from '@iter-tools/regex';
import { exec as execAsync } from '@iter-tools/regex/async';
import { exec as execChunked } from '@iter-tools/regex/async/chunked';

const captures = exec(pattern, input);
const captures = await execAsync(pattern, input);
const captures = await execChunked(pattern, input);
// 1-indexed by lexical order of `(` ($2 is b)
const [match, $1, $2, $3] = exec(/(a(b))(c)/, input);

captures will be the array of [match, ...captures] from the first location where pattern matches in input. This method differs from the spec in that it returns [] (NOT null) when pattern is not found in input. This is so that it may be used more nicely with destructuring. If you need to check if the match was present, you can still do it nicely with destructuring syntax:

const [match = null, $1] = exec(/.*(a)/, input);

if (match !== null) console.log(`match: '${match}'`);
if ($1 !== undefined) console.log(`$1: '${$1}'`);

execGlobal

import { execGlobal } from '@iter-tools/regex';
import { execGlobal as execGlobalAsync } from '@iter-tools/regex/async';
import { execGlobal as execGlobalChunked } from '@iter-tools/regex/async/chunked';

const [...matches] = execGlobal(pattern, input);
for await (const match of execGlobal(pattern, input)) { }
for await (const match of execGlobalChunked(pattern, input)) { }

matches is an iterable of match arrays (Iterable[[match, ...captures], ...matches]). If pattern is not found in input the iterable of matches will be empty. execGlobal interacts with the global (/g) flag. If the /g flag is not present the matches iterable will never contain more than one match.

Patterns and flags

Some syntaxes are unsupported. Unsupported syntaxes are still parsed as long as they are in the well-supported regexpp parser, so that you will not be allowed to write expressions which would not be compatible with other engines.

Credits

Thanks Jason Priestley! Without your blog post I would not have known where to start. Also thanks to my friends and family, who have heard me talk about this more than any of them could possibly want to.