jviereck / regexp.js

A JavaScript implementation of RegExp for debugging purposes.
BSD 2-Clause "Simplified" License

Proposal: Stream compatible RegExp implementation #8

Closed jamestalmage closed 9 years ago

jamestalmage commented 9 years ago

I think it would be cool to take the parser and AST you have created and generate a Node.js Stream-compatible version.

API would go something like this:

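// Proposed API: createInputStream and StreamRegex do not exist yet.
var input = createInputStream('hello how are you');
var streamRegex = new StreamRegex('\\w+');
var arr = [];
input.pipe(streamRegex.match())
  .on('data', function (chunk) {
    arr.push(chunk.toString('utf8'));
  })
  .on('end', function () {
    // Log only once the stream has ended; 'data' events fire asynchronously.
    console.log(arr);
    // ['hello', 'how', 'are', 'you']
  });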
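
To make the idea concrete, here is a minimal sketch of how the `\w+` case above could be handled chunk by chunk. It is not part of regexp.js: `isWordByte`, `scanChunk`, and `createWordMatchStream` are hypothetical names, the matcher is hard-coded for ASCII `\w+` rather than built from a parsed AST, and it only relies on Node's built-in `stream.Transform` and `Buffer`.

```javascript
const { Transform } = require('stream');

const EMPTY = Buffer.alloc(0);

// True for the ASCII bytes matched by \w: [A-Za-z0-9_].
function isWordByte(b) {
  return (b >= 48 && b <= 57) || (b >= 65 && b <= 90) ||
         (b >= 97 && b <= 122) || b === 95;
}

// Hypothetical helper: scans one chunk and returns the matches completed in
// it, plus the unfinished word that must be carried into the next chunk.
function scanChunk(carry, chunk) {
  const matches = [];
  let start = 0;                   // start of the current run of word bytes
  let inWord = carry.length > 0;   // a word may continue from the last chunk
  for (let i = 0; i < chunk.length; i++) {
    const w = isWordByte(chunk[i]);
    if (w && !inWord) {
      start = i;
      inWord = true;
    } else if (!w && inWord) {
      const part = chunk.subarray(start, i); // a view, no copy
      // Copy only when a word straddles a chunk boundary.
      matches.push(carry.length ? Buffer.concat([carry, part]) : part);
      carry = EMPTY;
      inWord = false;
    }
  }
  if (inWord) {
    const tail = chunk.subarray(start);
    carry = carry.length ? Buffer.concat([carry, tail]) : tail;
  }
  return { matches, carry };
}

// Wraps scanChunk in a Transform stream that emits one Buffer per match.
function createWordMatchStream() {
  let carry = EMPTY;
  return new Transform({
    readableObjectMode: true,
    transform(chunk, _encoding, done) {
      const result = scanChunk(carry, chunk);
      carry = result.carry;
      result.matches.forEach((m) => this.push(m));
      done();
    },
    flush(done) {
      if (carry.length) this.push(carry); // last word ends with the stream
      done();
    },
  });
}
```

In this sketch the only state kept between chunks is the unfinished word, so memory use is bounded by the longest match rather than by the size of the input.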

Goals / Ideas:

  1. Equivalents for match, test, split, and replace
  2. Work with very large inputs (i.e. larger than available memory). This would be the key advantage of a stream-based version over the default.
  3. Avoid copying buffer data where possible: buf.slice() returns a view onto the same memory, whereas Buffer.concat() allocates a new buffer and copies, so it should be used sparingly
  4. Work with multiple encodings
  5. Be fast
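
For goal 3 it matters which Buffer operations share memory and which copy. The short sketch below shows the difference, using only Node's built-in `Buffer`. The function name `demoSharing` is made up for illustration, and `subarray()` is used because it returns the same kind of view as `buf.slice()` does on a Buffer.

```javascript
// Hypothetical demo: shows that a slice is a view while concat is a copy.
function demoSharing() {
  const original = Buffer.from('hello how are you');

  // A view onto the first five bytes; no data is copied.
  const view = original.subarray(0, 5);

  // Buffer.concat allocates a new Buffer and copies the pieces into it.
  const joined = Buffer.concat([original.subarray(0, 5), original.subarray(5)]);

  // Mutate the original: change 'h' to 'H'.
  original[0] = 0x48;

  return {
    view: view.toString('utf8'),     // reflects the change
    joined: joined.toString('utf8'), // unaffected, because it is a copy
  };
}
```

A streaming matcher can therefore hand out slices for matches that lie inside a single chunk and fall back to a copy only when a match spans a chunk boundary.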

I've searched, but have not found anything that operates this way. I did find this, but it converts the buffers to strings and concatenates them (violating goals 2 and 3 above).

Obviously this would be a separate project from this one, but it could certainly share the parser and AST at a minimum (and likely more). I may try implementing it myself, but it would be nice to have buy-in and input from the contributors here, especially if I end up wanting to refactor some of the code here to facilitate reuse in my project. Help from experts on the problem domain would certainly be welcome.

I think it could be pretty powerful. Thoughts?

jviereck commented 9 years ago

> Obviously this would be a separate project from this one,

I agree: creating a stream-based RegExp engine is out of scope for this project. Therefore, I am going to close this issue.

> Work with very large inputs (i.e. larger than available memory)

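This sounds like a good idea at first, but note that it is trivial to write a RegExp that might match the entire stream input, such as /.+TheEnd$/. Whether such a pattern matches cannot be decided until the stream ends, so the engine would have to hold on to all of the input. Therefore, designing a streaming-based RegExp might require restricting the expressiveness of the RegExp language.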
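
A small sketch of the problem, using the built-in RegExp only. The helper names `testPerChunk` and `testBuffered` and the sample chunks are made up for illustration. Testing each chunk on its own gives the wrong answer in both directions, while the correct answer needs the complete input.

```javascript
const pattern = /.+TheEnd$/;

// Naive approach: test each chunk in isolation.
function testPerChunk(chunks) {
  return chunks.some((chunk) => pattern.test(chunk));
}

// Correct approach: buffer everything, then test once at end of stream.
function testBuffered(chunks) {
  return pattern.test(chunks.join(''));
}

// The match straddles a chunk boundary, so no single chunk matches.
const splitAcrossChunks = ['some data The', 'End'];

// The first chunk matches on its own, but later data invalidates the `$`.
const moreDataFollows = ['xTheEnd', ' but more follows'];

const results = {
  splitPerChunk: testPerChunk(splitAcrossChunks), // false (a missed match)
  splitBuffered: testBuffered(splitAcrossChunks), // true
  earlyPerChunk: testPerChunk(moreDataFollows),   // true (a false positive)
  earlyBuffered: testBuffered(moreDataFollows),   // false
};
```

Because `.+` can reach back to the start of the input and `$` depends on where the input ends, nothing can be discarded along the way. That is the kind of pattern a streaming engine would probably have to reject or restrict.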