mazira / rtf-stream-parser

Contains native Node classes for transforming an RTF byte stream into tokens, and de-encapsulating HTML
MIT License

Allow inputs of strings #2

Closed JaredCE closed 5 years ago

JaredCE commented 6 years ago

What would be the chance of allowing native input of a string, rather than a file input? Otherwise, your library is pretty good 😄

rossj commented 6 years ago

Hi there, sorry for the delay in getting back to you.

It is possible to do the tokenization / de-encapsulation synchronously, where you just have a Buffer or string and not a stream. This is how I use it synchronously in one of my projects:

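// Assumes the package's named exports, Tokenize and DeEncapsulate
const { Tokenize, DeEncapsulate } = require('rtf-stream-parser');

function deEncapsulate(rtf) {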
  const onError = (err) => {
    if (err) {
      throw err;
    }
  };

  const stream1 = new Tokenize();
  const stream2 = new DeEncapsulate('either', true);

  // Hijack the push methods
  stream1.push = (token) => {
    stream2._transform(token, '', onError);
    return true;
  };

  const strs = [];
  stream2.push = (piece) => {
    strs.push(piece);
    return true;
  };

  // Pump the data
  stream1._transform(rtf, '', onError);
  stream1._flush(onError);
  stream2._flush(onError);

  const str = strs.join('');
  if (!str.startsWith('html:') && !str.startsWith('text:')) {
    throw new Error('Expected "html:" or "text:" prefix');
  }

  return {
    mode: str.startsWith('html:') ? 'html' : 'text',
    text: str.slice(5) // drop the 5-character "html:" / "text:" prefix
  };
}

Then, you can call it like this:

result = deEncapsulate(someBuffer);
result = deEncapsulate(Buffer.from(someString));
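To make the push-hijacking trick easier to see in isolation, here is a minimal sketch of the same pattern on a toy Transform built only on Node's `stream` module. `Upper` and `runSync` are hypothetical names invented for this illustration and are not part of rtf-stream-parser. The approach only works when `_transform` does its work synchronously, which is an assumption about the stream being driven.

```javascript
const { Transform } = require('stream');

// Toy transform used only for illustration: upper-cases each chunk.
class Upper extends Transform {
  _transform(chunk, encoding, callback) {
    this.push(chunk.toString().toUpperCase());
    callback();
  }
}

// Hypothetical helper: drive a Transform synchronously by replacing its
// push method, so output is collected in an array instead of being queued
// on the readable side of the stream.
function runSync(transform, input) {
  const pieces = [];
  const onError = (err) => {
    if (err) {
      throw err;
    }
  };

  // Hijack push: collect every piece the transform emits
  transform.push = (piece) => {
    if (piece !== null) {
      pieces.push(piece);
    }
    return true;
  };

  // Pump the data; _flush is optional on a Transform
  transform._transform(input, 'buffer', onError);
  if (typeof transform._flush === 'function') {
    transform._flush(onError);
  }

  return pieces.join('');
}

console.log(runSync(new Upper(), Buffer.from('abc')));
```

Because nothing is piped, errors reach the caller as ordinary exceptions thrown from `onError` rather than as `'error'` events.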

However, deEncapsulate still requires a Buffer input. You can simply make a Buffer from your string before passing it in, but you have to take care that the encoding you use to build the Buffer matches what the RTF stream says about itself. In other words, the RTF stream itself indicates the encoding of its data, and I think there would be problems if this indicated encoding differs from the actual encoding of the Buffer. If possible, I would recommend keeping your data in binary form from its source.
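As a rough illustration of that encoding concern, the sketch below shows how the choice of encoding in `Buffer.from` changes the bytes produced from the same string. It assumes the string holds one character per original byte (a "binary string"), which may not be true for your data. `declaredCodePage` and `binaryStringToBuffer` are hypothetical helpers, not part of the library; the first one just looks for the standard RTF `\ansicpgN` control word.

```javascript
// Hypothetical helper: read the ANSI code page the RTF declares through
// the \ansicpgN control word; returns undefined when none is present.
function declaredCodePage(rtfString) {
  const match = /\\ansicpg(\d+)/.exec(rtfString);
  return match ? Number(match[1]) : undefined;
}

// Hypothetical helper: turn a string whose characters each stand for one
// raw byte into a Buffer without re-encoding it.
function binaryStringToBuffer(rtfString) {
  return Buffer.from(rtfString, 'latin1');
}

// Sample RTF containing one non-ASCII character (U+00E9)
const sample = '{\\rtf1\\ansi\\ansicpg1252 caf\u00e9}';

const asLatin1 = binaryStringToBuffer(sample); // one byte per character
const asUtf8 = Buffer.from(sample);            // default 'utf8': U+00E9 becomes two bytes

console.log(declaredCodePage(sample));
console.log(asLatin1.length, asUtf8.length);
```

Here the two Buffers differ in length by one byte, so a parser that decodes according to the declared code page would see different data depending on which Buffer it receives.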

I can work on some better documentation and maybe a utility function for this synchronous, non-stream use case.