mazira / rtf-stream-parser

Contains native Node classes for transforming an RTF byte stream into tokens, and de-encapsulating HTML
MIT License
23 stars 4 forks source link

rtf-stream-parser

This module is primarily used to extract RTF-encapsulated text and HTML, which is a common message body format used in Outlook / Exchange / MAPI email messages and the related file formats (.msg, .pst, .ost, .olm). The RTF-encapsulated formats are described in [MS-OXRTFEX].

This module exposes high-level functions where you may pass in an RTF string, Buffer, or stream, and get out the de-encapsulated content. Additionally, this module contains two lower level stream Transform classes that handle the tokenization and de-encapsulation processs and may be used for other low-level operations.

This code is used in production at GoldFynch, an e-discovery platform, for extracting HTML and text email bodies that have passed through Outlook mail systems.

New in version 3.x

Simple Usage

This module generally needs to be used with an expanded string decoder library such as iconv-lite or iconv in order to handle the various ANSI codepages commonly found in RTF. The string decoding is done via a callback that is passed in an options object.

Using iconv-lite

import * as iconvLite from 'iconv-lite';
import { deEncapsulateSync } from 'rtf-stream-parser';

const rtf = '{\\rtf1\\ansi\\ansicpg1252\\fromtext{{{{{{hello}}}}}}}';
const result = deEncapsulateSync(rtf, { decode: iconvLite.decode });
console.log(result); // { mode: 'text', text: 'hello' }

Using iconv

import * as iconv from 'iconv';
import { deEncapsulateSync } from 'rtf-stream-parser';

const decode = (buf, enc) => {
    const converter = new iconv.Iconv(enc, 'UTF-8//TRANSLIT//IGNORE');
    return converter.convert(buf).toString('utf8');
};

const rtf = '{\\rtf1\\ansi\\ansicpg1252\\fromtext{{{{{{hello}}}}}}}';
const result = deEncapsulateSync(rtf, { decode: decode });
console.log(result); // { mode: 'text', text: 'hello' }

De-encapsulating a stream (async buffered result)

import * as fs from 'fs';
import * as iconvLite from 'iconv-lite';
import { deEncapsulateStream } from 'rtf-stream-parser';

const stream = fs.createReadStream('encapsulated.rtf');
deEncapsulateStream(stream, { decode: iconvLite.decode }).then(result => {
    console.log(result); // { mode: '...', text: '... }
});

De-encapsulating a stream (streaming result)

import * as fs from 'fs';
import * as iconvLite from 'iconv-lite';
import { Tokenize, DeEncapsulate } from 'rtf-stream-parser';

const input = fs.createReadStream('encapsulated.rtf');
const output = fs.createWriteStream('output.html');

input.pipe(new Tokenize())
     .pipe(new DeEncapsulate({
         decode: iconvLite.decode
         mode: 'either'
     })
     .pipe(output);

High-level functions

deEncapsulateSync(input[, options])

This function de-encapsulates HTML or text data from an RTF string or Buffer. Throws an error if the given RTF does not contain encapsulated data.

deEncapsulateStream(input[, options])

This function de-encapsulates HTML or text data from an RTF string or Buffer. Throws an error if the given RTF does not contain encapsulated data.

Tokenize Class

A low-level parser & tokenizer of incoming RTF data. This Transform stream takes input of raw RTF data, generally in the form of Buffer chunks, and generates "object mode" output chunks representing the parsed RTF operations. String input chunks are also accepted, but are converted to Buffer based on the stream's default string encoding.

The output objects have the following format:

{
    // The type of the token.
    type: number; // GROUP_START = 0, GROUP_END = 1, CONTROL = 2, TEXT = 3

    // For control words / symbols, the name of the word / symbol.
    word?: string;

    // The optional numerical parameter that control words may have.
    param?: number;

    // Binary data from `\binN` and `\'XX` controls as well as string literals.
    // String literals are kept as binary due to unknown encoding at this
    // level of processing.
    data?: Buffer
}

Notes:

De-Encapsulate Class

This class takes RTF-encapsulated text (HTML or text), de-encapsulates it, and produces a string output. This Transform class takes tokenized object output from the Tokenize class and produces string chunks of output HTML.

Apart from it's specific use, this class also serves as an example of how to consume and use the Tokenize class.

The constructor takes two optional arguments:

new DeEncapsulate(options);

Future Work

Currently, the Tokenize class is pretty low level, and the DeEncapsulate class is very use-case specific. Some work could be done to abstract the generally-useful parts of the DeEncapsulate class into a more generic consumer. I would also like to add build-in support for all codepages mentioned in the RTF spec.