harshankur / officeParser

A Node.js library to parse text out of any office file. Currently supports docx, pptx, xlsx, odt, odp, and ods.
MIT License

officeParser

A Node.js library to parse text out of any office file.

Supported File Types

docx, pptx, xlsx, odt, odp, ods

Install via npm

npm i officeparser

Command Line usage

If you want to call the installed officeParser.js file directly, use the following command:

node </path/to/officeParser.js> <fileName>

Otherwise, you can use npx to extract the parsed text right away:

npx officeparser <fileName>

Library Usage

const officeParser = require('officeparser');

// callback
officeParser.parseOffice("/path/to/officeFile", function(data, err) {
    // "data" string in the callback here is the text parsed from the office file passed in the first argument above
    if (err) {
        console.log(err);
        return;
    }
    console.log(data);
})

// promise
// "data" in the promise chain below is the text parsed from the office file passed in the argument
officeParser.parseOfficeAsync("/path/to/officeFile")
    .then(data => console.log(data))
    .catch(err => console.error(err));

// async/await (in a CommonJS module, await must run inside an async function)
async function printParsedText() {
    try {
        // the promise resolves to the text parsed from the office file passed in the argument
        const data = await officeParser.parseOfficeAsync("/path/to/officeFile");
        console.log(data);
    } catch (err) {
        // handle the error
        console.error(err);
    }
}
printParsedText();

// USING FILE BUFFERS
// Instead of a file path, you can also pass a file buffer of one of the supported files
// to the parseOffice or parseOfficeAsync functions.

const fs = require('fs');

// get the file buffer
const fileBuffers = fs.readFileSync("/path/to/officeFile");
// get parsed text from officeParser
// NOTE: Buffers only work with parseOffice and parseOfficeAsync. The old functions are not supported.
officeParser.parseOfficeAsync(fileBuffers)
    .then(data => console.log(data))
    .catch(err => console.error(err));

Configuration Object: OfficeParserConfig

Optionally pass a config object as the third argument to parseOffice (or the second argument to parseOfficeAsync) to set the following options.

| Flag | DataType | Default | Explanation |
|---|---|---|---|
| tempFilesLocation | string | officeParserTemp | The directory where officeparser stores its temp files. The final decompressed data is put inside an officeParserTemp folder within that directory. Please ensure that this directory actually exists. |
| preserveTempFiles | boolean | false | Flag to keep the internal content files and any duplicate temp files that officeparser creates while unzipping office files. By default, all of those files are deleted. |
| outputErrorToConsole | boolean | false | Flag to show all the logs in the console in case of an error. |
| newlineDelimiter | string | \n | The delimiter used for every new line in places that allow multiline text, such as Word documents. |
| ignoreNotes | boolean | false | Flag to skip notes when parsing files such as PowerPoint presentations. By default, notes are included in the parsed text. |
| putNotesAtLast | boolean | false | If set to true, all the parsed text from notes is collected at the end of the output for files such as PowerPoint presentations. By default, each note is placed right after its main slide content. This flag is ignored if ignoreNotes is set to true. |


const config = {
    newlineDelimiter: " ",  // Separate new lines with a space instead of the default \n.
    ignoreNotes: true       // Ignore notes while parsing presentation files like pptx or odp.
}

// callback
officeParser.parseOffice("/path/to/officeFile", function(data, err){
    if (err) {
        console.log(err);
        return;
    }
    console.log(data);
}, config)

// promise
officeParser.parseOfficeAsync("/path/to/officeFile", config)
    .then(data => console.log(data))
    .catch(err => console.error(err));
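
To reason about which settings are in effect for a partial config, the documented defaults can be written out and merged by hand. The sketch below is only an illustration: DEFAULT_CONFIG and effectiveConfig are hypothetical names, not part of officeParser, and the library's own merging logic is not shown in this README. The default values are copied from the flag descriptions above.

```javascript
// Defaults as documented for OfficeParserConfig in this README.
const DEFAULT_CONFIG = Object.freeze({
    tempFilesLocation: "officeParserTemp",
    preserveTempFiles: false,
    outputErrorToConsole: false,
    newlineDelimiter: "\n",
    ignoreNotes: false,
    putNotesAtLast: false
});

// Hypothetical helper (not an officeParser API): returns the settings that
// would apply for a partial config, according to the documented behaviour.
function effectiveConfig(userConfig = {}) {
    const merged = { ...DEFAULT_CONFIG, ...userConfig };
    // Documented rule: putNotesAtLast is ignored when ignoreNotes is true.
    if (merged.ignoreNotes) {
        merged.putNotesAtLast = false;
    }
    return merged;
}
```

For example, effectiveConfig({ ignoreNotes: true, putNotesAtLast: true }) reports putNotesAtLast as false, which mirrors the rule that the flag has no effect once notes are ignored.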

Example - JavaScript

const officeParser = require('officeparser');

const config = {
    newlineDelimiter: " ",  // Separate new lines with a space instead of the default \n.
    ignoreNotes: true       // Ignore notes while parsing presentation files like pptx or odp.
}

// relative path is also fine => eg: files/myWorkSheet.ods
officeParser.parseOfficeAsync("/Users/harsh/Desktop/files/mySlides.pptx", config)
    .then(data => {
        const newText = data + " look, I can parse a powerpoint file";
        callSomeOtherFunction(newText);
    })
    .catch(err => console.error(err));

// Search for a term in the parsed text.
function searchForTermInOfficeFile(searchterm, filepath) {
    return officeParser.parseOfficeAsync(filepath)
        .then(data => data.indexOf(searchterm) !== -1);
}
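
The search helper above can be extended to several files at once. The following is a minimal sketch under stated assumptions: findFilesContaining is a hypothetical helper, not part of officeParser, and it relies only on parseOfficeAsync(file, config) resolving to a string, as in the examples above. The parser is passed in as an argument so the function can also be exercised with a stub.

```javascript
// Hypothetical helper (not an officeParser API): resolves to the file paths
// whose parsed text contains searchTerm, compared case-insensitively.
// `parser` is any object exposing parseOfficeAsync(file, config).
async function findFilesContaining(parser, searchTerm, filePaths, config) {
    const needle = searchTerm.toLowerCase();
    // Parse all files concurrently; Promise.all rejects if any single parse fails.
    const matches = await Promise.all(filePaths.map(async (filePath) => {
        const text = await parser.parseOfficeAsync(filePath, config);
        return text.toLowerCase().includes(needle) ? filePath : null;
    }));
    // Drop the files that did not match, keeping the input order.
    return matches.filter((filePath) => filePath !== null);
}
```

With the real library, the first argument would be the object returned by require('officeparser'). If one unreadable file should not abort the whole search, Promise.allSettled could be used in place of Promise.all.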

Example - TypeScript

const officeParser = require('officeparser');

const config: OfficeParserConfig = {
    newlineDelimiter: " ",  // Separate new lines with a space instead of the default \n.
    ignoreNotes: true       // Ignore notes while parsing presentation files like pptx or odp.
}

// relative path is also fine => eg: files/myWorkSheet.ods
officeParser.parseOfficeAsync("/Users/harsh/Desktop/files/mySlides.pptx", config)
    .then(data => {
        const newText = data + " look, I can parse a powerpoint file";
        callSomeOtherFunction(newText);
    })
    .catch(err => console.error(err));

// Search for a term in the parsed text.
function searchForTermInOfficeFile(searchterm: string, filepath: string): Promise<boolean> {
    return officeParser.parseOfficeAsync(filepath)
        .then(data => data.indexOf(searchterm) !== -1);
}

Please take note: I have broken with convention by placing err as the second argument of the callback. I had to do this to avoid breaking other people's existing modules.
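
Because the callback receives data first, helpers that expect Node's error-first convention, such as util.promisify, cannot wrap parseOffice directly. A minimal sketch of an adapter is shown here; toErrorFirst is a hypothetical helper, not part of officeParser, and it assumes only the parseOffice(file, callback, config) shape used in the examples above.

```javascript
const { promisify } = require('util');

// Hypothetical adapter (not an officeParser API): wraps a parser whose
// callback is called as (data, err) and exposes a Node-style function
// with the signature (file, config, callback), where callback is (err, data).
function toErrorFirst(parser) {
    return function (file, config, callback) {
        parser.parseOffice(file, function (data, err) {
            // Swap the argument order; normalise a missing error to null.
            callback(err || null, data);
        }, config);
    };
}

// With the callback last and error-first, util.promisify can be applied.
function toPromise(parser) {
    return promisify(toErrorFirst(parser));
}
```

With the real library, the parser argument would be the object returned by require('officeparser'). For plain promise usage, parseOfficeAsync already covers this; the adapter is mainly useful when surrounding code expects error-first callbacks.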

Known Bugs

  1. Footnotes and endnotes are positioned inconsistently and incorrectly: in .docx files they end up at the end of the parsed text, whereas in .odt files they are placed directly after the referenced word.
  2. The chart and object information of .odt files is not accurate and may show a few NaN values in some cases.

npm https://npmjs.com/package/officeparser

github https://github.com/harshankur/officeParser