harshankur / officeParser

A Node.js library to parse text out of any office file. Currently supports docx, pptx, xlsx, odt, odp, and ods.
MIT License
123 stars 17 forks

Strange tests were introduced #30

Closed c121914yu closed 3 months ago

c121914yu commented 4 months ago

import { ReadFileByBufferParams, ReadFileResponse } from './type.d';
import { parseOfficeAsync } from 'officeparser';

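export const readPptxRawText = async ({ buffer, encoding }: ReadFileByBufferParams): Promise<ReadFileResponse> => {
  // parseOfficeAsync returns a Promise, so it must be awaited to log the parsed text
  const result = await parseOfficeAsync(buffer);
  console.log(result);
  // Placeholder return value: the extracted text is not wired up yet
  return {
    rawText: ''
  };
};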

I simply imported the library, and before I had implemented anything, inexplicable tests started to appear. Has anyone else come across this?

ChadHelbling commented 4 months ago

Seeing the same. I think it's from the pdf-parse dependency: https://gitlab.com/autokent/pdf-parse/-/issues/24

It looks like an ESM issue. There's a recommended workaround, importing from the child file 'pdf-parse/lib/pdf-parse', that this package will likely have to implement.
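For anyone calling pdf-parse directly, the workaround might look roughly like the sketch below. It loads the inner file rather than the package entry point. The helper names (`loadPdfParse`, `readPdfRawText`), the injectable `load` parameter, and the added `.js` extension on the path are my own assumptions for illustration, not part of officeParser or pdf-parse.

```typescript
// Minimal shape of the function exported by pdf-parse (only the field used here).
type PdfParseFn = (data: Uint8Array) => Promise<{ text: string }>;

// Hypothetical loader: imports the inner module instead of the package entry point.
// Assumption: the stray test run comes from code in the entry file, which this path skips.
const loadPdfParse = async (): Promise<PdfParseFn> => {
  // Path taken from the linked workaround; the ".js" suffix is an assumption for ESM resolution.
  const specifier = 'pdf-parse/lib/pdf-parse.js';
  const mod = await import(specifier);
  // A CommonJS module loaded via import() usually exposes its export as `default`.
  return (mod.default ?? mod) as PdfParseFn;
};

// Hypothetical helper: `load` is injectable so it can be exercised without the real package.
export async function readPdfRawText(
  buffer: Uint8Array,
  load: () => Promise<PdfParseFn> = loadPdfParse
): Promise<string> {
  const pdfParse = await load();
  const { text } = await pdfParse(buffer);
  return text;
}
```

This only helps code that imports pdf-parse itself. Since officeparser pulls in pdf-parse internally, the same change would have to be made inside this package.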

harshankur commented 3 months ago

I will push a new commit that switches PDF parsing from the pdf-parse library to Mozilla's pdf.js library, which is incredibly popular. Because pdf.js also has ESM problems, I am committing a local build of pdf.js into the code to keep things simple. That should fix this. Please reopen this issue if the bug persists after the new version, which will likely be 4.1.0.