scambier / obsidian-text-extractor

A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
GNU General Public License v3.0

Feature/add office files support #52

Closed demig00d closed 10 months ago

demig00d commented 10 months ago

Added support for docx and xlsx files. This PR addresses #10.

Documents are now parsed as plain text, but this approach loses hyperlinks. To solve this problem, we could consider parsing these files to Markdown instead. @scambier, what do you think?

Parsing to Markdown could be achieved by first converting docx to HTML (the mammoth lib I've added supports this) and then converting the HTML to Markdown (this requires another external dependency, but it could also be useful if we plan to support HTML files).
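To illustrate why the HTML detour keeps hyperlinks, here is a minimal sketch of the second step only. It assumes the HTML has already been produced from the docx file. The helper names `htmlToMarkdownSketch` and `htmlToPlainText` are hypothetical, and the regex-based conversion handles only paragraphs and links; a real implementation would rely on a dedicated HTML-to-Markdown library.

```typescript
// Hypothetical helper: converts a small subset of HTML (paragraphs and links)
// to Markdown. This is a sketch, not a replacement for a real converter:
// it does not decode entities or handle nested markup, lists, or tables.
function htmlToMarkdownSketch(html: string): string {
  return html
    // <a href="url">text</a> becomes [text](url), so the link target survives
    .replace(/<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gi, "[$2]($1)")
    // Closing paragraph tags become blank lines
    .replace(/<\/p>/gi, "\n\n")
    // Every remaining tag is dropped
    .replace(/<[^>]+>/g, "")
    .trim();
}

// For comparison: plain-text extraction drops all tags, including link targets.
function htmlToPlainText(html: string): string {
  return html
    .replace(/<\/p>/gi, "\n\n")
    .replace(/<[^>]+>/g, "")
    .trim();
}

const sampleHtml =
  '<p>See the <a href="https://example.com/spec">spec</a> for details.</p><p>Done.</p>';

console.log(htmlToMarkdownSketch(sampleHtml));
console.log(htmlToPlainText(sampleHtml));
```

With the sample input, the Markdown version keeps `https://example.com/spec` as a link target, while the plain-text version retains only the word "spec".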

As for xlsx files, we could convert them to CSV (the sheetjs lib I'm using gets the plain text from its csv function anyway) and then convert the CSV to Markdown.
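The CSV-to-Markdown step could look roughly like the sketch below. It assumes the CSV text for a sheet is already available as a string. The helpers `parseCsvLine` and `csvToMarkdownTable` are hypothetical names, and the parser is deliberately simple: it handles double-quoted fields but not line breaks inside a quoted field.

```typescript
// Hypothetical helper: splits one CSV line into cells, honoring double-quoted fields.
function parseCsvLine(line: string): string[] {
  const cells: string[] = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"'; // a doubled quote is an escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      cells.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  cells.push(current);
  return cells;
}

// Hypothetical helper: renders CSV text as a Markdown table,
// treating the first row as the header.
function csvToMarkdownTable(csv: string): string {
  const rows = csv
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map(parseCsvLine);
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((r) => r.length));
  const render = (cells: string[]): string => {
    // Pad short rows so every row has the same number of columns
    const padded: string[] = [...cells];
    while (padded.length < width) padded.push("");
    // Escape pipes so cell content cannot break the table layout
    return "| " + padded.map((c) => c.replace(/\|/g, "\\|")).join(" | ") + " |";
  };
  const lines = rows.map(render);
  const separator = "| " + new Array<string>(width).fill("---").join(" | ") + " |";
  lines.splice(1, 0, separator); // the separator goes right after the header row
  return lines.join("\n");
}

console.log(csvToMarkdownTable('Name,Link\n"Doe, Jane",https://example.com'));
```

A workbook with several sheets would need one table per sheet; how to label them is left open here.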

scambier commented 10 months ago

Thanks for that PR :)

> To solve this problem we could consider parsing these files to a markdown format instead

If that's included in the lib, I guess there's no reason to not take advantage of it 👍

demig00d commented 10 months ago

> Thanks for that PR :)
>
> > To solve this problem we could consider parsing these files to a markdown format instead
>
> If that's included in the lib, I guess there's no reason to not take advantage of it 👍

Unfortunately this is not the case.

What I suggested was using intermediate formats from which we can get Markdown (with additional dependencies). The thing is, all the JS libraries that convert office files directly to Markdown do the same thing under the hood, at least the ones I could find.

scambier commented 10 months ago

If you can manage to keep the URLs that's fine, but honestly I don't even know if they are handled correctly when extracting the text from PDFs either 🤷‍♂️

demig00d commented 10 months ago

I'll try working with markdown in another PR then, if you don't mind.