demig00d closed 10 months ago
Thanks for that PR :)
To solve this problem we could consider parsing these files to a markdown format instead
If that's included in the lib, I guess there's no reason to not take advantage of it 👍
Unfortunately this is not the case.
I suggested using intermediate formats from which we can get markdown (with additional dependencies). The thing is, all the JS libraries that convert office files directly to markdown do the same under the hood, at least the ones I could find.
If you can manage to keep the URLs, that's fine, but honestly I don't even know if they are handled correctly when extracting the text from PDFs either 🤷‍♂️
I'll try working with markdown in another PR then, if you don't mind.
Added support for docx and xlsx files. This PR addresses #10.
Documents are now parsed as plain text, but such an approach results in a loss of hyperlinks. To solve this problem we could consider parsing these files to a markdown format instead, @scambier, what do you think?
Parsing to markdown could be achieved by parsing docx to html (the mammoth lib I've added supports this) and then converting to markdown (this requires another external dependency, but could be useful if we plan to support html).
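To illustrate why the html route would keep hyperlinks, here is a rough, dependency-free sketch of the second step. It assumes the HTML string has already been produced from the docx (for example by mammoth's html conversion). `htmlToMarkdownLinks` is a hypothetical helper written for this comment, not part of this PR or of any library, and it is only a simplified stand-in for a real html-to-markdown converter: it handles `<a href="...">` links and paragraph breaks, drops every other tag, and does not decode HTML entities.

```typescript
// Hypothetical helper: a simplified stand-in for a real html-to-markdown library.
// Assumes double-quoted href attributes and no nested tags inside links.
function htmlToMarkdownLinks(html: string): string {
  return html
    // <a href="url">text</a>  ->  [text](url)
    .replace(/<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)<\/a>/gi, "[$2]($1)")
    // closing paragraph tags become blank lines
    .replace(/<\/p>/gi, "\n\n")
    // strip every remaining tag
    .replace(/<[^>]+>/g, "")
    .trim();
}

const sampleHtml =
  '<p>See <a href="https://example.com">the docs</a>.</p><p>Done</p>';
const sampleMarkdown = htmlToMarkdownLinks(sampleHtml);
```

With plain-text extraction the URL in `sampleHtml` would be lost, while here it survives as a markdown link. A proper converter would still be needed for headings, lists, and tables.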
As for xlsx files, we could convert them to a csv format (the sheetjs lib I'm using gets the plain text from its csv function anyway) and then convert them to md.
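The csv-to-md step could look roughly like the sketch below. It assumes the csv text for a sheet is already available (e.g. from sheetjs's csv output) and that fields are simple: no quoted fields, embedded commas, or embedded newlines. `csvToMarkdownTable` is a hypothetical name used only for illustration, and the first row is treated as the table header, which is an assumption about the sheet layout.

```typescript
// Hypothetical helper: turns simple csv text into a markdown table.
// Assumption: no quoted fields, embedded commas, or embedded newlines.
function csvToMarkdownTable(csv: string): string {
  const text = csv.trim();
  if (text === "") return "";

  // Split into rows and cells; escape pipes so they don't break the table.
  const rows: string[][] = text
    .split(/\r?\n/)
    .map((line) => line.split(",").map((cell) => cell.trim().replace(/\|/g, "\\|")));

  // Pad short rows so every row has the same number of columns.
  const width = Math.max(...rows.map((r) => r.length));
  const toLine = (r: string[]): string => {
    const padded = [...r, ...Array<string>(width - r.length).fill("")];
    return `| ${padded.join(" | ")} |`;
  };

  const lines = rows.map(toLine);
  // Insert the header separator after the first row.
  lines.splice(1, 0, `| ${Array<string>(width).fill("---").join(" | ")} |`);
  return lines.join("\n");
}

const sampleTable = csvToMarkdownTable("name,qty\napple,3\npear,5");
```

Real spreadsheets will often contain quoted or multi-line cells, so a proper csv parser would be needed before something like this could be relied on.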