nleroy917 / textractor

A simple text extractor for various files. Includes core functionality for extracting text from files, a command-line interface, a RESTful API, and Python bindings.

Optimize powerpoint file extraction #1

Open nleroy917 opened 1 month ago

nleroy917 commented 1 month ago

The PowerPoint text extraction leaves a lot to be desired. It's a little oversimplified and doesn't find text that isn't directly in the ppt/slides/ directory. Should it look elsewhere in the archive too?

nleroy917 commented 1 month ago

@isaac-d-cohen let me know if you have thoughts. Would love contributions! I'm winging all of this. Big Rust noob.

Isaac-D-Cohen commented 1 month ago

Thanks for opening this issue! Yeah, I agree that it should find text anywhere on the slides. I also found an example of a slide with text directly on it that the text extraction feature doesn't find: test4.pptx

I don't know how to go about this though. I'm an even bigger noob, having just learned Rust this spring semester in one of my classes. But I would guess we need to find out where in a PPTX text can legally be located. It seems really daunting though.
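As a rough illustration of "where text can be located", here is a minimal sketch of a filter over archive entry names. It assumes the conventional part names PowerPoint writes (slides, speaker notes, SmartArt diagrams, charts, comments); the list is illustrative rather than exhaustive, and `TEXT_PART_PREFIXES` and `may_contain_text` are hypothetical names, not part of the crate.

```rust
/// Hypothetical list of directory prefixes inside a .pptx archive where
/// PowerPoint conventionally stores parts that can carry text.
/// Assumption: illustrative only, not a complete list from the spec.
const TEXT_PART_PREFIXES: [&str; 5] = [
    "ppt/slides/",      // slide bodies
    "ppt/notesSlides/", // speaker notes
    "ppt/diagrams/",    // SmartArt data
    "ppt/charts/",      // chart titles and labels
    "ppt/comments/",    // reviewer comments
];

/// Returns true if an archive entry name looks like an XML part that may
/// hold text. Relationship files (`*.rels`) are skipped because they do not
/// end in `.xml`.
fn may_contain_text(entry_name: &str) -> bool {
    entry_name.ends_with(".xml")
        && TEXT_PART_PREFIXES
            .iter()
            .any(|prefix| entry_name.starts_with(prefix))
}
```

An extractor could run every entry name from the zip through a check like this instead of hard-coding ppt/slides/, and grow the prefix list as new cases such as test4.pptx turn up.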

nleroy917 commented 1 month ago

> It seems really daunting though.

Yeah, the Open XML specification is absurd. Whoever works on the PowerPoint extractor would probably have to read the actual PresentationML documentation to really figure it all out.

Realistically, the extractor will just have to be updated incrementally, so the crate parses PPTX files better and better over time.
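One possible first increment, sketched under assumptions: in PresentationML, visible text runs sit in DrawingML `<a:t>` elements, so pulling out those elements from any part gets most of the text. The function below uses plain string scanning in place of a real XML parser, assumes the usual `a:` namespace prefix, and leaves entities such as `&amp;` undecoded. `extract_text_runs` is a hypothetical name, not the crate's API.

```rust
/// Collects the contents of every `<a:t>...</a:t>` element in `xml`.
/// Assumption: string scanning stands in for a real XML parser, so the
/// `a:` prefix is hard-coded and entities are not decoded.
fn extract_text_runs(xml: &str) -> Vec<String> {
    let mut runs = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<a:t") {
        let after_name = &rest[start + 4..];
        // Accept only `<a:t>` or `<a:t ...>`; skip `<a:tbl>`, `<a:tc>`, `<a:t/>`.
        if !matches!(after_name.chars().next(), Some('>') | Some(' ')) {
            rest = after_name;
            continue;
        }
        let Some(open_end) = after_name.find('>') else { break };
        // Skip a self-closing `<a:t />`, which has no text body.
        if after_name[..open_end].ends_with('/') {
            rest = &after_name[open_end + 1..];
            continue;
        }
        let body = &after_name[open_end + 1..];
        let Some(close) = body.find("</a:t>") else { break };
        runs.push(body[..close].to_string());
        rest = &body[close + "</a:t>".len()..];
    }
    runs
}
```

This ignores paragraph boundaries (`<a:p>`), so runs from the same paragraph come back as separate strings; a proper XML parser would be the more robust choice once the basic approach is confirmed.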