[FEATURE] replace pandoc with epub2txt2 for Epub search

phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

[FEATURE] replace pandoc with epub2txt2 for Epub search #138

Closed mindreframer closed 1 year ago

mindreframer commented 2 years ago

First, what an awesome project! It really makes searching huge document libraries possible.

Currently I have a lot of issues with EPUB parsing: pandoc hangs forever at 100% CPU when parsing some EPUB files, sometimes bigger ones, but sometimes smaller ones too. I don't have a good workaround for this yet.

I tried parsing the files that cause issues with https://github.com/kevinboone/epub2txt2 and it returns the content instantly. Also, judging by the number of issues here about EPUB parsing, this could be a good solution for many of them.

Please consider allowing epub2txt2 to be used as a backend for EPUB extraction.

Thanks!

phiresky commented 2 years ago

In the next version (when I or someone else finally manages to make it work), the preprocessors will be configurable per file type.

mindreframer commented 2 years ago

@phiresky OMG, that would be awesome! Any idea what the configuration would look like? E.g., if I want to override the Docx preprocessor, how would I specify it?

ghost commented 2 years ago

> In the next version (when I or someone else finally manages to make it work), the preprocessors will be configurable per file type.

Such a feature would completely eliminate the embarrassing freezes when searching through folders of EPUB files.

Any idea when the next release will be out?

Thanks for the great tool, btw!

phiresky commented 1 year ago

Starting with 1.0.0, it's possible to add custom adapters via the config file. If someone has a good suggestion for a file type, please post it in show-your-adapter.
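
For anyone who wants to try this, here is a rough sketch of a wrapper that a custom adapter could call. It is not taken from rga or epub2txt2. It assumes the adapter receives the EPUB bytes on stdin and must write plain text to stdout (please check the rga documentation for how custom adapters are actually invoked), and that epub2txt2 installs a binary named `epub2txt` that accepts a file path. The helper names `extract_text` and `main` are hypothetical. The timeout is there so that a misbehaving extractor cannot hang the search the way pandoc did in the original report.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def extract_text(data: bytes, command: list[str], timeout: float = 30.0) -> str:
    """Write `data` to a temporary .epub file, run `command <file>`, return its stdout.

    Raises subprocess.TimeoutExpired if the extractor runs longer than `timeout`
    seconds, and subprocess.CalledProcessError if it exits with a non-zero status.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "input.epub"
        path.write_bytes(data)
        result = subprocess.run(
            command + [str(path)],
            capture_output=True,
            timeout=timeout,  # the child process is killed when this expires
            check=True,
        )
    return result.stdout.decode("utf-8", errors="replace")


def main() -> None:
    # Assumption: `epub2txt` is on PATH and takes the EPUB path as its argument.
    try:
        text = extract_text(sys.stdin.buffer.read(), ["epub2txt"])
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as exc:
        sys.stderr.write(f"extraction failed: {exc}\n")
        sys.exit(1)
    sys.stdout.write(text)
```

To use it as a script, call `main()` at the bottom of the file and point the adapter's binary at it. Any other extractor that takes a file path can be swapped in by changing the command list.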