phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

Cache converter/Read from Cache #165

Closed by FergusFettes 1 year ago

FergusFettes commented 1 year ago

Hi, is there a way to read the cache data directly? I tried opening it with the Python lmdb library, but the values are compressed and I couldn't get Python to decode them.

The thing is, I'm using rga to process lots of image files, and tesseract takes a while, so it would be good to be able to reuse the output.

I could just run tesseract over all the files myself, but it seems like it would be nice to reuse the functionality that rga already has.

Unfortunately I'm not a Rust programmer, or I would have spent more time trying to figure this out. Any pointers appreciated.

This issue is somewhat similar to this one: https://github.com/phiresky/ripgrep-all/issues/56

phiresky commented 1 year ago

You should be able to "easily" decompress the values with https://github.com/indygreg/python-zstandard. As for the keys (file names), I don't remember right now how I'm encoding them.
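A minimal sketch of what decompressing one value could look like with python-zstandard. It assumes each cache value is a bare zstd frame with no extra wrapping, which this thread does not confirm; the helper names `looks_like_zstd` and `decompress_value` are made up for illustration. The magic-number check is there so that a wrong assumption fails loudly instead of producing garbage.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # first four bytes of every zstd frame


def looks_like_zstd(value: bytes) -> bool:
    """Return True if the raw value starts with the zstd frame magic number."""
    return value[:4] == ZSTD_MAGIC


def decompress_value(value: bytes) -> bytes:
    """Decompress one raw cache value (ASSUMED to be a plain zstd frame)."""
    if not looks_like_zstd(value):
        raise ValueError("value does not start with a zstd frame header")

    import zstandard  # third-party: pip install zstandard

    # decompressobj() copes with frames that omit the content size, which a
    # one-shot ZstdDecompressor().decompress() call would reject.
    return zstandard.ZstdDecompressor().decompressobj().decompress(value)
```

The raw `value` bytes would come from whatever is used to iterate the database, for example a cursor from the Python lmdb library.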

This seems like an argument for switching to SQLite though, which has better tooling and is easier to work with.

FergusFettes commented 1 year ago

Thanks! Maybe the db could be switched with a flag?

Great lib btw. I'm using it to keep track of my activity on my laptop; I have this running every minute:

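# One folder per day, one file per capture time.
dir=$(date '+%Y-%m-%d')
filename=$(date '+%H-%M-%S')
mkdir -p "/da/caps/ocr/$dir"

# Grab the full screen, then let ocrmypdf turn the PNG into a searchable PDF.
flameshot full --path "/da/caps/ocr/$dir/$filename.png"
ocrmypdf "/da/caps/ocr/$dir/$filename.png" "/da/caps/ocr/$dir/$filename.pdf"

# Only the PDF is needed for searching.
rm "/da/caps/ocr/$dir/$filename.png"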

This takes a screencap and runs OCR on it, and now with rga I can search through all my activity history :D

phiresky commented 1 year ago

That's a neat idea!

I guess you can get the cached data by running rga-preproc FILE. That'll give you the text version of FILE, read from the cache if it's cached and computed otherwise.

Also, the integrated tesseract functionality will probably be removed in 1.0 because it was kinda niche and not very clean. It'll be possible to re-add it as a custom extractor though. It sounded like you were using it, though your script doesn't.
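To reuse that output from a script, something like the following sketch could work. It assumes `rga-preproc` is on `PATH` and writes the extracted text to stdout; the function names, the `command` parameter and the default `*.pdf` pattern are illustrative choices, not part of rga.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def extract_text(path: Path, command: Sequence[str] = ("rga-preproc",)) -> str:
    """Run `rga-preproc FILE` (or a stand-in command) and return its stdout."""
    result = subprocess.run(
        [*command, str(path)],
        capture_output=True,
        check=True,  # raise CalledProcessError if the preprocessor fails
    )
    # Extracted text is not guaranteed to be valid UTF-8, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")


def extract_tree(
    root: Path,
    pattern: str = "*.pdf",
    command: Sequence[str] = ("rga-preproc",),
) -> dict[Path, str]:
    """Extract text for every file under `root` whose name matches `pattern`."""
    return {p: extract_text(p, command) for p in sorted(root.rglob(pattern))}
```

For the screenshot setup above, `extract_tree(Path("/da/caps/ocr"))` would walk the capture directory; files rga has already seen should come back from the cache rather than being reprocessed.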

FergusFettes commented 1 year ago

ocrmypdf is just a wrapper around tesseract.

phiresky commented 1 year ago

The cache db is now SQLite in the 1.0.0 alpha. The values are still compressed, but reading them should be fairly easy.
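Since the table layout isn't spelled out in this thread, a cautious first step is to ask the SQLite file itself what it contains. The sketch below uses only Python's standard `sqlite3` module and makes no assumption about table or column names; `describe_cache` is a hypothetical helper, and the path to the cache file has to be supplied by the caller.

```python
import sqlite3
from pathlib import Path


def describe_cache(db_path: str) -> dict[str, list[str]]:
    """Map each table in the SQLite file to the names of its columns."""
    # Open read-only so inspecting the cache can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        conn.close()
```

Once the column holding the compressed blobs is identified, its values can be selected with an ordinary query and passed to a zstd decompressor.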