phiresky / ripgrep-all

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

ripgrep-all caches the same file multiple times if different lists of irrelevant adapters are active #177

Open vejkse opened 11 months ago

vejkse commented 11 months ago

Describe the bug

I had previously noticed that the cache had become abnormally large, even though only a few files had been added to the directories I was searching. That was with the database format used before, which I was not familiar with.

Now that the cache is an SQLite database, I understand what probably happened. Adding, for instance, a custom adapter for djvu files changes the list of active adapters that is recorded in the database for a PDF file, even though the djvu adapter is not applicable to that file.

Shouldn’t this list only contain adapters that are applicable to the given file, to avoid recaching the file each time an irrelevant adapter is added, modified or disabled?
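To make the suggestion concrete, here is a minimal Python sketch of the idea, not ripgrep-all's actual code. The adapter table, the glob patterns, the names pdf_adapter.v1 and djvu_adapter.v1, and the key functions are all hypothetical; only ffmpeg.v1 appears in this report. It compares a cache key built from every active adapter with one built only from the adapters that apply to the file.

```python
from fnmatch import fnmatch

# Hypothetical adapter table (name -> glob patterns). Only "ffmpeg.v1" is a
# name taken from the report; everything else here is made up for illustration.
ADAPTERS = {
    "pdf_adapter.v1": ["*.pdf"],
    "ffmpeg.v1": ["*.mkv", "*.mp4"],
    "djvu_adapter.v1": ["*.djvu"],
}


def applicable(adapter, filename):
    """True if any of the adapter's patterns matches the file name."""
    return any(fnmatch(filename.lower(), pat) for pat in ADAPTERS[adapter])


def key_all(filename, active):
    """Cache key that records every active adapter, relevant or not."""
    return (filename, tuple(sorted(active)))


def key_applicable(filename, active):
    """Cache key that records only adapters applicable to this file."""
    return (filename, tuple(sorted(a for a in active if applicable(a, filename))))


with_ffmpeg = ["pdf_adapter.v1", "ffmpeg.v1"]
without_ffmpeg = ["pdf_adapter.v1"]

# Keys built from the full list differ, so the file would be cached twice.
print(key_all("xyz.pdf", with_ffmpeg) == key_all("xyz.pdf", without_ffmpeg))
# Keys built from applicable adapters only are identical.
print(key_applicable("xyz.pdf", with_ffmpeg) == key_applicable("xyz.pdf", without_ffmpeg))
```

With the filtered key, toggling an adapter that cannot handle xyz.pdf would leave the key unchanged, so the existing cache entry could be reused.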

To Reproduce

Run rga --rga-cache-path=/tmp/throwaway_cache abc in a directory containing a single PDF file named xyz.pdf. Then run rga --rga-cache-path=/tmp/throwaway_cache --rga-adapters=-ffmpeg abc, which disables the ffmpeg adapter, an adapter that doesn’t apply to PDF files.

Output

The preproc_cache table in /tmp/throwaway_cache/cache.sqlite3 contains two entries for xyz.pdf, one where the active_adapters field contains ffmpeg.v1 and one where it doesn’t, so the cache is twice as large as it needs to be.
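One way to check for this is to count rows per distinct active_adapters value, sketched below with Python's sqlite3 module. The table name preproc_cache and the column active_adapters come from the observation above. The file_path column, the JSON-like text format of the adapter list, and the name pdf_adapter.v1 are assumptions made only so the sketch can run against a stand-in in-memory database.

```python
import sqlite3


def adapter_list_counts(conn):
    """Return (active_adapters, row_count) for each distinct adapter list."""
    return conn.execute(
        "SELECT active_adapters, COUNT(*) FROM preproc_cache "
        "GROUP BY active_adapters ORDER BY active_adapters"
    ).fetchall()


# Stand-in database so the sketch runs anywhere. This two-column schema and
# the stored text format are assumptions, not the real cache layout.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE preproc_cache (file_path TEXT, active_adapters TEXT)")
demo.executemany(
    "INSERT INTO preproc_cache VALUES (?, ?)",
    [
        ("xyz.pdf", '["pdf_adapter.v1","ffmpeg.v1"]'),
        ("xyz.pdf", '["pdf_adapter.v1"]'),
    ],
)

# Two distinct adapter lists for one file indicate a duplicated entry.
for adapters, count in adapter_list_counts(demo):
    print(count, adapters)
```

To inspect a real cache, the same function could be called on sqlite3.connect("/tmp/throwaway_cache/cache.sqlite3") instead of the in-memory stand-in.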

Output of rga --version

ripgrep-all 1.0.0-alpha.5 (commit 16b2059).

phiresky commented 11 months ago

This should only happen for archives, where the list of adapters (including adapters other than the one for the main file) can affect the result of preprocessing. If it happens for all files, it's a bug.