stumpapp / stump

A free and open source comics, manga and digital book server with OPDS support (WIP)
https://stumpapp.dev
MIT License

[FEATURE] Minimal file reads during initial scan #327

Closed by Strontium 2 months ago

Strontium commented 5 months ago

Is your feature request related to a problem? Please describe. When the file system is remote to the Stump host, such as files kept on a cloud storage provider, there may be both bandwidth and egress limits on accessing data. Currently, during the initial scan of a book, Stump appears to read the entire file. With cloud storage via an rclone mount, this scan causes the file to be downloaded in its entirety, which is inefficient and likely slow for large files. When opening a book for reading, by contrast, Stump appears to request only the pages it needs, so rclone downloads only those parts of the file, reducing traffic and loading times.

Describe the solution you'd like Allow an option to reduce file access during the initial scan. The scan should be limited to:

Describe alternatives you've considered Stump being aware of remote storage and automatically limiting itself appropriately.

Additional context In my testing with an rclone mount, compared with other applications:

aaronleopold commented 5 months ago

> Stump: Reads full file on SCAN and partial file on OPEN

The main reason Stump reads the full file on scan is to determine the actual page count. That operation involves iterating through each file in an archive to determine whether it is a valid page (i.e., an image file). The validity check uses the actual byte content, falling back on the extension, as a way of being more accurate about what is truly a valid page. The only feasible way to allow for partial reads would be:

This could be a configuration. I'll have to think on implementation details, though.
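For illustration, here is a minimal Rust sketch of the kind of validity check described above: magic bytes first, extension as a fallback. The function name, the set of formats, and the calling convention are all assumptions for this example, not Stump's actual implementation.

```rust
use std::path::Path;

/// Hypothetical sketch of a page-validity check: inspect the first bytes of
/// an archive entry, and fall back on the file extension when the byte check
/// is inconclusive. Formats and names here are illustrative only.
fn is_valid_page(header: &[u8], entry_name: &str) -> bool {
    // Magic numbers for a few common image formats.
    let by_magic = header.starts_with(&[0x89, b'P', b'N', b'G']) // PNG
        || header.starts_with(&[0xFF, 0xD8, 0xFF]) // JPEG
        || (header.len() >= 12 && header.starts_with(b"RIFF") && &header[8..12] == b"WEBP"); // WebP
    if by_magic {
        return true;
    }
    // Fallback: trust the extension when the content check didn't match.
    Path::new(entry_name)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| matches!(ext.to_ascii_lowercase().as_str(), "png" | "jpg" | "jpeg" | "webp"))
        .unwrap_or(false)
}

fn main() {
    // A JPEG is recognized by content even with a misleading name.
    assert!(is_valid_page(&[0xFF, 0xD8, 0xFF, 0xE0], "page01.dat"));
    // With no readable header, the extension alone decides.
    assert!(is_valid_page(&[], "page02.png"));
    assert!(!is_valid_page(&[], "ComicInfo.xml"));
}
```

The content check itself only needs the first few bytes of each entry, but as described above, walking every entry of an archive is what forces the full file to be read during a scan.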

Strontium commented 5 months ago

Is it often that you would have an invalid file with an incorrect extension? What is the downside to having a potentially inaccurate page count? I think that dropping that level of accuracy is a reasonable trade-off to improve initial scan performance and, in my case, to reduce remote traffic. If it is easier to implement, I'd still be happy if it were an optional feature and/or there were some prerequisites to making it work (i.e., having the required metadata in the file). Thanks for considering it.

aaronleopold commented 5 months ago

I can't speak to how often you would have a file inside an archive with the wrong/invalid extension. I'd hope it isn't often, and FWIW I haven't encountered the situation personally 😅

> What is the downside to having a potentially inaccurate page count?

  1. If there are more pages than what is observed, you likely won't be able to access any of the content. For example, if there are actually 30 pages but for some reason Stump only observed 28 valid pages, API validation would essentially prevent you from even trying to query past the 28th page.
  2. If there are fewer pages than what is observed, internal server errors will start to be thrown as Stump attempts to extract nonexistent pages from the file.

Not the end of the world, just things to consider as part of the trade-off. I'll try to see what the general consensus is for this change in behavior before committing to it.
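To make the first downside concrete, here is a hypothetical sketch of that kind of bounds check; the function and parameter names are invented for illustration.

```rust
/// Hypothetical sketch: `recorded_page_count` stands in for whatever count
/// the scan stored. If it under-reports the real count, valid pages beyond
/// it are rejected here before any file access happens (downside 1).
fn validate_page_request(requested: u32, recorded_page_count: u32) -> Result<(), String> {
    if requested == 0 || requested > recorded_page_count {
        return Err(format!(
            "page {requested} out of range (1..={recorded_page_count})"
        ));
    }
    // Downside 2 is the mirror image: if the recorded count over-reports,
    // this check passes but the later extraction fails with a server error.
    Ok(())
}
```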

aaronleopold commented 3 months ago

https://github.com/stumpapp/stump/pull/353 will be removing that magic header method for determining content type during scans. Once that lands, the only reads during a scan (for ZIP/RAR files) should be when a ComicInfo.xml file is present.

Analysis jobs (still experimental) will still fully open files, but this is separate from scanning.
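For illustration, a sketch of that kind of targeted read using the `zip` crate (an assumption for this example; this is not Stump's actual code). Because ZIP has a central directory, a seekable reader only needs the directory plus the bytes of the one entry, not the whole archive:

```rust
use std::fs::File;
use std::io::Read;

use zip::ZipArchive; // assumed dependency for this sketch

/// Read only the ComicInfo.xml entry from a .cbz/.zip archive, if present.
fn read_comic_info(path: &str) -> zip::result::ZipResult<Option<String>> {
    let mut archive = ZipArchive::new(File::open(path)?)?;
    match archive.by_name("ComicInfo.xml") {
        Ok(mut entry) => {
            let mut xml = String::new();
            entry.read_to_string(&mut xml)?;
            Ok(Some(xml))
        }
        // No metadata entry: nothing further needs to be read during a scan.
        Err(zip::result::ZipError::FileNotFound) => Ok(None),
        Err(e) => Err(e),
    }
}
```

On an rclone mount, a seekable read like this should translate into downloading only the byte ranges for the central directory and the single entry, matching the reduced traffic described in the original request.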

aaronleopold commented 2 months ago

Completed and released as part of v0.0.4