Support partial updates

codetheweb commented 3 years ago

My goal is to have a folder of auto-updating ebooks (cron job). I saw that there's a --cache flag, but even running with the cache on there's a lot of unnecessary processing for any download after the initial one. Would it be possible to add some kind of partial update mode, where if an .epub already exists it checks for and downloads just 1-2 chapters instead of re-assembling the whole book?

I would be happy to add this myself, just want to hear your thoughts and where to begin implementing this.

(Also, you should set up GitHub Sponsors / Buy Me a Coffee or something. Would be happy to throw a few bucks your way and I'm sure other folks would too. 😄)

mathiasfoster commented 3 years ago

+1

kemayo commented 2 years ago

I have been holding off on this because I think it's sort of complicated. To brain-dump what I think the complexities are:

- I'd need to write new code to parse an existing epub. This isn't hard, but it's something I don't currently cover.
  - This isn't just getting the chapter list out -- there's also book-level metadata that needs to be rebuilt.
  - Specifically I'm thinking of footnotes as being a pain. They're currently built with ids just based on the number of footnotes found so far; that'd probably need to be changed to a GUID system like many other ids, and then we could just append new footnotes onto the end of an existing file.
- I'd maybe need to write new code to alter the existing epub file rather than creating it from scratch. Alternately, document and warn that any edits you've made to metadata we don't explicitly handle will be overwritten.
- Some sites wouldn't be compatible with this. A site that's using the crawler method (JSON following the next-chapter links through a story, rather than a table of contents) won't be able to pick up where it left off, unless we store more metadata in the ebooks to cover this case.
- We'd need to decide how to handle matching up chapters on the server and locally. Do we just trust the numbering, or do we match on something more specific?
  - If the former, we'd have trouble when a chapter is deleted (e.g. some people put up placeholder chapters to announce delays/hiatuses).
  - If the latter, we'd have to decide what to do when a chapter vanishes from the server -- do we delete it from the local copy?
- Probably an edge case, but this wouldn't fetch edits to an existing chapter.
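
To make the "match on something more specific" option concrete, here's a rough sketch that compares chapters by a stable key rather than by position. Everything in it is hypothetical: `Chapter`, `diff_chapters`, and the idea of using the chapter URL as the key are assumptions for illustration, not how leech currently models chapters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chapter:
    # Hypothetical record; not leech's actual chapter type.
    key: str    # assumed stable identifier, e.g. the chapter URL
    title: str


def diff_chapters(local, remote):
    """Compare chapter lists by key instead of by position.

    Returns (to_fetch, vanished): chapters that exist only on the
    server, and chapters that exist only in the local copy.
    """
    local_keys = {c.key for c in local}
    remote_keys = {c.key for c in remote}
    to_fetch = [c for c in remote if c.key not in local_keys]
    vanished = [c for c in local if c.key not in remote_keys]
    return to_fetch, vanished


# A placeholder chapter was deleted upstream and a real one was added.
local = [Chapter("/s/1/1", "One"), Chapter("/s/1/2", "Hiatus notice")]
remote = [Chapter("/s/1/1", "One"), Chapter("/s/1/3", "Two")]
to_fetch, vanished = diff_chapters(local, remote)
print([c.title for c in to_fetch])   # ['Two']
print([c.title for c in vanished])   # ['Hiatus notice']
```

Matching by position would treat "Two" as already downloaded here, since both lists have two entries. Keying avoids that, but it still leaves open what to do with the `vanished` list, and it doesn't notice edits to a chapter whose key is unchanged.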

codetheweb commented 2 years ago

Yeah, after opening this I realized it would probably be a lot more complicated than I thought at first. I think using some kind of intermediate storage like an SQLite database or something might work better than trying to read back data from the generated epub.
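
Roughly what I have in mind, as a minimal sketch using Python's built-in `sqlite3` module. The table layout and the function names (`open_store`, `missing_chapters`, `save_chapter`) are made up for illustration; nothing like this exists in leech today.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical chapter store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chapters ("
        " story_url TEXT NOT NULL,"
        " chapter_url TEXT NOT NULL,"
        " position INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " html TEXT NOT NULL,"
        " PRIMARY KEY (story_url, chapter_url))"
    )
    return conn


def missing_chapters(conn, story_url, remote_urls):
    """Return the remote chapter URLs that are not stored yet, in order."""
    rows = conn.execute(
        "SELECT chapter_url FROM chapters WHERE story_url = ?", (story_url,)
    )
    known = {row[0] for row in rows}
    return [url for url in remote_urls if url not in known]


def save_chapter(conn, story_url, chapter_url, position, title, html):
    """Insert a chapter, or overwrite it if it was stored before."""
    conn.execute(
        "INSERT OR REPLACE INTO chapters VALUES (?, ?, ?, ?, ?)",
        (story_url, chapter_url, position, title, html),
    )
    conn.commit()
```

A daily run would then only download whatever `missing_chapters` returns and rebuild the epub from the stored rows, so the epub itself never has to be parsed.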

Given that I really only need to update books once a day at most, I think just using the cache, or scraping directly from the web, works well enough for now.

mathiasfoster commented 2 years ago

I've put together a scraper that works off RSS feeds: it downloads the content, converts it to MOBI, and emails it to my Kindle. It still needs a bit of work before it's ready to be open-sourced, unfortunately!

From a user perspective (if this were ever integrated into leech), it would make more sense for my use case to convert each new chapter into its own EPUB, rather than altering the combined EPUB to integrate it.

kemayo / leech

Support partial updates #63