Let's say you want to read some sort of fiction. You're a fan of it, perhaps. But mobile websites are kind of non-ideal, so you'd like a proper ebook made from whatever you're reading.
You need Python 3.9+ and poetry.
My recommended setup process is:
$ pip install poetry
$ poetry install
$ poetry shell
...adjust as needed. Just make sure the dependencies from pyproject.toml get installed somehow.
Basic
$ python3 leech.py [[URL]]
A new file will appear named Title of the Story.epub.
This is equivalent to the slightly longer
$ python3 leech.py download [[URL]]
Flushing the cache
$ python3 leech.py flush
Learn about other options
$ python3 leech.py --help
If you want to put an ePub on a Kindle you'll have to either use Amazon's send-to-kindle tools or convert it. For the latter I'd recommend Calibre, though you could also try using kindlegen directly.
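As a rough sketch of the Calibre route: Calibre ships a command-line tool, ebook-convert, which takes an input file and an output file and picks the target format from the output extension. The helper names below (build_convert_command, convert_for_kindle) are made up for illustration, and the choice of mobi as the default format is an assumption about what your Kindle accepts.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(epub_path: str, output_format: str = "mobi") -> list[str]:
    """Build the ebook-convert command line (hypothetical helper, not part of Leech)."""
    source = Path(epub_path)
    # ebook-convert infers the output format from the target file's extension.
    target = source.with_suffix("." + output_format)
    return ["ebook-convert", str(source), str(target)]


def convert_for_kindle(epub_path: str, output_format: str = "mobi") -> Path:
    """Run Calibre's ebook-convert on an EPUB and return the converted file's path."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; install Calibre first")
    command = build_convert_command(epub_path, output_format)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(command[2])


if __name__ == "__main__":
    print(build_convert_command("Title of the Story.epub"))
```

Calling convert_for_kindle("Title of the Story.epub") would then write the converted file next to the original.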
A very small amount of configuration is possible by creating a file called leech.json in the project directory. Currently you can define login information for sites that support it and some options for book covers; the example also shows image handling, the output directory, and per-site options.
Example:
{
    "logins": {
        "QuestionableQuesting": ["username", "password"]
    },
    "images": {
        "image_fetch": true,
        "image_format": "png",
        "compress_images": true,
        "max_image_size": 100000,
        "always_convert_images": true
    },
    "cover": {
        "fontname": "Comic Sans MS",
        "fontsize": 30,
        "bgcolor": [20, 120, 20],
        "textcolor": [180, 20, 180],
        "cover_url": "https://website.com/image.png"
    },
    "output_dir": "/tmp/ebooks",
    "site_options": {
        "RoyalRoad": {
            "output_dir": "/tmp/litrpg_isekai_trash",
            "image_fetch": false
        }
    }
}
Note: The image_fetch key is a boolean and can only be true or false. Booleans in JSON are written in lowercase. If it is false, Leech will not download any images, and it will also ignore the image_format key.
Note: If the image_format key does not exist, Leech will default to jpeg. The three supported image formats are jpeg, png, and gif. The image_format value is case-insensitive.
Note: The compress_images key tells Leech to compress images. This is only supported for jpeg and png images, and it goes hand-in-hand with the max_image_size key. If compress_images is true but there's no max_image_size key, Leech will compress each image to a size less than 1 MB (1000000 bytes). If max_image_size is present, Leech will compress each image to a size less than its value, which is given in bytes. If compress_images is false, Leech will ignore the max_image_size key.
Warning: Compressing images might make Leech take a lot longer to download images.
Warning: Compressing images might make the image quality worse.
Warning: max_image_size is not a hard limit. Leech will try to compress each image to that size, but it might not be able to reach it exactly.
Warning: max_image_size should not be too small. For instance, if you set max_image_size to 1000, Leech will probably not be able to compress the image to 1000 bytes. If you set it to 1000000, Leech will probably be able to.
Warning: Leech will not compress GIFs, since that might damage the animation.
Note: If always_convert_images is true, Leech will convert all non-GIF images to the specified image_format.
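The defaults described in these notes can be summarised in a short sketch. This is not Leech's own code: resolve_image_options is a hypothetical helper, and treating a missing image_fetch, compress_images, or always_convert_images key as false is an assumption made only for the illustration.

```python
import json

SUPPORTED_FORMATS = {"jpeg", "png", "gif"}
DEFAULT_MAX_IMAGE_SIZE = 1_000_000  # documented fallback: 1 MB, in bytes


def resolve_image_options(config: dict) -> dict:
    """Return the effective image settings for a parsed leech.json (hypothetical helper)."""
    images = config.get("images", {})

    # Assumption: a missing image_fetch key is treated as false here.
    fetch = images.get("image_fetch", False)
    if not isinstance(fetch, bool):
        raise ValueError("image_fetch must be true or false")
    if not fetch:
        # With fetching disabled, image_format and the other keys are ignored.
        return {"image_fetch": False}

    # image_format defaults to jpeg and is case-insensitive.
    image_format = str(images.get("image_format", "jpeg")).lower()
    if image_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported image_format: {image_format}")

    compress = bool(images.get("compress_images", False))
    resolved = {
        "image_fetch": True,
        "image_format": image_format,
        "compress_images": compress,
        "always_convert_images": bool(images.get("always_convert_images", False)),
    }
    if compress:
        # max_image_size only matters when compression is on.
        resolved["max_image_size"] = int(images.get("max_image_size", DEFAULT_MAX_IMAGE_SIZE))
    return resolved


example = json.loads(
    '{"images": {"image_fetch": true, "image_format": "PNG", "compress_images": true}}'
)
print(resolve_image_options(example))
```

Running it against your own leech.json (loaded with json.load) is a quick way to see which keys would actually take effect under these rules.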
If you want to just download a one-off story from a site, you can create a definition file to describe it. This requires investigation and understanding of things like CSS selectors, which may take some trial and error.
Example practical.json:
{
    "url": "https://practicalguidetoevil.wordpress.com/table-of-contents/",
    "title": "A Practical Guide To Evil: Book 1",
    "author": "erraticerrata",
    "chapter_selector": "#main .entry-content > ul:nth-of-type(1) > li > a",
    "content_selector": "#main .entry-content",
    "filter_selector": ".sharedaddy, .wpcnt, style",
    "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
}
Run as:
$ ./leech.py practical.json
This tells leech to load url, follow the links described by chapter_selector, extract the content from those pages as described by content_selector, and remove any content from that which matches filter_selector. Optionally, cover_url will replace the default cover with the image of your choice.
If chapter_selector isn't given, it'll create a single-chapter book by applying content_selector to url.
This is a fairly viable way to extract a story from, say, a random WordPress installation with a convenient table of contents. It's relatively likely to get you at least most of the way to the ebook you want, with maybe some manual editing needed.
A more advanced example with JSON would be:
{
    "url": "https://practicalguidetoevil.wordpress.com/2015/03/25/prologue/",
    "title": "A Practical Guide To Evil: Book 1",
    "author": "erraticerrata",
    "content_selector": "#main .entry-wrapper",
    "content_title_selector": "h1.entry-title",
    "content_text_selector": ".entry-content",
    "filter_selector": ".sharedaddy, .wpcnt, style",
    "next_selector": "a[rel=\"next\"]:not([href*=\"prologue\"])",
    "cover_url": "https://gitlab.com/Mikescher2/A-Practical-Guide-To-Evil-Lyx/raw/master/APGTE_1/APGTE_front.png"
}
Because there's no chapter_selector here, leech will keep on looking for a link which it can find with next_selector and following that link. We also see more advanced metadata acquisition here, with content_title_selector and content_text_selector being used to find specific elements from within the content.
If multiple matches for content_selector are found, leech will assume multiple chapters are present on one page, and will handle that. If you find a story that you want on a site which has all the chapters in the right order and next-page links, this is a notably efficient way to download it. See examples/dungeonkeeperami.json for this being used.
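To make the three ways of finding chapters concrete, here is a small sketch that classifies a definition file by the selectors it contains. It is an illustration only: crawl_mode is a hypothetical helper, and both the required-key check and the choice to let chapter_selector win when next_selector is also present are assumptions, not something stated above.

```python
import json


def crawl_mode(definition: dict) -> str:
    """Classify a definition file by how chapters would be discovered (hypothetical helper)."""
    # Assumption: url and content_selector are needed in every mode.
    for key in ("url", "content_selector"):
        if key not in definition:
            raise ValueError(f"definition is missing required key: {key}")

    if "chapter_selector" in definition:
        # Links on the url page point at the individual chapters.
        return "table-of-contents"
    if "next_selector" in definition:
        # Start at url and keep following the matching "next" link.
        return "follow-next-links"
    # Neither selector: content_selector is applied to url itself.
    return "single-chapter"


definition = json.loads("""
{
    "url": "https://practicalguidetoevil.wordpress.com/2015/03/25/prologue/",
    "content_selector": "#main .entry-wrapper",
    "next_selector": "a[rel=\\"next\\"]:not([href*=\\"prologue\\"])"
}
""")
print(crawl_mode(definition))
```

Checking a definition this way before running leech can catch a missing selector early, though it says nothing about whether the selectors actually match the page.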
If you need more advanced behavior, consider looking at...
To add support for a new site, create a file in the sites directory that implements the Site interface. Take a look at ao3.py for a minimal example of what you have to do.
Leech creates EPUB 2.0.1 files, which means that Leech can only save images in the following formats: JPEG, PNG, and GIF.
See the Open Publication Structure (OPS) 2.0.1 for more information.
Leech cannot save images in SVG because it is not supported by Pillow.
Leech uses Pillow for image manipulation and conversion. If you want to use a different image format, you can install the required dependencies for Pillow and you will probably have to tinker with Leech. See the Pillow documentation for more information.
To configure image support, you will need to create a file called leech.json. See the configuration section above for more information.
You can build the project's Docker container like this:
docker build . -t kemayo/leech:snapshot
The container's entrypoint runs leech directly and sets the current working directory to /work, so you can mount any directory there:
docker run -it --rm -v ${DIR}:/work kemayo/leech:snapshot download [[URL]]
If you submit a pull request to add support for another reasonably-general-purpose site, I will nigh-certainly accept it.
Run EpubCheck on epubs you generate to make sure they're not breaking.
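If you want a quick pre-check before running EpubCheck, the sketch below looks only at the container basics: an EPUB is a ZIP archive whose first entry is an uncompressed file named mimetype containing application/epub+zip, plus a META-INF/container.xml. The function name quick_epub_check is made up for this example, and it is in no way a substitute for EpubCheck, which validates far more.

```python
import zipfile


def quick_epub_check(path: str) -> list[str]:
    """Return a list of container-level problems found in an EPUB (hypothetical helper)."""
    if not zipfile.is_zipfile(path):
        return ["not a ZIP archive"]

    problems = []
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        # The mimetype entry must come first, be stored uncompressed,
        # and contain exactly the EPUB media type.
        if not names or names[0] != "mimetype":
            problems.append("first entry is not 'mimetype'")
        elif archive.read("mimetype") != b"application/epub+zip":
            problems.append("mimetype entry has unexpected content")
        elif archive.getinfo("mimetype").compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype entry is compressed")

        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")

        # testzip() returns the name of the first corrupt member, or None.
        if archive.testzip() is not None:
            problems.append("archive contains a corrupt member")
    return problems
```

An empty list means only that the container looks sane; run EpubCheck afterwards for the real verdict.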