kemayo / leech

Turn a story on certain websites into an ebook for convenient reading
MIT License

Fix for Issue #2 #89

Closed Jemeni11 closed 5 months ago

Jemeni11 commented 1 year ago

EDIT: Fix for Issue #2. Here's a snippet of the new README:


Images support

Leech creates EPUB 2.0.1 files, which means that Leech can only save images in the following formats: JPEG, PNG, and GIF.

See the Open Publication Structure (OPS) 2.0.1 for more information.

Leech cannot save images as SVG because Pillow does not support that format.

Leech uses Pillow for image manipulation and conversion. If you want to use a different image format, you can install the required dependencies for Pillow and you will probably have to tinker with Leech. See the Pillow documentation for more information.

By default, Leech will try to save all non-animated images as JPEG. The only animated images that Leech will save are GIFs.

To configure image support, you will need to create a file called leech.json. See the section below for more information.

Configuration

A very small amount of configuration is possible by creating a file called leech.json in the project directory. Currently you can define login information for sites that support it, and some options for book covers.

Example:

{
    "logins": {
        "QuestionableQuesting": ["username", "password"]
    },
    "images": true,
    "image_format": "png",
    "compress_images": true,
    "max_image_size": 100000,
    "cover": {
        "fontname": "Comic Sans MS",
        "fontsize": 30,
        "bgcolor": [20, 120, 20],
        "textcolor": [180, 20, 180],
        "cover_url": "https://website.com/image.png"
    },
    "output_dir": "/tmp/ebooks",
    "site_options": {
        "RoyalRoad": {
            "output_dir": "/tmp/litrpg_isekai_trash"
        }
    }
}

Note: The images key is a boolean and can only be true or false. Booleans in JSON are written in lowercase. If it is false, Leech will not download any images. Leech will also ignore the image_format key if images is false.

Note: If the image_format key does not exist, Leech will default to jpeg. The three image formats are jpeg, png, and gif. The image_format key is case-insensitive.

Note: The compress_images key tells Leech to compress images. This is only supported for jpeg and png images. This also goes hand-in-hand with the max_image_size key. If the compress_images key is true but there's no max_image_size key, Leech will compress the image to a size less than 1MB (1000000 bytes). If the max_image_size key is present, Leech will compress the image to a size less than the value of the max_image_size key. The max_image_size key is in bytes. If compress_images is false, Leech will ignore the max_image_size key.
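The defaults described in the notes above can be sketched as a small resolver. This is only an illustration, not Leech's actual code: the function name resolve_image_options is hypothetical, and two behaviours are assumptions not stated in the README (a missing images key is treated as false, and an unrecognised image_format falls back to jpeg).

```python
import json

DEFAULT_MAX_IMAGE_SIZE = 1_000_000  # bytes; the 1MB fallback described in the notes
SUPPORTED_FORMATS = {"jpeg", "png", "gif"}


def resolve_image_options(config: dict) -> dict:
    """Hypothetical helper: apply the documented defaults to a parsed leech.json."""
    # Assumption: a missing "images" key behaves like false.
    if not config.get("images", False):
        # image_format (and everything else image-related) is ignored.
        return {"images": False}

    # image_format is case-insensitive and defaults to jpeg.
    image_format = str(config.get("image_format", "jpeg")).lower()
    if image_format not in SUPPORTED_FORMATS:
        image_format = "jpeg"  # assumption: unknown formats fall back to the default

    compress = bool(config.get("compress_images", False))
    options = {
        "images": True,
        "image_format": image_format,
        "compress_images": compress,
    }
    # max_image_size only matters when compression is enabled.
    if compress:
        options["max_image_size"] = int(
            config.get("max_image_size", DEFAULT_MAX_IMAGE_SIZE)
        )
    return options


example = json.loads('{"images": true, "image_format": "PNG", "compress_images": true}')
print(resolve_image_options(example))
```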

Warning: Compressing images might make Leech take a lot longer to download images.

Warning: Compressing images might make the image quality worse.

Warning: max_image_size is not a hard limit. Leech will try to compress the image to the size of the max_image_size key, but Leech might not be able to compress the image to the exact size of the max_image_size key.

Warning: max_image_size should not be too small. For instance, if you set max_image_size to 1000, Leech will probably not be able to compress the image to 1000 bytes. If you set max_image_size to 1000000, Leech will probably be able to compress the image to 1000000 bytes.

Warning: Leech will not compress GIFs, because that might damage the animation.
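To illustrate why max_image_size is a best-effort target rather than a hard limit, here is a minimal sketch of a quality-stepping loop. It is an assumption about how such compression could work, not Leech's implementation. The names compress_to_limit and fake_encode are hypothetical, and the encoder is a stand-in: with Pillow it would presumably wrap Image.save with a quality argument.

```python
from typing import Callable


def compress_to_limit(
    encode: Callable[[int], bytes],
    max_size: int,
    start_quality: int = 90,
    min_quality: int = 10,
    step: int = 10,
) -> bytes:
    """Re-encode at decreasing quality until the result fits or quality bottoms out."""
    quality = start_quality
    data = encode(quality)
    # Keep lowering quality while the output is too big and there is room to go lower.
    while len(data) > max_size and quality - step >= min_quality:
        quality -= step
        data = encode(quality)
    # Best effort: the result may still exceed max_size if min_quality was reached.
    return data


def fake_encode(quality: int) -> bytes:
    # Stand-in encoder for demonstration: 100 bytes of output per quality point.
    return b"x" * (quality * 100)


fits = compress_to_limit(fake_encode, max_size=5000)
too_small = compress_to_limit(fake_encode, max_size=500)
print(len(fits), len(too_small))
```

With the stand-in encoder, a 5000-byte target is reachable, while a 500-byte target is not: the loop stops at the lowest quality and returns an image that is still over the limit, which is the situation the warnings describe.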


Old:

Partial Fix for Issue #2

Thanks to @IdanDor for this pull request.

Specifically, added image_selector for arbitrary sites that allows selecting img tags from chapters, downloading them and embedding them within the resulting epub. In the case of Pale, this means that the character banners and extra materials do not require an internet connection to view. Also made the two pale.json's more consistent (pale.json now correctly includes the title of the chapters). https://github.com/kemayo/leech/pull/84#issue-1436128961

This doesn't work for other sites (like fiction.live) so I did this:

else:
  soup = BeautifulSoup(chapter.contents, 'html5lib')
  for count, img in enumerate(soup.find_all('img')):
    img_contents = get_image_from_url(img['src']).read()
    chapter.images.append(Image(
      path=f"images/ch{i}_leechimage_{count}.png",
      contents=img_contents,
      content_type='image/png'
    ))
    img['src'] = f"../images/ch{i}_leechimage_{count}.png"
    if not img.has_attr('alt'):
      img['alt'] = f"Image {count} from chapter {i}"   

It builds on @IdanDor's code as well, since it adds every image it can find to the chapter.images list:

# Add all pictures on this chapter as well.
for image in chapter.images:
  # For/else syntax: check whether the image path already exists; if it doesn't, add the image.
  # Duplicates are not allowed in the format.
  for other_file in chapters:
    if other_file.path == image.path:
      break
  else:
    chapters.append(EpubFile(path=image.path, contents=image.contents, filetype=image.content_type))
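The same duplicate check can also be written with a set of already-used paths, which avoids rescanning the file list for every image. This is a sketch under assumptions: the EpubFile class below is a simplified stand-in with only the three fields used in the snippet above, and add_unique is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class EpubFile:
    # Simplified stand-in; only the fields used in the snippet above.
    path: str
    contents: bytes
    filetype: str


def add_unique(files: list, candidates: list) -> None:
    """Append each candidate whose path is not already present in files."""
    seen = {f.path for f in files}
    for candidate in candidates:
        if candidate.path not in seen:
            files.append(candidate)
            seen.add(candidate.path)


files = [EpubFile("images/a.png", b"1", "image/png")]
add_unique(files, [
    EpubFile("images/a.png", b"2", "image/png"),  # duplicate path, skipped
    EpubFile("images/b.png", b"3", "image/png"),
    EpubFile("images/b.png", b"4", "image/png"),  # duplicate path, skipped
])
print([f.path for f in files])
```

As in the for/else version, the first file registered under a path wins and later ones with the same path are dropped.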

I only tested this with stories from fiction.live, but they've all worked fine. I also ran the resulting epubs through epubcheck, and there were no fatal errors, only minor ones.

Just like you wrote in the linked issue, I thought it should be something one can somehow disable. And the selector simply matches in my mind what the codebase does with every other "choice". https://github.com/kemayo/leech/pull/84#issuecomment-1318061676

I would not even know where to start with making images an option, which is why I called this a partial fix.

Jemeni11 commented 1 year ago

Ah, there's a problem with this. The PNG format is huge, so in a story with many images you can end up with a massive epub file. So maybe some image compression is needed as well? And conversion to jpg/jpeg, which is a lot smaller?

EDIT: No really, I accidentally downloaded a story that was 1.5 GB in size so be careful :laughing:

Jemeni11 commented 1 year ago

Turns out on fiction.live, you can have an empty image tag. Just <img /> no src. Crazy!
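One way to guard against such tags is to collect only the img elements that actually carry a non-empty src before trying to download anything. The sketch below is an illustration, not the code in this PR: it uses the standard-library html.parser instead of BeautifulSoup to stay dependency-free, and the names ImgSrcCollector and image_sources are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag that has a non-empty src attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skips a missing src, src="" and a bare valueless src
            self.sources.append(src)


def image_sources(html: str) -> list:
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


sample = '<p><img /><img src="a.png"><img src="" /><img alt="x" src="b.jpg"/></p>'
print(image_sources(sample))
```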

TheMetalCenter commented 1 year ago

I tested this out on The Wandering Inn (https://wanderinginn.com/table-of-contents/), which has images in, for example, the cover page, title page, and chapter 1.02, but it fails to detect all but one of the images. I imagine this has to do with how images are embedded in the HTML on this WordPress site, but I'm still parsing it out.

Edit: Ah, the issue was that my JSON filter selector was preventing them from being read. All of the pictures are detected now, but most fail to load for some reason. It may be an issue with my ebook viewer, however (Calibre).

Edit2: Confirmed the images show up on my Kindle, so it's a Calibre issue that they are broken in their e-book viewer. Thank you and @IdanDor for your work on adding this feature!

Jemeni11 commented 1 year ago

There's this weird image-hosting site called filepicker.io that's causing problems when you try to download from it. This new commit should fix it. The fix: https://github.com/JimmXinu/FanFicFare/issues/933#issuecomment-1483848726

Jemeni11 commented 1 year ago

These new updates work for me, but I only tested them on one site (fiction.live).

Jemeni11 commented 1 year ago

This code doesn't download images in xenforo spoilers yet. This will be fixed soon.

EDIT: These xenforo spoiler images are weird. The images get downloaded twice for some reason.