plateaukao / einkbro

A small, fast web browser based on Android WebView. It's tailored for E-Ink devices but also works great on normal android devices.

Epubs are often missing images #344

Closed uqs closed 3 months ago

uqs commented 5 months ago

What device and app version are you using

Describe the bug: Not all images from a webpage make it into the epub; they are simply left out of the zipfile, yet are still referenced in the HTML.

To Reproduce Happens on many sites, e.g. https://historyforatheists.com/2020/07/the-great-myths-9-hypatia-of-alexandria/ For me, the resulting epub/zip has the first 3 images, but all others are 0 bytes!

Inspecting the webpage HTML, I don't see a clue there, they are all referencing JPEGs, although wordpress blows up the tag quite a bit.

For other sites, only the first, or first two jpgs are non-zero in the epub. Sometimes, six of them make it through! Is there a size limit? Does an error on the first image result in all subsequent images no longer getting downloaded?

Is there a way to inspect logs?

Expected behavior I expect no zero-byte img_1_1.jpg files in the epub.

plateaukao commented 5 months ago

I'll check why this happens for this site. If possible, please provide more URLs that also misbehave, and describe which images are missing.

uqs commented 5 months ago

I had a look at EpubManager.kt, and this part here smells suspect:

        doc.select("img").forEachIndexed { index, element ->
            val imgUrl = element.attributes()["src"] ?: element.dataset()["src"] ?: ""
            val extension = if (imgUrl.endsWith("png")) "png" else "jpg"
            val newImageIndex = "img_${chapterIndex}_$index.$extension"
            element.attr("src", newImageIndex)
            imageKeyUrlMap[newImageIndex] = imgUrl
        }

I don't know Kotlin, but if endsWith() is just doing a string search, then this will get the extension wrong sometimes. It defaults to JPG though, so I guess it'll be mostly fine.

For the original article, we have img src tags like so (note how they "endWith"):

% curl -so- https://historyforatheists.com/2020/07/the-great-myths-9-hypatia-of-alexandria/ | grep -o "<img[^>]*src=[^>]*" | grep -o "src=.[^\"']*"
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/Hypatia.jpg?resize=769%2C295&amp;ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/800px-Hypatia_by_Julius_Kronberg_1889-1.jpg?resize=640%2C998&#038;ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/AAA-Hypatia-fables-1.jpg?resize=604%2C520&#038;ssl=1
src="https://i1.wp.com/historyforatheists.com/wp-content/uploads/2020/06/Hypatia-Teaching-Alexandria-watercolour-paper-Robert-Trewick.jpg?fit=640%2C411&amp;ssl=1
src="//ir-na.amazon-adsystem.com/e/ir?t=cladesvariana-20&amp;l=am2&amp;o=1&amp;a=0190210036
src="//ir-na.amazon-adsystem.com/e/ir?t=cladesvariana-20&amp;l=am2&amp;o=1&amp;a=B015R4LFGQ
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/PLato-1.jpg?fit=640%2C646&amp;ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/Neoplatonism-1.jpg?resize=640%2C427&#038;ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2017/07/agora-christians-destroying-statues.jpg?resize=640%2C413&#038;ssl=1
src="https://i1.wp.com/historyforatheists.com/wp-content/uploads/2020/07/Hypatia3-1.jpeg?fit=640%2C379&amp;ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/07/Hypatia4.jpg?resize=520%2C361&#038;ssl=1
src="//ir-na.amazon-adsystem.com/e/ir?t=cladesvariana-20&amp;l=am2&amp;o=1&amp;a=0674437764
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/07/Hypatia5.jpg?resize=365%2C599&#038;ssl=1
src='https://secure.gravatar.com/avatar/a0bf531dbcadcb52e27d292574165111?s=50&#038;d=mm&#038;r=g
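
To illustrate the endsWith() concern with the URLs above, here is a minimal sketch that looks at the extension of the URL path only, ignoring any query string or fragment. `extensionFromUrl` is a hypothetical helper made up for this example, not a function in EpubManager.kt, and it keeps the existing "default to jpg" behaviour as an assumption.

```kotlin
// Hypothetical helper (not part of EinkBro): pick the image extension from the
// URL path, so that "?resize=640%2C998&ssl=1" or "#fragment" cannot hide it.
fun extensionFromUrl(imgUrl: String, default: String = "jpg"): String {
    // Drop the fragment first, then the query string.
    val path = imgUrl.substringBefore('#').substringBefore('?')
    // Take the text after the last '.' of the last path segment ("" if none).
    val ext = path.substringAfterLast('/').substringAfterLast('.', "").lowercase()
    return when (ext) {
        "png" -> "png"
        "jpg", "jpeg" -> "jpg"
        else -> default   // assumption: keep the current fallback to jpg
    }
}
```

With this, a URL like `.../Hypatia.jpg?resize=769%2C295&ssl=1` maps to `jpg` because of its path rather than by falling through to the default, and a `.png?fit=...` URL would map to `png` instead of being mislabelled.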

And I did notice that other zero bytes "images" in some of the chapters correspond to these img tags in their article:

<img decoding="async" src="https://ir-na.amazon-adsystem.com/e/ir?t=slastacod-20&#038;l=as2&#038;o=1&#038;a=0307455777" width="1" height="1" border="0" ...

So these are ad tracking pixels, and they currently no longer load. That means: a) zero-byte images are OK at times ... b) timeouts totally make sense, and they might fully explain issue #316 if a page has many such dead links.

The extension can probably better be inferred by looking at the first few bytes of the content, to see if they start with JFIF or whatever.
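
A rough sketch of what such content sniffing could look like, assuming the downloaded bytes are available as a ByteArray. `sniffImageExtension` is a hypothetical name, not existing EinkBro code. Note that JPEG files start with the bytes FF D8 FF (the "JFIF" text only appears a few bytes later), and PNG files start with a fixed 8-byte signature.

```kotlin
// Hypothetical helper (not part of EinkBro): guess the image type from the
// first bytes of the downloaded content. Returns null if nothing matches,
// which also covers an empty (zero-byte) download.
fun sniffImageExtension(bytes: ByteArray): String? {
    // True if the content begins with the given byte values (0..255).
    fun startsWith(vararg sig: Int): Boolean =
        bytes.size >= sig.size &&
            sig.indices.all { (bytes[it].toInt() and 0xFF) == sig[it] }

    return when {
        startsWith(0xFF, 0xD8, 0xFF) -> "jpg"                               // JPEG start-of-image marker
        startsWith(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A) -> "png" // PNG signature
        startsWith(0x47, 0x49, 0x46, 0x38) -> "gif"                         // "GIF8"
        else -> null
    }
}
```

A caller could fall back to the URL-based guess whenever this returns null.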

But that still doesn't explain why only 3 or so of the real images made it through.

What's the most straightforward way to build and run the app? I'd like to sprinkle some info logging in there, thinking that I can probably observe that with adb logcat?

plateaukao commented 5 months ago

Clone the repo, install Android Studio, open the folder in Android Studio as a project, and click Run. 😁

plateaukao commented 5 months ago

@uqs Before saving an epub, EinkBro first converts the webpage into reader mode, so that most unwanted elements are pruned. When I tested the link you provided, the header image was gone in reader mode. I guess that's because a header element follows it, so everything above that element is cleaned away by Reader mode. And that's why you can't see the first image.

If all the pages you tried to save as epubs are from the same site, that may explain why the first or second image is gone. It's not about how images are fetched; it's about how web pages are turned into Reader mode.

That's why I need you to provide more malfunctioning URLs, to make sure they all share the same root cause.

uqs commented 5 months ago

To be clear, it's not the header image that I worry about. But since you mention Reader mode, I did some more testing. As the resulting epub looks mostly the same, I thought that Reader mode was implied when saving an epub, but I just found that it indeed results in a different conversion?

I first converted the article while in Reader mode to a new epub, then I added the same article again, while not in Reader mode. That results in mostly identical HTML, but it results in more missing images for the non-reader mode! To be clear, in the browser, it makes no difference and all images are where they should be (about 10-11 big images or so)

Resulting files in the zipfile:

% unzip -l hypatia_reader_mode.epub | sort -k4,4
  1293678                     30 files
---------                     -------
Archive:  hypatia_reader_mode.epub
  Length      Date    Time    Name
---------  ---------- -----   ----
      230  2024-02-21 19:39   META-INF/container.xml
    89804  2024-02-21 19:39   OEBPS/chapter1.html
    89900  2024-02-21 19:39   OEBPS/chapter2.html
     2835  2024-02-21 19:39   OEBPS/content.opf
   165778  2024-02-21 19:39   OEBPS/img_1_0.jpg
    54382  2024-02-21 19:39   OEBPS/img_1_1.jpg
        0  2024-02-21 19:39   OEBPS/img_1_10.jpg
    88463  2024-02-21 19:39   OEBPS/img_1_11.jpg
    76407  2024-02-21 19:39   OEBPS/img_1_2.jpg
        0  2024-02-21 19:39   OEBPS/img_1_3.jpg
        0  2024-02-21 19:39   OEBPS/img_1_4.jpg
   187603  2024-02-21 19:39   OEBPS/img_1_5.jpg
    38223  2024-02-21 19:39   OEBPS/img_1_6.jpg
    68220  2024-02-21 19:39   OEBPS/img_1_7.jpg
    84266  2024-02-21 19:39   OEBPS/img_1_8.jpg
    49932  2024-02-21 19:39   OEBPS/img_1_9.jpg
   165778  2024-02-21 19:39   OEBPS/img_2_0.jpg
    54382  2024-02-21 19:39   OEBPS/img_2_1.jpg
        0  2024-02-21 19:39   OEBPS/img_2_10.jpg
        0  2024-02-21 19:39   OEBPS/img_2_11.jpg
    76407  2024-02-21 19:39   OEBPS/img_2_2.jpg
        0  2024-02-21 19:39   OEBPS/img_2_3.jpg
        0  2024-02-21 19:39   OEBPS/img_2_4.jpg
        0  2024-02-21 19:39   OEBPS/img_2_5.jpg
        0  2024-02-21 19:39   OEBPS/img_2_6.jpg
        0  2024-02-21 19:39   OEBPS/img_2_7.jpg
        0  2024-02-21 19:39   OEBPS/img_2_8.jpg
        0  2024-02-21 19:39   OEBPS/img_2_9.jpg
     1048  2024-02-21 19:39   OEBPS/toc.ncx
       20  2024-02-21 19:39   mimetype

Note how the second chapter (saved while not in Reader mode) has only the first three images (img_2_0 through img_2_2) in this case, and zero bytes for all others. For the first chapter, some of them are missing too, but far fewer.
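
For anyone who wants to run the same check without unzip, here is a small sketch that lists the zero-byte entries of an epub using the JDK's java.util.zip. `zeroByteEntries` is a hypothetical helper written for this comment, not something that exists in EinkBro.

```kotlin
import java.io.File
import java.util.zip.ZipFile

// Hypothetical helper: names of all zero-length, non-directory entries
// in an epub (which is just a zip file), sorted alphabetically.
fun zeroByteEntries(epub: File): List<String> =
    ZipFile(epub).use { zip ->
        zip.entries().asSequence()
            .filter { !it.isDirectory && it.size == 0L }
            .map { it.name }
            .sorted()
            .toList()   // materialize before the ZipFile is closed
    }
```

Calling `zeroByteEntries(File("hypatia_reader_mode.epub"))` on the file above should report the same 0-length img_*.jpg names that show up in the listing.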

Is this reproducible on your end with this one webpage? What does your epub/zip look like in the end?

As for the diff in the HTML? The Reader mode does not retain the original link (which is really a bummer to lose):

Reader mode epub: 54 - 68 minutes
Normal mode epub: 54 - 68 minutes | original link

I would really hate to have to give up that "original link". That's the only relevant diff for those 2 modes, the other changes are only linking to img_2_1.jpg instead of img_1_1.jpg of course.

At least I have Android Studio and the app up and running but cannot repro this in the emulator (yet). Now off to learn me some Kotlin.

plateaukao commented 5 months ago

I'll check the epub I created next weekend. I have some days off this week.

uqs commented 4 months ago

Any comment on PR #346 ? The failing check seems unrelated ...

plateaukao commented 4 months ago

Still on vacation. Please wait.

uqs commented 3 months ago

After all the recent fixes, I can no longer reproduce this.