Closed uqs closed 3 months ago
I'll check why for this site. If possible, please provide more urls that behave wrong too, and describe what images are gone.
I had a look at EpubManager.kt, and this part here smells suspect:
doc.select("img").forEachIndexed { index, element ->
    val imgUrl = element.attributes()["src"] ?: element.dataset()["src"] ?: ""
    val extension = if (imgUrl.endsWith("png")) "png" else "jpg"
    val newImageIndex = "img_${chapterIndex}_$index.$extension"
    element.attr("src", newImageIndex)
    imageKeyUrlMap[newImageIndex] = imgUrl
}
I don't know Kotlin, but if endsWith() is just doing a string search, then this will get the extension wrong sometimes. It defaults to JPG though, so I guess it'll be mostly fine.
For the original article, the img src attributes look like this (note what they actually end with: a query string, not the file extension):
% curl -so- https://historyforatheists.com/2020/07/the-great-myths-9-hypatia-of-alexandria/ | grep -o "<img[^>]*src=[^>]*" | grep -o "src=.[^\"']*"
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/Hypatia.jpg?resize=769%2C295&ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/800px-Hypatia_by_Julius_Kronberg_1889-1.jpg?resize=640%2C998&ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/AAA-Hypatia-fables-1.jpg?resize=604%2C520&ssl=1
src="https://i1.wp.com/historyforatheists.com/wp-content/uploads/2020/06/Hypatia-Teaching-Alexandria-watercolour-paper-Robert-Trewick.jpg?fit=640%2C411&ssl=1
src="//ir-na.amazon-adsystem.com/e/ir?t=cladesvariana-20&l=am2&o=1&a=0190210036
src="//ir-na.amazon-adsystem.com/e/ir?t=cladesvariana-20&l=am2&o=1&a=B015R4LFGQ
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/PLato-1.jpg?fit=640%2C646&ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/06/Neoplatonism-1.jpg?resize=640%2C427&ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2017/07/agora-christians-destroying-statues.jpg?resize=640%2C413&ssl=1
src="https://i1.wp.com/historyforatheists.com/wp-content/uploads/2020/07/Hypatia3-1.jpeg?fit=640%2C379&ssl=1
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/07/Hypatia4.jpg?resize=520%2C361&ssl=1
src="//ir-na.amazon-adsystem.com/e/ir?t=cladesvariana-20&l=am2&o=1&a=0674437764
src="https://i0.wp.com/historyforatheists.com/wp-content/uploads/2020/07/Hypatia5.jpg?resize=365%2C599&ssl=1
src='https://secure.gravatar.com/avatar/a0bf531dbcadcb52e27d292574165111?s=50&d=mm&r=g
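One way around the endsWith() problem, as a minimal sketch: cut off the query string and fragment before looking at the file name. The helper name extensionFromUrl is made up for illustration and is not part of EinkBro. It only does plain string handling, so it would be fooled by, say, a URL with no path at all.

```kotlin
// Hypothetical helper (not from EinkBro): derive the file extension from a
// URL while ignoring any "?query" or "#fragment" suffix.
fun extensionFromUrl(url: String): String? {
    // Drop the fragment first, then the query string.
    val path = url.substringBefore('#').substringBefore('?')
    // Keep only the last path segment, e.g. "Hypatia3-1.jpeg".
    val fileName = path.substringAfterLast('/')
    // Empty string when the file name contains no dot at all.
    val extension = fileName.substringAfterLast('.', "").lowercase()
    return extension.ifEmpty { null }
}
```

With the URLs listed above, the Hypatia3-1.jpeg link yields "jpeg", while the amazon-adsystem and gravatar links yield null, so the caller can decide on a fallback instead of silently assuming "jpg".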
I also noticed that the other zero-byte "images" in some of the chapters correspond to img tags like this one in their articles:
<img decoding="async" src="https://ir-na.amazon-adsystem.com/e/ir?t=slastacod-20&l=as2&o=1&a=0307455777" width="1" height="1" border="0" ...
So these are ad-tracking pixels, and they currently no longer load. That means: a) zero-byte images are OK at times, and b) timeouts totally make sense; they might fully explain issue #316 if there are many such dead links in the page.
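To illustrate the dead-link concern, here is a minimal sketch of fetching every image independently with a short timeout, so that one dead link cannot hold up or abort the rest. This is not how EinkBro does it (I have not verified that); fetchWithTimeout, downloadAll and the 5-second default are assumptions made up for the example.

```kotlin
import java.net.HttpURLConnection
import java.net.URI

// Hypothetical fetcher: plain HTTP(S) GET with connect and read timeouts.
fun fetchWithTimeout(url: String, timeoutMs: Int = 5_000): ByteArray {
    val connection = URI.create(url).toURL().openConnection() as HttpURLConnection
    connection.connectTimeout = timeoutMs
    connection.readTimeout = timeoutMs
    return try {
        connection.inputStream.use { it.readBytes() }
    } finally {
        connection.disconnect()
    }
}

// Downloads each URL on its own. A failure (timeout, dead link, bad URL)
// yields an empty byte array for that key only and does not stop the loop.
fun downloadAll(
    keyToUrl: Map<String, String>,
    fetch: (String) -> ByteArray = { fetchWithTimeout(it) },
): Map<String, ByteArray> =
    keyToUrl.mapValues { (_, url) ->
        runCatching { fetch(url) }.getOrDefault(ByteArray(0))
    }
```

The fetch parameter is only there so the loop can be exercised without a network. Note that protocol-relative URLs such as //ir-na.amazon-adsystem.com/... would need a scheme prepended before they can be fetched at all; in this sketch they simply end up as empty entries.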
The extension can probably better be inferred by looking at the first few bytes of the content, to see if they start with JFIF or whatever.
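For what it's worth, a JPEG file always starts with the bytes FF D8 FF (the "JFIF" marker only shows up a few bytes later, and not in every JPEG), and a PNG starts with the fixed 8-byte signature 89 50 4E 47 0D 0A 1A 0A. A minimal sketch of sniffing the type from the downloaded bytes, where sniffImageExtension is a hypothetical helper name rather than anything from EinkBro:

```kotlin
// Hypothetical helper (not from EinkBro): guess the image type from the
// leading "magic" bytes of the downloaded content.
fun sniffImageExtension(bytes: ByteArray): String? {
    // True when the content begins with the given unsigned byte values.
    fun startsWith(vararg signature: Int): Boolean =
        bytes.size >= signature.size &&
            signature.indices.all { (bytes[it].toInt() and 0xFF) == signature[it] }

    return when {
        startsWith(0xFF, 0xD8, 0xFF) -> "jpg"                               // JPEG start-of-image
        startsWith(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A) -> "png" // PNG signature
        startsWith(0x47, 0x49, 0x46, 0x38) -> "gif"                         // "GIF8"
        else -> null                                                        // unknown or empty
    }
}
```

An empty download returns null here, which also makes the zero-byte tracking pixels easy to spot and skip.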
But that still doesn't explain why only 3 or so of the real images made it through.
What's the most straightforward way to build and run the app? I'd like to sprinkle some info logging in there, thinking that I can probably observe that with adb logcat?
Clone the repo, install Android Studio, open the folder in Android Studio as a project, and click Run. 😁
@uqs Before saving an epub, EinkBro first converts the web page into Reader mode, so that most unwanted elements are pruned. While testing the link you provided, I saw that the header image is gone in Reader mode. I guess that's because there's a header element right after it, so everything above it is cleaned up by Reader mode. That's why you can't see the first image.
If all the pages you tried to save as epubs are from the same site, that may explain why the first or second images are gone. It's not about how images are fetched; it's about how web pages are turned into Reader mode.
That's why I need you to provide more malfunctioning URLs, to make sure they all share the same root cause.
To be clear, it's not the header image that I worry about. But since you mention Reader mode, I did some more testing. As the resulting epub looks mostly the same, I thought that Reader mode was implied when saving an epub, but I just found that it indeed results in a different conversion?
I first converted the article while in Reader mode to a new epub, then I added the same article again, while not in Reader mode. That results in mostly identical HTML, but it results in more missing images for the non-reader mode! To be clear, in the browser, it makes no difference and all images are where they should be (about 10-11 big images or so)
Resulting files in the zipfile:
% unzip -l hypatia_reader_mode.epub | sort -k4,4
  1293678                     30 files
---------                     -------
Archive:  hypatia_reader_mode.epub
   Length        Date  Time   Name
---------  ---------- -----   ----
      230  2024-02-21 19:39   META-INF/container.xml
    89804  2024-02-21 19:39   OEBPS/chapter1.html
    89900  2024-02-21 19:39   OEBPS/chapter2.html
     2835  2024-02-21 19:39   OEBPS/content.opf
   165778  2024-02-21 19:39   OEBPS/img_1_0.jpg
    54382  2024-02-21 19:39   OEBPS/img_1_1.jpg
        0  2024-02-21 19:39   OEBPS/img_1_10.jpg
    88463  2024-02-21 19:39   OEBPS/img_1_11.jpg
    76407  2024-02-21 19:39   OEBPS/img_1_2.jpg
        0  2024-02-21 19:39   OEBPS/img_1_3.jpg
        0  2024-02-21 19:39   OEBPS/img_1_4.jpg
   187603  2024-02-21 19:39   OEBPS/img_1_5.jpg
    38223  2024-02-21 19:39   OEBPS/img_1_6.jpg
    68220  2024-02-21 19:39   OEBPS/img_1_7.jpg
    84266  2024-02-21 19:39   OEBPS/img_1_8.jpg
    49932  2024-02-21 19:39   OEBPS/img_1_9.jpg
   165778  2024-02-21 19:39   OEBPS/img_2_0.jpg
    54382  2024-02-21 19:39   OEBPS/img_2_1.jpg
        0  2024-02-21 19:39   OEBPS/img_2_10.jpg
        0  2024-02-21 19:39   OEBPS/img_2_11.jpg
    76407  2024-02-21 19:39   OEBPS/img_2_2.jpg
        0  2024-02-21 19:39   OEBPS/img_2_3.jpg
        0  2024-02-21 19:39   OEBPS/img_2_4.jpg
        0  2024-02-21 19:39   OEBPS/img_2_5.jpg
        0  2024-02-21 19:39   OEBPS/img_2_6.jpg
        0  2024-02-21 19:39   OEBPS/img_2_7.jpg
        0  2024-02-21 19:39   OEBPS/img_2_8.jpg
        0  2024-02-21 19:39   OEBPS/img_2_9.jpg
     1048  2024-02-21 19:39   OEBPS/toc.ncx
       20  2024-02-21 19:39   mimetype
Note how the second chapter has only images 0 through 2 in this case, and zero bytes for all the others. For the first chapter, some are missing too, but far fewer.
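For comparing results across pages, the same check as the unzip listing can be sketched in a few lines of Kotlin on the JVM. The function name zeroByteEntries is made up for this example; it relies only on java.util.zip from the standard library.

```kotlin
import java.io.File
import java.util.zip.ZipFile

// Hypothetical helper: names of all zero-byte files inside an epub (zip),
// sorted alphabetically.
fun zeroByteEntries(epub: File): List<String> =
    ZipFile(epub).use { zip ->
        zip.entries().asSequence()
            .filter { !it.isDirectory && it.size == 0L }
            .map { it.name }
            .sorted()
            .toList() // materialize before the zip file is closed
    }
```

Run against the epub listed above, this would be expected to report the same img_1_* and img_2_* entries that show a length of 0.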
Is this reproducible on your end with this one webpage? What does your epub/zip look like in the end?
As for the diff in the HTML? The Reader mode does not retain the original link (which is really a bummer to lose):
Reader mode epub: 54 - 68 minutes
Normal mode epub: 54 - 68 minutes | original link
I would really hate to have to give up that "original link". That's the only relevant diff for those 2 modes, the other changes are only linking to img_2_1.jpg instead of img_1_1.jpg of course.
At least I have Android Studio and the app up and running but cannot repro this in the emulator (yet). Now off to learn me some Kotlin.
I'll check the epub I created next weekend. I have some days off this week.
Any comment on PR #346 ? The failing check seems unrelated ...
Still on vacation. Please wait.
After all the recent fixes, I can no longer reproduce this.
What device and app version are you using
Describe the bug
Not all images from a webpage make it into the epub, they simply get omitted from the zipfile, but are referenced in the HTML.
To Reproduce
Happens on many sites, e.g. https://historyforatheists.com/2020/07/the-great-myths-9-hypatia-of-alexandria/
For me, the resulting epub/zip has the first 3 images, but all others are 0 bytes!
Inspecting the webpage HTML, I don't see a clue there: they are all referencing JPEGs, although WordPress blows up the <img> tag quite a bit.
For other sites, only the first, or first two jpgs are non-zero in the epub. Sometimes, six of them make it through! Is there a size limit? Does an error on the first image result in all subsequent images no longer getting downloaded?
Is there a way to inspect logs?
Expected behavior
I expect no zero-byte img_1_1.jpg files in the epub.