gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0
590 stars · 62 forks

When set to fully load the page, single-file downloads images as text #112

Open takingurstuff opened 1 month ago

takingurstuff commented 1 month ago

I was using an edited version of single-file-cli to download a site. I checked the downloaded pages and they have no images in them. The command I used was as follows:

./single-file --browser-server=http://localhost:9226 --urls-file=./urls.txt --output-directory=/home/mac-sv/XBX/dl5 --errors-file=/home/mac-sv/XBX/threads5/logs/errors.log --browser-wait-until=networkIdle  --browser-wait-delay=1000 --max-parallel-workers=1

This is what single-file downloaded: [Screenshot 2024-08-10 at 16 50 49]. The original page it is supposed to download looks like this: [Screenshot 2024-08-10 at 16 51 01].

I set the wait signal to networkIdle so that everything finishes loading, and I added the delay as a failsafe in case the images still somehow fail to load. Does this have something to do with the loadDeferredImages flag?

I changed it so that it scrolls after the page loads, in order to load some comments that only appear on scroll.

The edited version is available at this fork:

https://github.com/takingurstuff/single-file-cli

gildas-lormeau commented 1 month ago

Can you launch single-file with the option --browser-headless=false (or --browser-debug) and check in the Network tab of DevTools whether the images are downloaded or blocked? If they are blocked, can you tell me why? (CORS issue, 40x HTTP response...)
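As a rough complement to the DevTools check, the same two failure modes (a 40x response or a missing CORS header) can be probed from the affected machine with a short standard-library script. This is only a sketch under stated assumptions: the function names `classify` and `probe` and the labels they return are hypothetical, and a plain HTTP request does not reproduce the browser's cookies, referrer, or other request context, so the Network tab remains the authoritative source.

```python
import urllib.error
import urllib.request


def classify(status, headers):
    """Return a rough, hypothetical label for an image response."""
    if 400 <= status < 500:
        return "client-error"
    if status >= 500:
        return "server-error"
    # A missing header is only a hint: it matters when the image is read
    # through a cross-origin fetch, not for a plain <img> element.
    if "access-control-allow-origin" not in {name.lower() for name in headers}:
        return "ok-no-cors-header"
    return "ok"


def probe(url, timeout=10):
    """Fetch one image URL and classify the response (network access needed)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, dict(response.headers.items()))
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx; the status and headers are still available.
        return classify(error.code, dict(error.headers.items()))
```

Calling `probe(...)` with the URL of one of the missing images, once on the Linux machine and once on the Mac, would show whether the server answers the two machines differently.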

takingurstuff commented 1 month ago

I launched it in headful mode and the resources downloaded and rendered properly. This happened on a remote Ubuntu machine I was running on. I did a few tests on my local Mac, and it turns out that pages downloaded with Google Chrome as the browser had all the images. So I switched the browser from Chromium to Google Chrome on the Linux machine, but the same problem still occurred on Linux. There was no problem with Google Chrome on macOS.

takingurstuff commented 1 month ago

Here are images of the page sources for more context. Captured on Linux: [Screenshot 2024-08-13 at 15 46 42]. Captured on Mac: [Screenshot 2024-08-13 at 15 53 00].

Both were downloaded with the same version of Chrome in headful mode.

The issue with the images occurs on the highlighted lines.
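To compare the two saved files beyond screenshots, one option is to count how each `<img>` element's `src` is filled in. The sketch below assumes that a successfully captured image ends up as an embedded `data:` URI in the saved HTML; the class and function names and the three categories are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCounter(HTMLParser):
    """Tally <img> tags by the kind of value found in their src attribute."""

    def __init__(self):
        super().__init__()
        self.counts = {"embedded": 0, "remote": 0, "empty": 0}

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # A missing src or a valueless attribute both end up as "".
        src = (dict(attrs).get("src") or "").strip()
        if src.startswith("data:"):
            self.counts["embedded"] += 1
        elif src:
            self.counts["remote"] += 1
        else:
            self.counts["empty"] += 1


def count_img_sources(html_text):
    """Return the per-category counts for all <img> tags in html_text."""
    parser = ImgSrcCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts
```

Running `count_img_sources(open(path, encoding="utf-8").read())` on the Linux capture and on the Mac capture gives two dictionaries that can be compared directly. The numbers are only a rough signal, since a placeholder image stored as a `data:` URI would also be counted as embedded.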

takingurstuff commented 1 month ago

Just out of curiosity, which OS is single-file-cli tested on?

gildas-lormeau commented 1 month ago

That's a bit strange...

On my end, I do my tests mainly on macOS and Windows 11. I can also do some tests easily on Asahi Linux and WSL. I do not have an Ubuntu machine, but I can run tests in a VM. At one time or another, I tested single-file-cli on all these environments.

takingurstuff commented 4 weeks ago

Anyway, I found that the issue only happens when I use the browser server option (--browser-server) instead of letting single-file launch the browser internally.

What is the difference between connecting to an external Chrome and using an internally launched Chrome? I tested single-file with the external Chrome and the images were not downloaded, but the internally launched Chrome downloaded them absolutely fine.
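The thread does not settle this question. One low-cost starting point for comparing the two setups is to ask the externally launched browser how it identifies itself: a Chrome instance started with a remote debugging port serves a `/json/version` document with `Browser` and `User-Agent` fields. The sketch below is an assumption-laden illustration, not part of single-file-cli; the helper names are hypothetical, and the port 9226 is taken from the command earlier in this issue.

```python
import json
import urllib.request


def summarize_version(payload):
    """Pick out the fields that help tell two Chrome instances apart."""
    user_agent = payload.get("User-Agent", "")
    return {
        "browser": payload.get("Browser", ""),
        "user_agent": user_agent,
        # Headless Chrome typically advertises itself in the user agent.
        "headless": "HeadlessChrome" in user_agent,
    }


def fetch_version(base_url, timeout=5):
    """Query a running Chrome's debugging endpoint (needs a live browser)."""
    url = f"{base_url}/json/version"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_version(json.load(response))
```

For example, `fetch_version("http://localhost:9226")` could be run against the external browser and the result compared with what the internally launched browser reports. This endpoint does not list the command-line flags the browser was started with, so any difference in launch arguments would have to be checked separately.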