gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

Really Long web pages #85

Closed sanjeevApple closed 2 months ago

sanjeevApple commented 2 months ago

Describe the bug

What is the best way to handle really long web pages? To capture the entire page with all of its images, the page has to be scrolled to the bottom before it is captured. What is the recommended way to do this? Can it be done through user scripts? As a hack, I also tried --browser-height=35000, which forces the entire page to render and simulates a scroll to the bottom. This works for some pages, but for others it simply hangs for hours. For example:

./single-file http://www.apple.com/airpods-max/ airpods-max.html --browser-height=35000 --browser-wait-until="load" --browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"

This simply hangs. I traced the hang to the following call in getPageData (cdp-client.js):

const { result } = await Runtime.evaluate({
    expression: `singlefile.getPageData(${JSON.stringify(options)})`,
    awaitPromise: true,
    returnByValue: true,
    contextId
});

this never returns and simply hangs...
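
A generic way to make a hang like this visible, instead of waiting for hours, is to race the awaited call against a timer. The sketch below is not the fix shipped in single-file; withTimeout, its parameters, and the label text are hypothetical names chosen for illustration.

```javascript
// Sketch only: withTimeout is a hypothetical helper, not part of single-file.
// It rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping the Runtime.evaluate call in withTimeout would turn a silent hang into an error that names the step. Note that the timeout only stops the caller from waiting; it does not cancel the evaluation inside the browser.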

gildas-lormeau commented 2 months ago

Thank you, I was able to reproduce this issue and work around it. The fix is available in version 2.0.36, which I've just published.

gildas-lormeau commented 2 months ago

Regarding the loading of deferred content, your approach is the most reliable. Most of the time, the mechanisms in SingleFile are sufficient though. The --load-deferred-images-keep-zoom-level option also often gives better results.

sanjeevApple commented 2 months ago

Hi Gildas,

Thanks for looking into this. There seems to be another issue: the fix works for Chrome but not for the Deno browser. It is not a big deal, but I wanted to let you know.

So this works:

./single-file https://www.apple.com/macbook-air macbook-air.html --browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --browser-height=50000 --browser-wait-until="load" --load-deferred-images-keep-zoom-level=true

but this hangs when I remove the browser executable path and use Deno:

./single-file https://www.apple.com/macbook-air macbook-air.html --browser-height=50000 --browser-wait-until="load" --load-deferred-images-keep-zoom-level=true

Thanks a lot for all your help. Cheers...

sanjeevApple commented 2 months ago

Hi,

There is another issue. Since I don't know the scrolling height of the web page in advance, I picked a very large pixel height to accommodate all pages. Some pages work, but airpods-max still hangs. This is with the latest build.

./single-file https://www.apple.com/airpods-max airpods-max.html --browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --browser-height=50000 --browser-wait-until="load" --load-deferred-images-keep-zoom-level=true

This still hangs, so there might be another issue. www.apple.com, with a scrolling height of 20000 pixels, works with a browser height of 50000, but airpods-max, with a scrolling height of roughly 30000 pixels, hangs with a browser height of 50000.
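
When the scrolling height is not known in advance, an alternative to guessing a large --browser-height is to scroll one viewport at a time and re-measure the page after each step. The sketch below only illustrates that idea; scrollToBottom, pauseMs, and maxSteps are hypothetical names, and how such a script would be hooked into single-file is not established in this thread.

```javascript
// Sketch only: `win` is any window-like object exposing scrollTo, innerHeight
// and document.documentElement.scrollHeight. The helper name, pauseMs and
// maxSteps are illustrative assumptions, not SingleFile options.
async function scrollToBottom(win, pauseMs = 200, maxSteps = 500) {
  const root = win.document.documentElement;
  const visited = [];
  let y = 0;
  // maxSteps bounds the loop on pages that keep growing (infinite scroll).
  for (let step = 0; step < maxSteps; step++) {
    win.scrollTo(0, y);
    visited.push(y);
    // Give lazy loaders time to react before measuring the page again.
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
    // Re-read the height on every step: the page may grow as content loads.
    const maxY = Math.max(0, root.scrollHeight - win.innerHeight);
    if (y >= maxY) break;
    y = Math.min(y + win.innerHeight, maxY);
  }
  return visited; // the vertical offsets that were scrolled to, in order
}
```

In a page context this would be called as scrollToBottom(window). Passing the window in as a parameter keeps the helper usable with a stub object outside a browser.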

sanjeevApple commented 2 months ago

It is back to the original problem; for some reason the fix didn't help. airpods-max, with a scrolling height of roughly 30000 pixels, hangs even with a browser height of 30000.

./single-file https://www.apple.com/airpods-max airpods-max.html --browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --browser-height=30000 --browser-wait-until="load" --load-deferred-images-keep-zoom-level=true

This hangs.

sanjeevApple commented 2 months ago

I also observed that if you load the airpods-max page in a browser, manually scroll to the bottom, and use the plugin to save the file, there is no problem at all. It is also very fast, under 5 seconds, presumably because the page is already loaded in the browser. With the single-file CLI, there are two problems: the hang, and the speed, since it takes a minute or so compared to 5 seconds. I wonder what is so different between the browser extension and the CLI, given that both use the same Chrome browser to render the web page and then save it. The only other difference would be manual scrolling versus --browser-height=50000.

Thanks...

mupavan commented 6 days ago

Thanks a lot for your work. This tool has been very useful for offline use. However, I'm running into the same issue as @sanjeevApple: the script hangs while executing singlefile.getPageData. I'm running in headless mode with Chromium (also tested with Brave). It does not hang for all pages, but it does hang for pages like https://www.thesun.co.uk/.

cc @LawrenceMMStewart

gildas-lormeau commented 3 days ago

@mupavan Are you able to save this URL with single-file from the command line interface?

LawrenceMMStewart commented 1 day ago

Hi @gildas-lormeau .

First of all, thank you for making single-file; it's great tooling :)

Regarding the above, we observed commands that hang, such as:

/root/single-file-cli/single-file https://nypost.com/2022/03/03/ukraine-president-zelensky-survived-three-assassination-attempts/

This occurs with several different Chromium-based binaries (I tried changing the browser to see whether it was the cause).