gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

CLI doesn't capture Web Components #58

Open tomaszferens opened 11 months ago

tomaszferens commented 11 months ago

When the CLI is used to save a web page that contains Web Components (Salesforce's Lightning Web Components in this case), it produces an invalid HTML file.

The issue is specific to the CLI: the extension produces a correct HTML file for the same page.

To Reproduce

  1. Create a new sandbox (it takes about 3 minutes to spin up a machine), then click Launch.
  2. Clone the SingleFile CLI repo, install the packages, and run: npm exec single-file "https://example.com" -- --browser-wait-delay=20000 --browser-headless=false
  3. Copy the link from step 1 and navigate to it. You have 20 seconds to do so; the delay can be changed with the --browser-wait-delay option.
  4. Wait until SingleFile takes the snapshot.
  5. Result: webcomponents.html.zip

Extension result: extension-webcomponents.html.zip

Extension ✅ : image

CLI ❌ : image

gildas-lormeau commented 11 months ago

Generally speaking, SingleFile CLI should support web components. For example, it can save https://bugs.chromium.org/p/chromium/issues/detail?id=1040752, a page that contains hundreds of them, almost correctly. The only problem with that page is that a <table> tag is missing.

Do I need an account on SalesForce to do your test? I cannot reproduce the procedure you described because I get a login page when pasting the URL on step 3.

tomaszferens commented 11 months ago

Thanks for looking into that. I think this might be an issue specific to their web components then.

> Do I need an account on SalesForce to do your test? I cannot reproduce the procedure you described because I get a login page when pasting the URL on step 3.

No, you don't need an account. Just click the "create new sandbox" link from step 1, wait a minute, and then click the "Launch" button:

image

gildas-lormeau commented 11 months ago

I think I have identified the cause of the issue. The Aura components overwrite properties like innerHTML. When debugging the code in the extension, I noticed that the innerHTML values of these elements are not empty, yet they are empty when I inspect the same elements in the Dev Tools. The extension code can read the native value of innerHTML because it has access to a "protected" DOM that scripts on the page cannot overwrite. The CLI tool (like the Dev Tools) has no such "protected" DOM, so it reads the overwritten value instead of the native one, i.e. an empty string.
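To illustrate the mechanism described above, here is a minimal sketch that runs in plain Node.js without a DOM. `FakeElement` is a hypothetical stand-in for a real element class, and the assumption that the override is installed on the element instance is mine; the thread does not say how Aura actually overwrites innerHTML.

```javascript
// Simplified model (assumption): a prototype getter plays the role of the
// browser's native innerHTML accessor. No real DOM is involved.
class FakeElement {
  constructor(markup) {
    this._markup = markup;
  }
  get innerHTML() {
    return this._markup;
  }
}

const el = new FakeElement("<span>hello</span>");

// A page script shadows the property on the element itself, so ordinary
// reads now return an empty string.
Object.defineProperty(el, "innerHTML", { get: () => "", configurable: true });

// Code that still holds the untouched prototype accessor can bypass the
// override by calling the original getter directly.
const nativeGetter = Object.getOwnPropertyDescriptor(
  FakeElement.prototype,
  "innerHTML"
).get;

const overwrittenValue = el.innerHTML;      // what the CLI would see
const nativeValue = nativeGetter.call(el);  // what the extension would see

console.log({ overwrittenValue, nativeValue });
```

In a real page, a script could also replace the prototype accessor itself, so this trick is only reliable from a context that page scripts cannot reach.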

tomaszferens commented 11 months ago

Is it possible to instruct puppeteer to read the native value of innerHTML? Or is the extension more powerful in this case, with no workaround available for puppeteer?

gildas-lormeau commented 11 months ago

Actually, the correct term is "isolated world". Unfortunately, I can confirm that puppeteer does not offer this feature today, see https://github.com/puppeteer/puppeteer/issues/2671. A possible workaround would be to run the browser in non-headless mode with SingleFile installed as an extension, but that would require some work in order to communicate with SingleFile (or a fork of it).

tomaszferens commented 11 months ago

Thanks @gildas-lormeau. I found Page.createIsolatedWorld in CDP. I wonder if I could use that with puppeteer to fix the issue. From what I understand I would need to create this isolated world for a page and each frame within it.
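The idea of creating an isolated world for the page and each frame within it could be sketched roughly as follows. This is an untested outline, not the approach SingleFile CLI ended up using. The helper names `collectFrameIds` and `evaluateInIsolatedWorlds` and the world name are hypothetical; `client` is assumed to be any object with a CDP-style `send(method, params)` method. Only the CDP methods Page.getFrameTree, Page.createIsolatedWorld and Runtime.evaluate are real.

```javascript
// Walk the tree returned by Page.getFrameTree and collect every frame id.
function collectFrameIds(frameTree) {
  const ids = [frameTree.frame.id];
  for (const child of frameTree.childFrames || []) {
    ids.push(...collectFrameIds(child));
  }
  return ids;
}

// Create one isolated world per frame and evaluate `expression` inside it.
// `client` is assumed to expose send(method, params) like a CDP session.
async function evaluateInIsolatedWorlds(client, expression, worldName = "snapshot-world") {
  const { frameTree } = await client.send("Page.getFrameTree");
  const results = [];
  for (const frameId of collectFrameIds(frameTree)) {
    const { executionContextId } = await client.send("Page.createIsolatedWorld", {
      frameId,
      worldName,
    });
    const { result } = await client.send("Runtime.evaluate", {
      expression,
      contextId: executionContextId,
      returnByValue: true,
    });
    results.push({ frameId, value: result.value });
  }
  return results;
}
```

With puppeteer, such a client could be obtained with, for example, `await page.target().createCDPSession()`. Out-of-process iframes would likely need their own sessions, which this sketch does not handle.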

gildas-lormeau commented 11 months ago

@tomaszferens Maybe. I ran some tests but was not able to make it work. If you want to experiment easily in SingleFile CLI, you can make your changes in the file https://github.com/gildas-lormeau/single-file-cli/blob/master/back-ends/puppeteer.js.

gildas-lormeau commented 6 months ago

Version 2.x now uses isolated worlds.