gildas-lormeau / SingleFile

Web Extension for saving a faithful copy of a complete web page in a single HTML file
GNU Affero General Public License v3.0
14.8k stars 971 forks source link

Manually assigned slots not correctly preserved #1526

Open danny0838 opened 1 month ago

danny0838 commented 1 month ago

Describe the bug The status of manually assigned slots is not correctly preserved in the captured page.

Manually assigned slots is supported since Chromium 86 and Firefox 92. I have seem websites using slot elements but not sure whether this feature is used in a real world site or not.

To Reproduce Steps to reproduce the behavior:

  1. Create the test page and visit it:

    <!DOCTYPE html>
    <meta charset="UTF-8">
    <div>
    <span>Default</span>
    <span slot="">Default2</span>
    Default3
    <span slot="person">Mr. Apple</span>
    <span slot="person">Mr. Black</span>
    <span slot="person">Ms. Cindy</span>
    <template shadowrootmode="*open" shadowrootslotassignment="manual">
      <style>
      slot { display: block; }
      ::slotted(*) { background-color: yellow; }
      </style>
      <slot>default missing</slot>
      <slot name="person">person missing</slot>
      <div>
      <span slot="person">person1</span>
      <span slot="person">person2</span>
      <span slot="person">person3</span>
      <template shadowrootmode="*open" shadowrootslotassignment="manual">
        <style>
        slot { display: block; }
        ::slotted(*) { background-color: yellow; }
        </style>
        <slot name="person">person missing</slot>
      </template>
      </div>
    </template>
    </div>
    <script>
    function loadShadowDoms(root = document, {recursive = true, clear = true} = {}) {
      for (const t of root.querySelectorAll('template[shadowrootmode]')) {
        const elem = t.parentNode;
        if (!elem.shadowRoot) {
          // Allow using e.g. shadowrootmode="*open" for downward compatibility,
          // e.g. to prevent an issue that ShadowRoot.delegatesFocus is supported
          // while template[shadowrootdelegatesfocus] is not supported.
          let mode = t.getAttribute('shadowrootmode');
          if (!['open', 'closed'].includes(mode)) {
            if (mode[0] === '*') { mode = mode.slice(1); }
          }
    
          const clonable = t.hasAttribute('shadowrootclonable');
          const delegatesFocus = t.hasAttribute('shadowrootdelegatesfocus');
          const serializable = t.hasAttribute('shadowrootserializable');
          const slotAssignment = t.getAttribute('shadowrootslotassignment') || undefined;
          const shadow = elem.attachShadow({mode, clonable, delegatesFocus, serializable, slotAssignment});
          shadow.innerHTML = t.innerHTML;
          if (recursive) {
            loadShadowDoms(shadow, {recursive, clear});
          }
        }
        if (clear) {
          t.remove();
        }
      }
    }
    loadShadowDoms();
    
    var host = document.querySelector('div');
    var shadow = host.shadowRoot;
    var slots = shadow.querySelectorAll('slot');
    var spans = host.querySelectorAll('span');
    slots[0].assign(spans[0], spans[1].nextSibling);
    slots[1].assign(spans[2], spans[3]);
    
    var host = shadow.querySelector('div');
    var shadow = host.shadowRoot;
    var slots = shadow.querySelectorAll('slot');
    var spans = host.querySelectorAll('span');
    slots[0].assign(spans[1], spans[2]);
    </script>
  2. Save the page with default options
  3. Open the saved page

Expected behavior The captured page should show:

Default Default3
Mr. AppleMr. Black
person2person3

as in the original page.

Environment

gildas-lormeau commented 1 month ago

Are you suggesting to add a non-standard shadowrootslotassignement attribute to the template tag?

I do not see it here: https://html.spec.whatwg.org/#the-template-element (I just discovered the declarative shadow DOM has been merged into the spec. BTW). If it's not standard, it should be at least prefixed to avoid potential forward compatibility issues (Mootools, I haven't forgotten you).

I was not aware that other shadowroot* attributes were also standardized. I will fix them, except shadowrootslotassignement for now.

danny0838 commented 1 month ago

No, it's for making test cases only and I'm not suggesting anything. Actually it will never work even used, since template[shadowmode]s will be loaded into shadow DOMs and the script won't be able to read their attributes.

Currently WebScrapBook handles shadow DOM related things with scripts, mostly for historical reasons because it's so recently that I know of the declarative shadow DOMs. Actually I'm investigating whether to switch to HTMLifying or add as another option.

Though, it seems tricky to HTMLify manually assigned slots. A reasonable strategy is probably to forget the slotAssignment property and rewrite the slot attribute of each slottables. For a non-assigned slottable, maybe assign a new random unused string? But assigning slot attribute doesn't work for a slottable text node, it may be required to rewrite it as a span element, with a risk of breaking the page layout.

HTMLifying with declarative shadow DOMs may also introduce a potential compatibility issue for older browsers. As the code comment in the OP says, if a browser supports ShadowRoot.delegatesFocus but not yet template[shadowrootdelegatesfocus], the loaded shadow DOM will lose the property when handled declaratively rather than scripts.

BTW, do you have a test suite for page saving?

gildas-lormeau commented 1 month ago

Supporting this property as part of SingleFile seems tedious as you described it, I'll have to think about it for a while too... Thanks again for creating this issue which will allow me to dive into this problem at any time. By the way, I've just seen that getHTML() is supported in all major browsers now, cf. https://developer.mozilla.org/en-US/docs/Web/API/Element/getHTML and https://developer.mozilla.org/en-US/docs/Web/API/ShadowRoot/getHTML. That's good news :) even if we still have to polyfill it meanwhile (i.e. during the next 5 years).

In fact, historically, SingleFile was more or less a hack for Chrome because it did not support the MHTML format. I've always seen it as a "fun" project, so I don't have a test suite. It has also become a lot more popular than I thought initially. In practice, I fix a lot of bugs/regressions, but it's still fun and plenty of surprises :) (It's good for the memory too). In practice, I often use the examples on the MDN pages for quick tests (e.g. the iframe element). I've also discovered that saving press articles in different languages is also very effective for finding bugs. It seems that every region in the world has its own bug patterns.

danny0838 commented 1 month ago

I don't find getHTML() interesting because what it does can already be done (even though with polyfills), and as a general API it probably cannot take care of closed shadow roots, which requires a special API under the extension context.

Although Chrome supports MHTML now, I don't really like it. It's too difficult to write a script to analyze or convert. It's not UTF-8 encoded and all Asian texts are encoded and it's difficult to even perform a naive plain text search in the source code. Let alone that built-in module for page saving is too weak to handle so many things, for both html (+supporting folder) or mhtml.

MAFF is so far the most promising format for single page archiving but unfortunately no browser vendor is active to support it. I don't think it's really so difficult to implement by e.g. extending the module that handles the jar: protocol in Firefox or extending the zipped extension handling mechanism in Chromium...

WSB has been having a test suite to systematically verify many capture cases since 7 years ago. I personally find it helpful for code refactoring, regression prevention, and dealing with browser compatibility issues. The test cases can be either tested with an automated script or by manually opening in the browser and seeing whether it looks identical after a capture.

Actually the recently reported issues are mostly come from it (e.g. manual slots, form, adopted stylesheets, namespace). Additionally I have still seen several cases that are not handled appropriately by SF, but will probably be too lazy to recheck and report all of them. Just FYI, You can test them on your own if you're interested in.

gildas-lormeau commented 1 month ago

You're right, getHTML() isn't terribly useful especially under the extension context.

I'm not a fan of MHTML either, I thought it would have caught on anyway when Google implemented it in Chrome. I agree that the MAFF/ZIP format seems to be the simplest and most suitable for saving web pages. The self-extracting format in SingleFile is quite similar actually, it's a zip file and it's supported in all browsers (since I managed to get around the need to rely on fetch() to read the binary content in the page). It's "just" a bit complicated to generate it (see the presentation here for more info)

I also agree that a test suite is really desirable in this kind of project. I'm very grateful for your permission to use yours. I'll have to take a look. I guess it could be relatively easy to run them automatically with SingleFile CLI.