Handling of malformed iframe tags

rushter commented 3 years ago

I've noticed a pretty annoying problem on some websites (I think there are at least a thousand of them in Alexa 1M).

An unclosed iframe tag breaks all the HTML below it.

Here is an example:

<noscript>
    <iframe
            height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW" class="lazyload"
            src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
        <noscript>
            <iframe src="https://www.googletagmanager.com/ns.html?id=GTM-M5RK4MW"
                    height="0" width="0">
        </noscript>
    </iframe>
</noscript>

It's missing a closing iframe tag, but Modest still parses it correctly.

But for some reason, if you open it in Chrome (to render the JavaScript parts) and dump the HTML, you get this:

<noscript>
<iframe 
      height="0" width="0" data-src="https://www.googletagmanager.com/ns.html?id=" class="lazyload"
       src="data:image/gif;base64,R0lGODlhAQ">
       <noscript>
        <iframe src="https://www.googletagmanager.com/ns.html?id=" height="0" width="0">
</noscript>

Now neither iframe has a closing tag.

The problem with this is that Modest will ignore everything after such a tag:

<noscript>
<iframe data-src="https://www.googletagmanager.com/ns.html?id=">
</noscript>

<script></script>
<script></script>
<script></script>

Searching for script nodes using myhtml_get_nodes_by_name or using CSS selectors returns no results.

@lexborisov Are there any ways to improve this? Other parsers can still handle this.

vtorri commented 3 years ago

@rushter try https://github.com/lexbor/lexbor

rushter commented 3 years ago

> @rushter try https://github.com/lexbor/lexbor

I maintain a Python binding for Modest, and lexbor is not yet ready to replace it.

omerh2802 commented 1 year ago

Any updates on this?

lexborisov commented 1 year ago

Hi @rushter @omerh2802

Maybe we should tell the parser that we have enabled SCRIPT? Then everything in the noscript tag will be treated as text. It won't affect anything else.

tree->flags |= MyHTML_TREE_FLAGS_SCRIPT

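    // `myhtml` is an already initialized myhtml_t* instance.
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);

    // Mark scripting as enabled so <noscript> content is treated as text.
    tree->flags |= MyHTML_TREE_FLAGS_SCRIPT;

    myhtml_parse(tree, MyENCODING_UTF_8, html, length);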

omerh2802 commented 1 year ago

Hi @lexborisov, that sounds like a great idea!

lexborisov / Modest

Handling of malformed iframe tags #86