openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Failure of `thalesdoc_en_all`: AttributeError: 'NoneType' object has no attribute 'startswith' #155

Closed: benoit74 closed this issue 5 months ago

benoit74 commented 7 months ago

We have an error in zimit2 for https://farm.openzim.org/pipeline/2119ed82-41aa-4561-a359-8e75968b445c/debug

Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 464, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 89, in main
    return converter.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 277, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 490, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/items.py", line 72, in __init__
    ).rewrite(self.content)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 76, in rewrite
    self.feed(content)
  File "/usr/lib/python3.10/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/usr/lib/python3.10/html/parser.py", line 170, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python3.10/html/parser.py", line 344, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 96, in handle_starttag
    self.send(transform_attrs(attrs, url_rewriter, self.css_rewriter))
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 46, in transform_attrs
    return " ".join(format_attr(*attr) for attr in processed_attrs)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 46, in <genexpr>
    return " ".join(format_attr(*attr) for attr in processed_attrs)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 45, in <genexpr>
    processed_attrs = (process_attr(attr, url_rewriter, css_rewriter) for attr in attrs)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 20, in process_attr
    return (attr[0], url_rewriter(attr[1]))
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/content_rewriting/html.py", line 93, in <lambda>
    url_rewriter = lambda url: self.url_rewriter(url, False)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/url_rewriting.py", line 161, in __call__
    if url.startswith("data:") or url.startswith("blob:"):
AttributeError: 'NoneType' object has no attribute 'startswith'

Looks like we have one more `# pyright: ignore` that should not be there in my last PR (`url` can indeed be None, and that is a problem).

        if tag == "a":
            url_rewriter = lambda url: self.url_rewriter(  # noqa: E731
                url, False  # pyright: ignore
            )
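
For illustration, here is a minimal sketch of how a `None` URL can reach the rewriter and of one possible guard. Python's `HTMLParser` reports an attribute written without a value (for example `<a href>`) as `(name, None)`. The names `AttrCollector`, `none_safe` and `fake_rewriter` are hypothetical and are not part of warc2zim; this is not necessarily what #154 does.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class AttrCollector(HTMLParser):
    """Collect the (name, value) pairs of every start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.attrs: list[tuple[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def none_safe(
    rewriter: Callable[[str], str]
) -> Callable[[Optional[str]], Optional[str]]:
    """Wrap a URL rewriter so that valueless attributes pass through untouched."""

    def wrapped(url: Optional[str]) -> Optional[str]:
        if url is None:
            # Attribute had no value at all (e.g. <a href>): nothing to rewrite.
            return None
        return rewriter(url)

    return wrapped


def fake_rewriter(url: str) -> str:
    # Hypothetical stand-in for the real URL rewriter (assumption).
    if url.startswith("data:") or url.startswith("blob:"):
        return url
    return "rewritten/" + url


collector = AttrCollector()
collector.feed('<a href>no target</a><a href="page.html">ok</a>')

safe = none_safe(fake_rewriter)
rewritten = [(name, safe(value)) for name, value in collector.attrs]
```

Calling `fake_rewriter(None)` directly would raise the same `AttributeError` as in the traceback; the wrapper only makes the `None` case explicit instead of hiding it behind a type-checker ignore.
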
benoit74 commented 7 months ago

Oh no, this is fixed by 2e40adf (#154), isn't it?

mgautierfr commented 7 months ago

> Oh no, this is fixed by 2e40adf (#154), isn't it?

Seems so. At least with #154, since I have rewritten your commit. But yes, it is "fixed".

benoit74 commented 7 months ago

Let's keep it open until we have merged and made a new run to confirm it.

kelson42 commented 7 months ago

It still fails in CSS rewriting: https://farm.openzim.org/pipeline/dbd1565e-d5b9-4642-8788-f472bb33c147/debug

mgautierfr commented 7 months ago

This is caused by this page (https://thalesdocs.com/sta/agents/freeradius/download-keys/index.html), which contains invalid inline styles such as:

<p><img 1px alt="Authentication Agent Settings" id="c0c0c0;" solid src="../../../images/operator/authentication-agent-settings.png" style="&quot;border:"></p>
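
To see why the rewriter ends up with a CSS string like `"border:`, the tag above can be fed to Python's standard `HTMLParser`. This is only a sketch: `StyleCollector` is a hypothetical helper, and the `src` value is written as the plain relative path.

```python
from html.parser import HTMLParser
from typing import Optional


class StyleCollector(HTMLParser):
    """Remember the attributes of the last <img> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: dict[str, Optional[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.seen = dict(attrs)


snippet = (
    '<p><img 1px alt="Authentication Agent Settings" id="c0c0c0;" solid '
    'src="../../../images/operator/authentication-agent-settings.png" '
    'style="&quot;border:"></p>'
)

collector = StyleCollector()
collector.feed(snippet)

# The parser unescapes &quot; so the inline style is the literal text: "border:
inline_style = collector.seen["style"]
# The stray tokens "1px" and "solid" are parsed as attributes without a value.
valueless = [name for name, value in collector.seen.items() if value is None]
```

The original `border: solid 1px #c0c0c0` declaration has been scattered across bogus attributes, so what remains in `style` is not valid CSS.
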
benoit74 commented 7 months ago

@mgautierfr What do you intend to do in such a situation?

mgautierfr commented 7 months ago

PR #175 fixes this by simply returning the original CSS if we cannot parse it.
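
The idea can be sketched as follows. This is not the code of #175: `strict_inline_rewriter` is a toy, hypothetical stand-in for a real CSS rewriter that raises on malformed input, and `rewrite_or_keep` is an assumed name for the fallback wrapper.

```python
import logging
from typing import Callable

logger = logging.getLogger("css-fallback-sketch")


def strict_inline_rewriter(css: str) -> str:
    """Toy strict rewriter (assumption): normalise declarations, raise if malformed."""
    out = []
    for decl in css.split(";"):
        decl = decl.strip()
        if not decl:
            continue
        name, sep, value = decl.partition(":")
        name, value = name.strip(), value.strip()
        well_formed = (
            bool(sep)
            and name.replace("-", "").isalpha()
            and bool(value)
            and "=" not in value
        )
        if not well_formed:
            raise ValueError(f"invalid declaration: {decl!r}")
        out.append(f"{name}: {value}")
    return "; ".join(out)


def rewrite_or_keep(css: str, rewriter: Callable[[str], str]) -> str:
    """Return the rewritten CSS, or the original text if rewriting fails."""
    try:
        return rewriter(css)
    except Exception:
        # Broad on purpose: any parser failure falls back to the input as-is.
        logger.warning("Error rewriting CSS\n%s", css)
        return css


valid = rewrite_or_keep("border: solid 1px #c0c0c0", strict_inline_rewriter)
broken_quote = rewrite_or_keep('"border:', strict_inline_rewriter)
broken_attr = rewrite_or_keep("width: 100%; cellspacing=", strict_inline_rewriter)
```

The trade-off is that URLs inside unparseable CSS are left unrewritten, but the conversion no longer aborts on a single bad attribute.
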

mgautierfr commented 7 months ago

For reference, using the "1000 page warc of Thales", the errors are:

Error rewriting CSS
"border:
Error rewriting CSS
"border:
Error rewriting CSS
border: solid 1px #c0c0c0; width= 100% 
Error rewriting CSS
border: solid 1px #c0c0c0; width = 100%
Error rewriting CSS
"width:
Error rewriting CSS
width: 100%; cellspacing=
Error rewriting CSS
width: 100%; cellspacing=
Error rewriting CSS
"border-style:
Error rewriting CSS
"border:
Error rewriting CSS
"border-left-style:
Error rewriting CSS
border: solid 1px #c0c0c0; width= 100% 
Error rewriting CSS
border: solid 1px #c0c0c0; width = 100%
Error rewriting CSS
"border:
Error rewriting CSS
"border:
Error rewriting CSS
border-left-style: solid;border-left-width: 1px;border-left-color: #c0c0c0;border-right-style: solid;border-right-width: 1px;border-right-color: #c0c0c0;border-top-style: solid;border-top-width: 1px;border-top-color: #c0c0c0;border-bottom-style: solid;border-bottom-width: 1px;border-bottom-color: #c0c0c0;w
Error rewriting CSS
border: solid 1px #c0c0c0; width= 100% 
Error rewriting CSS
border: solid 1px #c0c0c0; width = 100%
Error rewriting CSS
width: 100%; cellspacing=
Error rewriting CSS
width: 100%; cellspacing=
Error rewriting CSS
border: solid 1px #c0c0c0; width= 100% 
Error rewriting CSS
border: solid 1px #c0c0c0; width = 100%
kelson42 commented 7 months ago

Do we rewrite anything other than url()? Do we really need something as sophisticated as a CSS parser? The zimwriterfs approach might be good enough, perhaps as a fallback?

mgautierfr commented 7 months ago

Hmm. I don't like regex rewriting. There are a lot of tricky cases with escaped characters that can fool us. But as a fallback for invalid CSS it may be OK: if the regex rewrite breaks on broken CSS, we can say it is the fault of the CSS, not the rewriter.
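
As a rough sketch of what such a regex fallback could look like (an assumption for illustration; it is neither zimwriterfs's nor warc2zim's actual code, and `CSS_URL_RE` and `rewrite_css_urls` are hypothetical names):

```python
import re
from typing import Callable

# Matches url(...) with an optional pair of matching quotes around the URL.
CSS_URL_RE = re.compile(
    r"""url\(\s*(?P<quote>['"]?)(?P<url>[^'")]*?)(?P=quote)\s*\)""",
    re.IGNORECASE,
)


def rewrite_css_urls(css: str, rewrite_url: Callable[[str], str]) -> str:
    """Rewrite every url(...) reference and leave the rest of the text untouched."""

    def replace(match: re.Match) -> str:
        url = match.group("url")
        if url.startswith("data:") or url.startswith("blob:"):
            # Inline resources are kept as they are.
            return match.group(0)
        quote = match.group("quote")
        return f"url({quote}{rewrite_url(url)}{quote})"

    return CSS_URL_RE.sub(replace, css)


def prefix(url: str) -> str:
    # Hypothetical rewriting rule used only for this example.
    return "X/" + url


mixed = rewrite_css_urls('background: url("../img/a.png"); width= 100%', prefix)
bare = rewrite_css_urls("background: url( img/b.png )", prefix)
untouched = rewrite_css_urls('"border:', prefix)
```

Because it never parses declarations, this tolerates the broken styles listed above. It is also exactly the kind of rewrite that escaped characters can fool: a URL containing an escaped parenthesis such as `url(a\)b.png)` is cut at the wrong place.
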

benoit74 commented 7 months ago

Fixed by #175