Open awaisdar001 opened 3 years ago
That's really weird. I thought the html parser was just parsing syntax, not also parsing for whether or not the tags were valid HTML tags. I wonder if there's a setting in the HTMLParser for this...
Perhaps we should upgrade to the HTML5Parser? See: https://lxml.de/html5parser.html
Good call! I will take a look at this; it looks impressive at a glance.
I've taken another look. Can you show how this can be used in the same way as etree.HTMLParser? The current implementation is as follows.
from lxml import etree

try:
    parser = etree.HTMLParser(recover=False)
    content = etree.fromstring(html, parser)
except Exception as e:  # bind the exception so that e.args[0] is defined
    errorstore.add_error(InvalidHTML(new_file, error=e.args[0]))
I think something like the following:
Install library:
pip install html5lib
Code:
from lxml.html import html5parser
html = """
<figure>
<img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
<figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
"""
parser = html5parser.HTMLParser(strict=True)
content = html5parser.fragment_fromstring(html.encode('utf-8'), parser=parser)
I'm not sure it gives back quite the same quality of error messages that lxml was giving, though :-(
Interesting. From what I saw it does give errors, but we would have to find out whether they are useful. I am spending some time on this today and will respond with more information. Thanks for jumping in, @jolyonb :)
I feel that reporting InvalidHTML as an ERROR at level 1 is very strict. The parser also does not understand HTML5 tags, which makes it unsuitable for OLX validation.
See my example below.
Here is the error:
ERROR InvalidHTML (html/w2_s6_u1_c1.html): Tag figure invalid, line 11, column 54
We should silence it until there is a better approach to this problem. Alternatively, we could convert it to a WARNING, since Studio does not break on such invalid HTML tags. FYI @jolyonb
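For illustration, the "convert it to a WARNING" idea could look something like the sketch below. It only inspects the message text shown above ("Tag figure invalid, line 11, column 54"). The names Level, HTML5_TAGS and classify_html_error are hypothetical and not part of the validator, and the tag list is a partial, hand-picked assumption rather than a complete HTML5 element list.

```python
import re
from enum import Enum


class Level(Enum):
    """Hypothetical severity levels, for illustration only."""
    WARNING = "WARNING"
    ERROR = "ERROR"


# Assumption: a partial, hand-picked set of HTML5 elements that an
# HTML4-era parser may report as invalid. Extend as needed.
HTML5_TAGS = frozenset({
    "article", "aside", "figcaption", "figure", "footer",
    "header", "main", "nav", "section",
})

# Matches messages of the form "Tag figure invalid, line 11, column 54".
_TAG_INVALID = re.compile(r"^Tag (?P<tag>[\w-]+) invalid")


def classify_html_error(message):
    """Return WARNING for 'Tag X invalid' when X is a known HTML5 tag, else ERROR."""
    match = _TAG_INVALID.match(message)
    if match and match.group("tag").lower() in HTML5_TAGS:
        return Level.WARNING
    return Level.ERROR
```

The except branch could then choose between reporting an error and a warning based on classify_html_error(e.args[0]). Any other parser message would still be treated as an ERROR.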