openedx / olxcleaner

Tool for checking edX courses for errors and creating content reports
GNU General Public License v3.0

InvalidHTML validation at stage 1 is very strict and does not allow HTML5 tags. #7

Open awaisdar001 opened 3 years ago

awaisdar001 commented 3 years ago

I feel that the InvalidHTML ERROR at level 1 is too strict. It also does not understand HTML5 tags, which makes it unsuitable for OLX validation.

See my example below.

from lxml import etree

html = """
<figure>
  <img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
"""
parser = etree.HTMLParser(recover=False)
content = etree.fromstring(html, parser)

Here is the error: ERROR InvalidHTML (html/w2_s6_u1_c1.html): Tag figure invalid, line 11, column 54

We should silence this error until there is a better approach to the problem. Alternatively, we could downgrade it to a warning, since Studio does not break on such invalid HTML tags. FYI @jolyonb
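A minimal sketch of the "downgrade to a warning" idea, using lxml only. It assumes that parsing with recover=True (so that libxml2 logs every problem instead of raising on the first one) is acceptable, and that unknown tags are reported with the libxml2 error type HTML_UNKNOWN_TAG. The function name split_html_problems is hypothetical and not part of olxcleaner, and the exact log contents may vary with the installed libxml2 version.

```python
from lxml import etree

SAMPLE = """
<figure>
  <img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
"""


def split_html_problems(html):
    """Return (warnings, errors) as two lists of message strings.

    Hypothetical helper: unknown-tag reports are treated as warnings,
    everything else the parser logs is treated as an error.
    """
    # recover=True keeps parsing after a problem, so the full error log
    # is available afterwards (recover=False raises on the first problem).
    parser = etree.HTMLParser(recover=True)
    etree.fromstring(html, parser)

    warnings, errors = [], []
    for entry in parser.error_log:
        text = f"{entry.message}, line {entry.line}, column {entry.column}"
        # HTML_UNKNOWN_TAG is the libxml2 error type behind "Tag ... invalid".
        if entry.type_name == "HTML_UNKNOWN_TAG":
            warnings.append(text)
        else:
            errors.append(text)
    return warnings, errors


warnings, errors = split_html_problems(SAMPLE)
print("warnings:", warnings)
print("errors:", errors)
```

With this split, HTML5 tags such as figure would at most produce warnings, while structural problems would still be reported as errors.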

jolyonb commented 3 years ago

That's really weird. I thought the html parser was just parsing syntax, not also parsing for whether or not the tags were valid HTML tags. I wonder if there's a setting in the HTMLParser for this...

jolyonb commented 3 years ago

Perhaps we should upgrade to the HTML5Parser? See: https://lxml.de/html5parser.html

awaisdar001 commented 3 years ago

> Perhaps we should upgrade to the HTML5Parser? See: https://lxml.de/html5parser.html

Good call! I will take a look at this. It looks impressive at a glance.

awaisdar001 commented 3 years ago

I took another look. Can you show how this can be used in the same way as etree.HTMLParser? The current implementation is as follows.

try:
    parser = etree.HTMLParser(recover=False)
    content = etree.fromstring(html, parser)
except Exception as e:  # bind the exception so that e.args is defined below
    errorstore.add_error(InvalidHTML(new_file, error=e.args[0]))

jolyonb commented 3 years ago

I think something like the following:

Install library: pip install html5lib

Code:

from lxml.html import html5parser

html = """
<figure>
  <img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
"""

parser = html5parser.HTMLParser(strict=True)
content = html5parser.fragment_fromstring(html.encode('utf-8'), parser=parser)

Not sure that it gives back quite the same quality of error messages that lxml was giving, though :-(
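To compare message quality, one option is to run the html5lib-backed parser in non-strict mode and read the problems it recorded, rather than stopping at the first exception. This is only a sketch: it assumes html5lib is installed, relies on html5lib's parser.errors list and its E message table, and the helper name collect_html5_errors is hypothetical.

```python
from html5lib.constants import E
from lxml.html import html5parser

SAMPLE = """
<figure>
  <img src="pic_trulli.jpg" alt="Trulli" style="width:100%">
  <figcaption>Fig.1 - Trulli, Puglia, Italy.</figcaption>
</figure>
"""


def collect_html5_errors(html):
    """Return every parse error html5lib recorded, as readable strings."""
    # strict=False records errors on the parser instead of raising.
    parser = html5parser.HTMLParser(strict=False)
    html5parser.fragments_fromstring(html, parser=parser)

    messages = []
    # Each entry is ((line, column), error_code, format_variables).
    for (line, column), code, data in parser.errors:
        messages.append(f"{E[code] % data}, line {line}, column {column}")
    return messages


print(collect_html5_errors(SAMPLE))
print(collect_html5_errors("<p><b>unclosed</p>"))
```

The formatted strings carry a line and column like the lxml messages do, which should make it easier to judge whether they are good enough for an InvalidHTML report.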

awaisdar001 commented 3 years ago

Interesting. It does give errors, from what I saw, but we would have to gather more information on whether they would be useful. I am spending some time on this today and will respond with more information. Thanks for jumping in, @jolyonb :)