Add `Parse HTML` keyword to XML Library

damies13 commented 2 weeks ago

I would like to suggest adding a Parse HTML keyword to XML Library.

Why:

I need to test the HTML output of an application that writes an HTML file to the local file system.
When I tried to parse the HTML file with the Parse XML keyword, I got errors because HTML is not always well-formed XML:
- <meta> and <img> elements do not have closing tags, so the XML parser rejects them
I do not want to install a web browser, a browser-based library, and the necessary dependencies on my test runner just to test the HTML output. Loading the HTML as an element tree and then using the XML Library keywords allows me to test what is needed.
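
To illustrate the parsing problem described above, here is a minimal Python sketch using only the standard library. The sample markup and variable names are made up for illustration; it only shows that an XML parser rejects unclosed `<meta>` and `<img>` elements, while the same markup with self-closing tags parses fine.

```python
import xml.etree.ElementTree as ET

# Illustrative HTML: <meta> and <img> are void elements with no closing tag.
html_doc = '<html><head><meta charset="utf-8"></head><body><img src="a.png"></body></html>'

try:
    ET.fromstring(html_doc)
    parsed_as_xml = True
except ET.ParseError as err:
    # The XML parser fails because <meta> is never closed.
    parsed_as_xml = False
    print(f"XML parser rejected the HTML: {err}")

# The same markup with self-closing tags is well-formed XML and parses fine.
xhtml_doc = html_doc.replace('"utf-8">', '"utf-8"/>').replace('"a.png">', '"a.png"/>')
root = ET.fromstring(xhtml_doc)
print(root.find("body/img").get("src"))
```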

Alternatives: None really. I considered creating an HTML library that would basically be a copy of XML Library, but that seems like a big duplication of effort, as it would be using the same lxml.etree module anyway.

Workaround: Currently I load the HTML as an element tree using lxml's HTML parser and then pass the etree object to the XML library keywords. Example below:

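    # VAR values are literal strings, so they must not be wrapped in quotes.
    VAR     ${sectionid}=   my_section_id
    VAR     ${htmlfile}=    /path/to/file/myfile.html
    # Parse with lxml's HTML parser, which tolerates unclosed elements such as <meta> and <img>.
    ${html}=    Evaluate    lxml.etree.parse(r'${htmlfile}', lxml.etree.HTMLParser())   modules=lxml.etree
    ${table}=   Get Element     ${html}     //div[@id='${sectionid}']//table
    ${value}=   Get Elements Texts  ${table}    tr/td[1]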
Would you like me to work on a PR for this?

pekkaklarck commented 2 weeks ago

Sounds good to me, especially if

- this can be implemented without extra dependencies
- no other HTML-specific functionality is needed, i.e. XPath is enough for finding elements.

Based on your example, the first point above is true at least with lxml. Do you know whether the standard ElementTree supports HTML? Parse HTML working only when lxml is available would be OK; I think there's at least one such keyword already.
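
As a rough illustration of the "only when lxml is available" idea, the following Python sketch guards the lxml import and fails with a clear message otherwise. The function name `parse_html` and the error text are hypothetical, not the actual XML library implementation; as far as I know the standard `xml.etree.ElementTree` module only ships an XML parser, which is why the sketch relies on `lxml.etree.HTMLParser`.

```python
try:
    from lxml import etree as lxml_etree
except ImportError:
    # lxml is optional; HTML parsing is simply unavailable without it.
    lxml_etree = None


def parse_html(source: str):
    """Hypothetical helper: parse an HTML string into an lxml element."""
    if lxml_etree is None:
        raise RuntimeError("Parsing HTML requires the lxml module to be installed.")
    # HTMLParser tolerates unclosed elements such as <meta> and <img>.
    return lxml_etree.fromstring(source, lxml_etree.HTMLParser())
```

The returned element could then be queried with XPath in the same way as the workaround above, for example `parse_html(text).xpath("//div[@id='my_section_id']//table")`.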

I'm slightly worried that this would open the door to future HTML-related requests, such as being able to use CSS selectors. I believe that would require a new dependency, and at that point it would be better to have an external library either only for HTML or for both HTML and XML.

damies13 commented 2 weeks ago

I'm not sure about CSS selectors, as I only used XPath for my test. I see your concern, though.

Perhaps it would be enough to put a note in the Parse HTML documentation on which HTML features will and won't be supported?

I'll do some research to find out whether lxml supports CSS selectors and come back on that.

pekkaklarck commented 2 weeks ago

My point was that I don't consider CSS selector support that important in an XML library, and my worry was that people could want to turn it into an HTML library. That said, being able to parse HTML as XML would itself be convenient.

I also noticed that lxml has limited support for CSS selectors. I'm fine with it being exposed, especially if it works without any new dependency. The main benefit I see is working with classes, as something like span.example is very annoying to write properly as an XPath expression. Anyway, that would require a separate issue.
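
To show why a class selector such as span.example is awkward in XPath, here is a small Python sketch that builds the usual XPath 1.0 equivalent. The helper name `class_xpath` is hypothetical; the expression pads the `class` attribute with spaces so that `example` matches as a whole word even when the element has several classes.

```python
def class_xpath(tag: str, css_class: str) -> str:
    """Hypothetical helper: XPath 1.0 expression roughly matching `tag.css_class`."""
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {css_class} ')]"
    )


expr = class_xpath("span", "example")
print(expr)
```

A plain `//span[@class='example']` would miss elements such as `<span class="example highlighted">`, which is the reason for the longer expression.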

robotframework / robotframework

Add `Parse HTML` keyword to XML Library #5256