robotframework / robotframework

Generic automation framework for acceptance testing and RPA
http://robotframework.org
Apache License 2.0
9.94k stars 2.34k forks

Add `Parse HTML` keyword to XML Library #5256

Open damies13 opened 2 weeks ago

damies13 commented 2 weeks ago

I would like to suggest adding a Parse HTML keyword to XML Library.

Why:

Alternatives: None really. I considered creating an HTML library that would basically be a copy of XML Library, but that seems like a big duplication of effort, as it would be using the same lxml.etree module anyway.

Workaround: Currently I load the HTML as an element tree using lxml's HTML parser and then pass the etree object to the XML library keywords. Example below:

    VAR     ${sectionid}=   my_section_id
    VAR     ${htmlfile}=    /path/to/file/myfile.html
    ${html}=    Evaluate    lxml.etree.parse(r'${htmlfile}', lxml.etree.HTMLParser())   modules=lxml.etree
    ${table}=   Get Element     ${html}     //div[@id='${sectionid}']//table
    ${value}=   Get Elements Texts  ${table}    tr/td[1]
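For reference, the Python side of such a keyword could be quite small. The following is only a rough sketch, not the XML library's actual code: `parse_html` is a hypothetical helper name, and it assumes lxml is installed.

```python
def parse_html(source):
    """Hypothetical helper: parse an HTML string and return the root element.

    Assumes lxml is installed; it is imported lazily so that the
    dependency is only needed when HTML parsing is actually requested.
    """
    from lxml import etree

    # HTMLParser is lenient and accepts markup that is not well-formed XML.
    return etree.fromstring(source, etree.HTMLParser())


if __name__ == "__main__":
    html = "<div id='my_section_id'><table><tr><td>a</td><td>b</td></tr></table></div>"
    root = parse_html(html)
    # The same kind of XPath as in the workaround works on the parsed tree.
    for cell in root.xpath("//div[@id='my_section_id']//table//td"):
        print(cell.text)
```

The returned element could then be passed to the existing XML library keywords in the same way as the `${html}` object above.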

Would you like me to work on a PR for this?

pekkaklarck commented 2 weeks ago

Sounds good to me, especially if

Based on your example, the first point above is true at least with lxml. Do you know whether the standard ElementTree supports HTML? Parse HTML working only when lxml is available would be OK; I think there's at least one such keyword already.
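On the standard ElementTree question, a quick check suggests it only accepts HTML that also happens to be well-formed XML. This is a minimal sketch using only the standard library; the sample strings and the `parses_as_xml` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# HTML that is also well-formed XML.
well_formed = "<html><body><p>Hello</p></body></html>"
# Typical HTML: <br> and <p> are never closed.
typical_html = "<html><body><p>Hello<br>world</body></html>"


def parses_as_xml(text):
    """Return True if the standard ElementTree parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(parses_as_xml(well_formed))
    print(parses_as_xml(typical_html))
```

The second string is rejected because of the unclosed tags, which is why the workaround relies on lxml's `HTMLParser`.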

I'm slightly worried that this would open the door to future HTML-related requests, such as being able to use CSS selectors. I believe that would require a new dependency, and at that point it would be better to have an external library either only for HTML or for both HTML and XML.

damies13 commented 2 weeks ago

I'm not sure about CSS selectors, as I only used XPath for my test. I see your concern though.

Perhaps it would be enough to put a note in the Parse HTML documentation on which HTML features will and won't be supported?

I'll do some research to find out whether lxml supports CSS selectors and come back on that.

pekkaklarck commented 2 weeks ago

My point was that I don't consider CSS selector support that important in an XML library, and my worry was that people could want to turn it into an HTML library. That said, being able to parse HTML as XML would itself be convenient.

I also noticed that lxml has limited support for CSS selectors. I'm fine with it being exposed, especially if it works without any new dependency. The main benefit I see is working with classes, as something like span.example is very annoying to write properly as an XPath expression. Anyway, that would require a separate issue.
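To illustrate why class matching is awkward in XPath: a class attribute can hold several space-separated names, so a plain `@class='example'` test is not enough. The sketch below builds the commonly used XPath 1.0 equivalent of a `tag.class` selector. `class_xpath` is a hypothetical helper, not part of lxml or the XML library. As far as I know, lxml's own `lxml.cssselect` module relies on the separate `cssselect` package, so that would be worth checking before relying on it.

```python
def class_xpath(tag, css_class):
    """Build an XPath 1.0 expression roughly equivalent to the CSS selector tag.css_class."""
    # Pad the normalized class attribute with spaces so that ' example '
    # matches a whole class name and not a substring such as 'examples'.
    return (
        f"//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        f"' {css_class} ')]"
    )


if __name__ == "__main__":
    print(class_xpath("span", "example"))
```

Note that this expression uses XPath functions, so it needs an engine with full XPath 1.0 support such as lxml.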