jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Feature Request: Option to Disable Tree Validation in HtmlTreeBuilder #600

Closed robliao closed 3 years ago

robliao commented 9 years ago

HtmlTreeBuilder, together with HtmlTreeBuilderState, validates the tree as it is being parsed. For example, it restricts which elements may appear inside a table element, reparenting any others to a foster parent when they are encountered (Source Link).

Is it within the bounds of jsoup to provide an option to parse the HTML as-is, without this reparenting?

I should also note that I would like to keep the semantics of HTML Parsing (e.g. data nodes like script element contents are not parsed). This requirement prevents me from using the XmlTreeBuilder.

jhy commented 9 years ago

What's the use case?

robliao commented 9 years ago

I'm working on making the Polymer platform (Link) workable with the Google Closure Compiler with the Polymer Renamer (Link).

Polymer's template element provides a way to repeat sections of HTML (Link).

<template is="dom-repeat" items="{{employees}}">
  <div># <span>{{index}}</span></div>
  <div>First name: <span>{{item.first}}</span></div>
  <div>Last name: <span>{{item.last}}</span></div>
</template>

This may fall within a table element.

<table>  
  <template is="dom-repeat" items="{{employees}}">
    <tr>
      <td># <span>{{index}}</span></td>
      <td>First name: <span>{{item.first}}</span></td>
      <td>Last name: <span>{{item.last}}</span></td>
    </tr>
  </template>
</table>

If the template falls within a table element, jsoup reparents it to the element containing the table, which breaks the template. We'd like the template element to stay where it is.
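To make the reparenting concrete, here is a toy model of a foster insert. It is a sketch under stated assumptions, not jsoup's implementation: `FosterParentingSketch`, `Node`, `insertIntoTable`, and the `ALLOWED_IN_TABLE` set are all hypothetical names, and the set of permitted table children is deliberately simplified so that a template is treated as misplaced, as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: NOT jsoup's code; every name here is hypothetical.
final class FosterParentingSketch {

    /** Minimal DOM-like node: a tag name, a parent, and ordered children. */
    static final class Node {
        final String tag;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String tag) { this.tag = tag; }

        void append(Node child) {
            child.parent = this;
            children.add(child);
        }

        void insertBefore(Node reference, Node child) {
            child.parent = this;
            children.add(children.indexOf(reference), child);
        }

        @Override
        public String toString() {
            if (children.isEmpty()) return "<" + tag + "/>";
            StringBuilder sb = new StringBuilder("<" + tag + ">");
            for (Node c : children) sb.append(c);
            return sb.append("</" + tag + ">").toString();
        }
    }

    // Simplified assumption: only these tags may sit directly inside <table>.
    static final Set<String> ALLOWED_IN_TABLE =
            Set.of("caption", "colgroup", "thead", "tbody", "tfoot", "tr");

    /** Inserts child under table, or in front of the table when foster inserts are on. */
    static void insertIntoTable(Node table, Node child, boolean fosterInserts) {
        boolean misplaced = !ALLOWED_IN_TABLE.contains(child.tag);
        if (fosterInserts && misplaced && table.parent != null) {
            table.parent.insertBefore(table, child); // reparented to the foster parent
        } else {
            table.append(child); // kept where the author wrote it
        }
    }

    /** Builds body > table, inserts a template, and returns the serialized tree. */
    static String demo(boolean fosterInserts) {
        Node body = new Node("body");
        Node table = new Node("table");
        body.append(table);
        insertIntoTable(table, new Node("template"), fosterInserts);
        return body.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // <body><template/><table/></body>
        System.out.println(demo(false)); // <body><table><template/></table></body>
    }
}
```

With foster inserts on, the template ends up as a sibling in front of the table; with them off, it stays inside the table, which is the behavior being requested here.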

robliao commented 9 years ago

Any update on this? Because we need data node semantics for script and style nodes, there's no available workaround we can use.

jhy commented 9 years ago

I don't have an update. I understand the use case, thanks.

venicia commented 9 years ago

Hi, we also need this feature in our project. We use the Polymer template tag to iterate over data in an array inside a table element. Is there any chance the option to disable validation could be implemented in November?

If the option to disable validation is not feasible soon, could the template element be allowed inside the table?

martijneken commented 8 years ago

+1

Alternatively, you might be able to use the XmlTreeBuilder if you could skip the contents of opaque HTML nodes like script tags.

ericguzman commented 8 years ago

+1 My (admittedly borderline-invalid) table markup is also getting mangled by this parser.

jhy commented 8 years ago

Can't you use the XML parser if you don't want HTML? Per @martijneken's point.

robliao commented 8 years ago

@jhy: Critical to @martijneken's point is the ability to skip opaque HTML tags like script. I also pointed this out in my original post:

I should also note that I would like to keep the semantics of HTML Parsing (e.g. data nodes like script element contents are not parsed). This requirement prevents me from using the XmlTreeBuilder.

Without this, XmlTreeBuilder will simply parse the contents of the elements, which is especially undesirable if HTML tags exist as strings between script tags.
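A small runnable illustration of that problem, with a loud caveat: the sketch below uses the JDK's built-in XML parser as a stand-in, not jsoup's XmlTreeBuilder, and the class and method names are hypothetical. It only shows the general effect of XML parsing rules on script contents: a tag that appears inside a JavaScript string literal becomes a real element in the tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Stand-in demo using the JDK XML parser; NOT jsoup's XmlTreeBuilder.
final class XmlScriptContentsSketch {

    /** Counts <b> elements that an XML parser finds inside the first <script>. */
    static int boldElementsInsideScript(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Element script = (Element) doc.getElementsByTagName("script").item(0);
            return script.getElementsByTagName("b").getLength();
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the demo call sites simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The <b> tag lives inside a JavaScript string literal.
        String markup = "<div><script>var s = \"<b>hi</b>\";</script></div>";
        System.out.println(boldElementsInsideScript(markup)); // 1
    }
}
```

Under HTML parsing rules the script body is a single opaque data node, so the same count would be zero; that difference is why the XML route does not work for this use case.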

jhy commented 8 years ago

Gotcha, thanks. Sorry, should have read the full report again.

gar1t commented 7 years ago

Has anyone in this thread worked around this issue? I'll proceed down the road to subclassing the default HTML tree builder, but if there's a simpler approach, even if hacky, I'm very happy to take the lazy way out!

qqilihq commented 6 years ago

@gar1t Did you or anyone else come up with any solution?

jhy commented 3 years ago

Will close this -- the html tokeniser state and the html tree builder are pretty tightly coupled, due to the nature of the HTML5 spec, so IMV it's not a feasible change -- and I haven't seen PRs for it either.

I think the right solution for the presented use case is to add support for template elements, per the spec.

mrdziuban commented 1 year ago

I'm way late to the party here, but I recently wrote a test to validate some HTML by ensuring no elements were reparented by jsoup. In the process I came up with a solution to this:

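// Declared in jsoup's own package because it relies on package-private members
// (setFosterInserts, Token, HtmlTreeBuilderState).
package org.jsoup.parser;

public class NoFosterInsertsHtmlTreeBuilder extends HtmlTreeBuilder {
  @Override
  protected boolean process(Token token) {
    // Force foster inserts off before the token is dispatched.
    setFosterInserts(false);
    return super.process(token);
  }

  @Override
  boolean process(Token token, HtmlTreeBuilderState state) {
    // Same reset for tokens that are re-processed under an explicit state.
    setFosterInserts(false);
    return super.process(token, state);
  }
}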