radkovo / CSSBox

CSSBox is an (X)HTML/CSS rendering engine written in pure Java. Its primary purpose is to provide complete information about the rendered page in a form suitable for further processing. However, it can also display the rendered document.
http://cssbox.sourceforge.net/
GNU Lesser General Public License v3.0

Text containing &lt; or &gt; is decoded to < or > symbols when parsed #71

Open GovardhanNag opened 2 years ago

GovardhanNag commented 2 years ago

Hi @radkovo,

We are using the CSSBox DOM parser to parse the HTML source. Here is the implementation:

try (DocumentSource docSource = new StreamDocumentSource(
        JAFIOUtils.toInputStream(htmlSource), null, "text/html;charset=UTF-8")) {
    LOGGER.error("Before parse " + htmlSource);
    // Parse the input document
    DOMSource parser = new DefaultDOMSource(docSource);
    Document doc = parser.parse();
    LOGGER.error("After parse " + doc.getFirstChild().getTextContent());
}

For example, suppose the input source (htmlSource) is <style></style>Test User &lt;test.user@test.com&gt;. After parsing, the output is Test User <test.user@test.com>.

Here the text content contains an email address enclosed in &lt; and &gt;, which the parser decodes to < and >. Our requirement, however, is that the parser should not decode &lt; and &gt; to < and >.

How can we retain the text as it is, without any decoding or encoding? @radkovo, could you please suggest a solution for this issue?
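For illustration, the behavior described above can be reproduced without CSSBox. The sketch below is an assumption-laden stand-in, not the CSSBox API: it uses the JDK's own XML DOM parser on a well-formed snippet (the `<p>` wrapper, the class name `EntityRoundTrip`, and the helpers `parseText` and `escapeText` are all hypothetical). It shows that `getTextContent()` returns entity-decoded text, and that one possible workaround is to re-escape the text after reading it, rather than expecting the parser to leave the entities untouched.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; it does not use or depend on CSSBox.
class EntityRoundTrip {

    /** Parses a well-formed snippet and returns the text content of its root element. */
    static String parseText(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        // Text nodes in a DOM tree hold the decoded characters, not the entities.
        return doc.getDocumentElement().getTextContent();
    }

    /** Re-escapes the characters that are significant in HTML text. */
    static String escapeText(String text) {
        // '&' is replaced first so the entities added below are not escaped twice.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        String markup = "<p>Test User &lt;test.user@test.com&gt;</p>";
        String decoded = parseText(markup);
        System.out.println(decoded);             // Test User <test.user@test.com>
        System.out.println(escapeText(decoded)); // Test User &lt;test.user@test.com&gt;
    }
}
```

Whether the same re-escaping step fits the CSSBox pipeline depends on where the text is consumed afterwards; if it is written back into HTML, escaping at output time is the usual place to do it.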

GovardhanNag commented 2 years ago

Hi @radkovo, could you please provide a solution to this issue: https://github.com/radkovo/CSSBox/issues/71#issue-1079810169 Thanks in advance.