oblac / jodd-lagarto

Java HTML parsers suite.
https://lagarto.jodd.org
BSD 2-Clause "Simplified" License

text outputs modified CharSequence #22

Closed RXminuS closed 2 years ago

RXminuS commented 2 years ago

I just noticed that the CharSequence passed to the text handler has its HTML entities unescaped. This makes it impossible to know the exact byte span of the text in the source file. Is there a way to disable this behaviour? I know that in JSoup you can set the document output to ASCII.

igr commented 2 years ago

Hi @RXminuS, thanks for reporting! Please help me make sure I understand correctly:

If the input is:

<html><body>
A&zwnj;B
C&nbsp;D

Jodd decodes the entities and emits:

"\nA\u200CB\nC\u00A0D\n"

So you would like to disable the decoding of HTML entities, so that text blocks are returned exactly as they appear in the input?
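The conversion described in the example above can be imitated with a few lines of plain Java. This is only a sketch: `TinyEntityDecoder`, its `decode` method, and the three-entry entity table are hypothetical names made up for illustration. It is not Lagarto's implementation and covers only a handful of named entities.

```java
import java.util.Map;

public class TinyEntityDecoder {

    // Illustrative table only (assumption): real HTML defines far more named entities.
    static final Map<String, String> ENTITIES = Map.of(
        "zwnj", "\u200C",
        "nbsp", "\u00A0",
        "reg", "\u00AE"
    );

    // Replaces known "&name;" sequences with their character; leaves everything else untouched.
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '&') {
                int end = raw.indexOf(';', i + 1);
                if (end > 0) {
                    String replacement = ENTITIES.get(raw.substring(i + 1, end));
                    if (replacement != null) {
                        out.append(replacement);
                        i = end + 1; // skip past the whole entity in the source
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "\nA&zwnj;B\nC&nbsp;D\n";
        String decoded = decode(raw);
        // The decoded text is shorter than the source text it came from.
        System.out.println(raw.length() + " -> " + decoded.length()); // prints "19 -> 9"
    }
}
```

Once the entities are decoded, a position in the emitted text no longer corresponds to the same position in the source text.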

RXminuS commented 2 years ago

Exactly. In my case it was Hello World&reg; being emitted as Hello World®, which means my character count is off by 4 and I couldn't tell whether it had originally been escaped. This is important for me as I'm trying to figure out the exact byte positions in the source HTML. So I would prefer to get the text exactly as it is, because I can always unescape it myself if I need to. I would have used Unbescape, but Lagarto could also expose this as a utility itself.
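The "off by 4" figure can be checked with a short, self-contained sketch. The class and method names (`EntityLengthDemo`, `charDelta`, `utf8ByteDelta`) are hypothetical, and the byte count assumes the source file is UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;

public class EntityLengthDemo {

    // Difference in UTF-16 code units between the raw and the decoded text.
    static int charDelta(String raw, String decoded) {
        return raw.length() - decoded.length();
    }

    // Difference in bytes, assuming both strings are encoded as UTF-8.
    static int utf8ByteDelta(String raw, String decoded) {
        return raw.getBytes(StandardCharsets.UTF_8).length
             - decoded.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String raw = "Hello World&reg;";        // 16 characters in the source
        String decoded = "Hello World\u00AE";   // 12 characters after decoding
        System.out.println(charDelta(raw, decoded));     // prints 4
        System.out.println(utf8ByteDelta(raw, decoded)); // prints 3
    }
}
```

The character and byte differences are not the same, because the registered sign takes two bytes in UTF-8. Offsets computed from decoded text therefore drift by a different amount depending on whether characters or bytes are counted.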

igr commented 2 years ago

Would you be able to test the snapshot, @RXminuS? I believe I have the patch ready.

RXminuS commented 2 years ago

Did you publish a snapshot to test?

igr commented 2 years ago

@RXminuS ah yes, sorry... The SNAPSHOT (6.0.6) is released: https://lagarto.jodd.org/install

igr commented 2 years ago

final LagartoParserConfig config = new LagartoParserConfig().setDecodeHtmlEntities(false);
final LagartoParser parser2 = new LagartoParser(config, html);

igr commented 2 years ago

Is it working?

igr commented 2 years ago

@RXminuS hey, just wondering if you were able to try it?

RXminuS commented 2 years ago

Sorry, I thought I had posted already. Yes, I have, and it works like a charm <3

igr commented 2 years ago

Released 6.0.6 :)