RXminuS closed this issue 2 years ago
Hi @RXminuS, thanks for reporting! Please help me make sure I understand correctly.
If the input is:
<html><body>
A&zwnj;B
C&nbsp;D
Jodd decodes the entities and emits:
"\nA\u200CB\nC\u00A0D\n"
Do you want to disable the conversion of escaped HTML entities, so that text blocks are returned exactly as they appear in the input?
Exactly, in my case it was Hello World&reg;
emitting Hello World®
which means that my character count is off by 4, and I can't tell whether the character was originally escaped. This matters to me because I'm trying to work out the exact byte positions in the source HTML. So I would prefer to get the text exactly as it is in the source, because I can always unescape it myself if I need to. I would have used Unbescape for that, but Lagarto could also expose its own unescaping as a utility.
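To make the mismatch concrete, here is a minimal sketch using only the JDK. The class name EntityLengthDemo and the helper utf8Length are made up for illustration and are not part of Lagarto. It assumes the source is UTF-8 and that the escaped form was the five-character named entity for the registered sign.

```java
import java.nio.charset.StandardCharsets;

public class EntityLengthDemo {

    // Number of bytes the text occupies when encoded as UTF-8 (assumed source encoding).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String raw = "Hello World&reg;";      // text as written in the source HTML
        String decoded = "Hello World\u00AE"; // same text after entity decoding

        // Character counts differ by 4: the entity is 5 chars, the decoded sign is 1.
        System.out.println(raw.length() - decoded.length());       // prints 4

        // Byte counts differ by 3: the entity is 5 bytes, the decoded sign is 2 bytes in UTF-8.
        System.out.println(utf8Length(raw) - utf8Length(decoded)); // prints 3
    }
}
```

Because the character and byte differences are not even the same, offsets computed from decoded text cannot be mapped back to the source without knowing where each entity was.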
@RXminuS, would you be able to test the snapshot? I believe I have the patch ready.
Did you publish a snapshot to test?
@RXminuS ah yes, sorry... the SNAPSHOT (6.0.6) is released: https://lagarto.jodd.org/install
final LagartoParserConfig config = new LagartoParserConfig().setDecodeHtmlEntities(false);
final LagartoParser parser2 = new LagartoParser(config, html);
Is it working?
@RXminuS hey, just wondering if you were able to try it?
Sorry, I thought I had posted already. Yes, I have, and it works like a charm <3
Released 6.0.6 :)
I just noticed that the CharSequence passed to the text handler has its HTML entities unescaped. However, this makes it impossible to know the exact byte span of the text in the source file. Is there a way to disable this behaviour? I know that in JSoup you can set the document output to ASCII.