libwww-perl / HTML-Parser

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents.

Feature request: Strict mode for decode_entities() #40

Open bschmalhofer opened 10 months ago

bschmalhofer commented 10 months ago

Since 2004, decode_entities() has supported the merging of surrogate pairs. See http://rt.cpan.org/Ticket/Display.html?id=7785 . This means that, for example, two consecutive numeric character references that form a surrogate pair will be decoded into a single code point. My understanding is that this is not covered in any spec.

I therefore propose to add a function decode_entities_strict() that does the same as decode_entities() but rejects surrogate pairs.
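To make the proposal concrete, here is a minimal sketch of what such a function might look like as a wrapper around the existing decode_entities(). It is not the attached script and not part of HTML::Entities: the name has_surrogate_reference(), the choice to die on rejection, and the choice to reject lone surrogates as well as pairs are all assumptions made for illustration. The input `&#xD83D;&#xDE01;` used in the usage note is likewise just an illustrative pair.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Entities): returns true if $text
# contains a numeric character reference whose value falls in the UTF-16
# surrogate range U+D800..U+DFFF.
sub has_surrogate_reference {
    my ($text) = @_;
    while ($text =~ /&#(?:[xX]0*([0-9A-Fa-f]+)|0*([0-9]+))/g) {
        my ($hex, $dec) = ($1, $2);
        my $cp;
        if (defined $hex) {
            # More than 6 significant hex digits is far above U+10FFFF,
            # so it cannot be a surrogate; skipping also avoids hex() overflow.
            next if length($hex) > 6;
            $cp = hex($hex);
        }
        else {
            $cp = $dec;
        }
        return 1 if $cp >= 0xD800 && $cp <= 0xDFFF;
    }
    return 0;
}

# Sketch of the proposed decode_entities_strict(). Assumption: "rejects"
# means the call dies; the issue does not say how rejection should surface.
sub decode_entities_strict {
    my ($text) = @_;
    die "surrogate code point in numeric character reference\n"
        if has_surrogate_reference($text);
    require HTML::Entities;                   # loaded only when needed
    HTML::Entities::decode_entities($text);   # void context: decodes our copy in place
    return $text;
}
```

With this sketch, `decode_entities_strict('&#xD83D;&#xDE01;')` dies instead of merging the pair, while input without surrogate references is passed through to decode_entities() unchanged. A real implementation would more likely live in the XS decoder itself, and might prefer to leave the offending references undecoded rather than die.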

Attached is a sample script that shows the effect: surrogate_pair.pl.txt

haarg commented 10 months ago

I would like it if we could just remove this behavior from decode_entities(), but the impact could be hard to measure.