philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License

Floki mistakes &adxnnl=1; in attributes for character code and crashes #237

Closed umuro closed 4 years ago

umuro commented 4 years ago

Description

Attribute values crash Floki when they contain an & followed later by a ;, for example "&adxnnl=1;". The tag example below is from https://nytimes.com

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Floki should not treat every & as the start of a character reference. Some sites use query strings as attribute values, and query strings contain many & characters.

Workaround

Before calling Floki.parse, replace the offending & with a placeholder:

 Regex.replace(~r/&(?=[[:alnum:]]+=.+;)/, string, "+*+*")

The substitution is easy to revert after parsing, but the extra passes over the input cost CPU time.
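For readers on a Floki version older than the fix, the workaround above could be wrapped roughly as follows. This is only a sketch: the module name AmpersandMask, its functions, and the sample href are hypothetical and not part of Floki, and it assumes the placeholder "+*+*" never occurs in the real input. It uses only the Elixir standard library, so the Floki.parse call itself is left out.

```elixir
defmodule AmpersandMask do
  # Hypothetical helper, not part of Floki.
  # Assumption: the placeholder never appears in the real document.
  @placeholder "+*+*"

  # Matches an & that is followed by "name=...;" later on the same line,
  # i.e. the pattern reported to crash older Floki versions.
  defp pattern, do: ~r/&(?=[[:alnum:]]+=.+;)/

  # Replace each offending & with the placeholder before parsing.
  def mask(html), do: Regex.replace(pattern(), html, @placeholder)

  # Restore the & in text or attribute values extracted after parsing.
  def unmask(text), do: String.replace(text, @placeholder, "&")
end

html = ~s(<a href="/page?foo=1&adxnnl=1;">x</a>)
masked = AmpersandMask.mask(html)
# `masked` would be passed to Floki.parse; values read back from the
# parsed tree would go through AmpersandMask.unmask/1.
restored = AmpersandMask.unmask(masked)
```

The lookahead leaves an & untouched when no = and ; follow it on the same line. Both passes scan the whole input, which is the CPU cost mentioned above.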

philss commented 4 years ago

@umuro Thank you for opening the issue! It was fixed in version 0.23.1. Can you try again with that version?