Closed dzcpy closed 3 years ago
cheeriojs/cheerio#866
This seems like a good change (albeit something cheerio should probably be handling).
I do have a security/reliability concern due to the possibility of unescaped SQL being fed into a database call.
It also could lead people depending on the current behavior to push out unescaped entities e.g. “ instead of &lquot; that don't match a page's encoding.
Semantically, that would push this to 3.0.0 ... which would be a shame. Any suggestions? Maybe adding a warning in 2.3.5 (and removing it after 2.3.6). Also, I'll need to update the docs.
OK, some more research. this was a change os parse5, which cheerio uses.
I'm still worried about sql injection. I think I'm going to suggest an option to keep the current behavior. cc: @matthewmueller
@lathropd thanks for taking a look at this!
I think you're right to keep the current behavior and I'd like to keep it as the default since folks have a bunch of x-ray scripts in the wild.
What do you think about adding support to pass options
from x-ray into cheerio?
I've written up how to use the old parser for cheerio here: https://github.com/cheeriojs/cheerio/issues/1198#issuecomment-522335541
Any updates?
Description
When loading an html string, html entities shouldn't be automatically HTML encoded. For example,
The code above outputs
你好
instead of你好
, which doesn't make sense. This issue makes it hard to fetch the real html code from non-latin web pages.Checklist