matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License
5.86k stars 350 forks source link

Non-latin text should not be HTML encoded automatically with `@html` selector #350

Closed dzcpy closed 3 years ago

dzcpy commented 5 years ago

Description

When loading an html string, html entities shouldn't be automatically HTML encoded. For example,

const Xray = require('x-ray');
const x = Xray();

x('<div>你好</div>', 'div@html')
  .then(res => console.log(res));

The code above outputs &#x4F60;&#x597D; instead of 你好, which doesn't make sense. This issue makes it hard to fetch the real html code from non-latin web pages.

Checklist

dzcpy commented 5 years ago

cheeriojs/cheerio#866

lathropd commented 5 years ago

This seems like a good change (albeit something cheerio should probably be handling).

I do have a security/reliability concern due to the possibility of unescaped SQL being fed into a database call.

It also could lead people depending on the current behavior to push out unescaped entities e.g. “ instead of &lquot; that don't match a page's encoding.

Semantically, that would push this to 3.0.0 ... which would be a shame. Any suggestions? Maybe adding a warning in 2.3.5 (and removing it after 2.3.6). Also, I'll need to update the docs.

lathropd commented 4 years ago

OK, some more research. this was a change os parse5, which cheerio uses.

I'm still worried about sql injection. I think I'm going to suggest an option to keep the current behavior. cc: @matthewmueller

matthewmueller commented 4 years ago

@lathropd thanks for taking a look at this!

I think you're right to keep the current behavior and I'd like to keep it as the default since folks have a bunch of x-ray scripts in the wild.

What do you think about adding support to pass options from x-ray into cheerio?

I've written up how to use the old parser for cheerio here: https://github.com/cheeriojs/cheerio/issues/1198#issuecomment-522335541

dzcpy commented 4 years ago

Any updates?