matthewmueller / x-ray

The next web scraper. See through the <html> noise.
MIT License

Decode XML/HTML entities #144

Open rcdiaz opened 8 years ago

rcdiaz commented 8 years ago

I'm trying to retrieve text that contains accented characters, but they show up as "?". I then tried using iconv to decode it as Windows-1252, but all my accented characters are replaced by ý.

My code is this:

    exports.buscar = function (req, res) {
      var x = Xray();
      x('http://www.google.com', 'div .x7', [{
        descripcion: '.tx',
        precio: '.x11 .pr'
      }])(function (err, obj) {
        var busqueda = JSON.stringify(obj);
        var utf8String = iconv.decode(busqueda, 'Windows-1252');
        res.send(utf8String);
      });
    };

gnujeremie commented 8 years ago

I'm using the entities package to decode my French characters. Maybe this could help you.
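
To illustrate what "decoding entities" means here, below is a minimal, dependency-free sketch. It is not the entities package's code: `decodeEntities` and the tiny `NAMED` table are hypothetical stand-ins that cover only a handful of names. As far as I know the real package exposes a `decodeHTML` function that handles the full named-entity list, so in practice you would call that instead.

```javascript
// Illustration only: a tiny stand-in for what an entity decoder does.
// With the entities package installed, the equivalent call would be roughly:
//   const { decodeHTML } = require('entities');
//   decodeHTML('caf&eacute;');

// Hypothetical, deliberately incomplete table of named entities.
const NAMED = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'",
  eacute: 'é', egrave: 'è', agrave: 'à', ccedil: 'ç', ntilde: 'ñ'
};

function decodeEntities(text) {
  // Matches &name;, &#123; and &#x7B; style references.
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as-is rather than guessed.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

console.log(decodeEntities('caf&eacute; &#233; &#xE9; &amp; &unknown;'));
```

Note that this only helps when the scraped text really contains entity references such as `&eacute;`. If the accents are already broken before this step, the cause is more likely the page's character encoding.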

Kikobeats commented 8 years ago

@gnujeremie awesome! Do you think it would make sense to integrate the package into the library?

gnujeremie commented 8 years ago

@Kikobeats Yes, I think it could be useful, at least for many European people :D

Kikobeats commented 8 years ago

Yeah, I think that is a very typical case and it makes sense to handle it.

@gconnolly do you think that you can convert this into a PR?

gconnolly commented 8 years ago

Sign me up.

gconnolly commented 8 years ago

Hmm... looking into how/where to integrate entities. I have some ideas, but I am struggling to set up a reproducible case. @rcdiaz, could you provide some sample HTML or a URL where you are seeing this issue?
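
In the meantime, here is a rough sketch of one way the reported symptoms might be reproduced without a live URL. It rests on an assumption the thread has not confirmed: that the page is served as Windows-1252 and its bytes are being read as UTF-8, which is a different problem from HTML entities. The sample bytes are made up, and Node's built-in `latin1` encoding is used as a stand-in for Windows-1252 (the two agree for é, byte 0xE9).

```javascript
// Hypothetical sample: the word "café" as raw bytes in a Windows-1252 page.
const bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Reading the raw bytes as UTF-8 loses the accent: a lone 0xE9 is not valid
// UTF-8, so it becomes U+FFFD (the replacement character).
const wrong = bytes.toString('utf8');

// Reading the same raw bytes with a single-byte charset keeps the accent.
const right = bytes.toString('latin1');

// Once the text is already a broken string, re-decoding it cannot bring the
// original byte back, which may be why decoding the JSON string did not help.
const tooLate = Buffer.from(wrong, 'utf8').toString('latin1');

console.log(JSON.stringify({ wrong, right, tooLate }));
```

If this matches what is happening, the charset conversion would have to be applied to the raw response bytes before they become a string, and entity decoding would be a separate step afterwards.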