sselph / scraper

A scraper for EmulationStation written in Go using hashing
MIT License

Non breaking space (&nbsp;) in gamelist.xml #233

Open PhasecoreX opened 5 years ago

PhasecoreX commented 5 years ago

It seems like the scraper is not parsing HTML entity encoded characters properly, at least from screenscraper.fr, and only for some games. For example, I get this in the XML file:

<publisher>Nintendo&amp;nbsp;of&amp;nbsp;America&amp;nbsp;Inc.</publisher>

It looks like the actual ampersand on &nbsp; is being encoded as &amp;, which gives us &amp;nbsp;. In EmulationStation, this all shows up as:

Nintendo&nbsp;of&nbsp;America&nbsp;Inc.

Oddly, it seems to happen only for Game Boy games (that I have noticed). An example is Kirby's Dream Land for Game Boy. The issue isn't present on other systems (for example, Kirby 64 - The Crystal Shards for N64 works just fine with spaces). I'm not sure whether this is a scraper problem or bad data from screenscraper.fr that scraper could potentially clean up.
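To illustrate where the `&amp;nbsp;` can come from, here is a minimal sketch of what Go's `encoding/xml` does with a string that already contains a literal `&nbsp;`. The `game` struct and `marshalGame` helper are hypothetical stand-ins, not the scraper's actual types.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// game is a hypothetical, cut-down stand-in for a gamelist.xml entry.
// It is not the scraper's real type.
type game struct {
	XMLName   xml.Name `xml:"game"`
	Publisher string   `xml:"publisher"`
}

// marshalGame serializes g with encoding/xml, which escapes every
// ampersand in character data as &amp;.
func marshalGame(g game) (string, error) {
	b, err := xml.Marshal(g)
	return string(b), err
}

func main() {
	// The input already contains the six literal characters "&nbsp;".
	out, err := marshalGame(game{Publisher: "Nintendo&nbsp;of&nbsp;America&nbsp;Inc."})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

The XML writer is behaving correctly here: it receives text containing an ampersand and escapes it, so a reader that unescapes once gets back the literal `&nbsp;` that EmulationStation then displays.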

sselph commented 5 years ago

From my side, I think this is probably working as intended. Looking at the data in screenscraper.fr and the way they encode the JSON data (using PHP), I think they are actually sending something like {"publisher": "Nintendo&nbsp;of&nbsp;America&nbsp;Inc."} when it should be UTF-8, something like {"publisher": "Nintendo\u00a0of\u00a0America\u00a0Inc."}. I think either the PHP used to generate the JSON is encoding the nbsp for HTML, or the literal &nbsp; is in the database.
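If the scraper were to clean this up on its side, one possible approach is sketched below. It assumes the JSON really does carry the literal `&nbsp;` as described above; `cleanField` is a hypothetical helper, not something that exists in the scraper, and replacing the non-breaking space with an ordinary space is a design choice rather than a requirement.

```go
package main

import (
	"encoding/json"
	"fmt"
	"html"
	"strings"
)

// cleanField is a hypothetical helper: it decodes HTML entities left in a
// scraped field and turns non-breaking spaces into ordinary spaces.
func cleanField(s string) string {
	s = html.UnescapeString(s) // "&nbsp;" becomes U+00A0
	return strings.ReplaceAll(s, "\u00a0", " ")
}

func main() {
	// Assumed shape of the response, based on the example in this thread.
	raw := []byte(`{"publisher": "Nintendo&nbsp;of&nbsp;America&nbsp;Inc."}`)

	var resp struct {
		Publisher string `json:"publisher"`
	}
	if err := json.Unmarshal(raw, &resp); err != nil {
		panic(err)
	}

	fmt.Printf("%q\n", resp.Publisher)             // entities still present after JSON decoding
	fmt.Printf("%q\n", cleanField(resp.Publisher)) // entities decoded, plain spaces
}
```

JSON decoding leaves `&nbsp;` untouched because it is not a JSON escape, so the entity has to be decoded in a separate step before the value is written to gamelist.xml.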