medialab / sandcrawler

sandcrawler.js - the server-side scraping companion.
http://medialab.github.io/sandcrawler/
GNU Lesser General Public License v3.0

Handle nasty charset polymorphism #178

Open Yomguithereal opened 9 years ago

Yomguithereal commented 9 years ago

@boogheta: do you know this way of indicating the charset?

en_US.iso885915

boogheta commented 9 years ago

Hmm, it's a mix of locale and encoding, I guess?

Yomguithereal commented 9 years ago

I guess so, but I didn't even know this was standard.

eric-brechemier commented 9 years ago

This format is described in POSIX Base Definitions, 8.2 Internationalization Variables:

If the locale value has the form:

language[_territory][.codeset]

it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.

LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME are defined to accept an additional field @ modifier, which allows the user to select a specific instance of localization data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as:

[language[_territory][.codeset][@modifier]]

Yomguithereal commented 9 years ago

Thanks @eric-brechemier. I guess we'll have to fix the parser's implementation to make this work.
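
For illustration, the POSIX form quoted above, language[_territory][.codeset][@modifier], could be handled roughly as follows. This is only a minimal sketch and not sandcrawler's actual parser: the names `parsePosixLocale` and `extractCharset` are hypothetical, and the character classes allowed in each field are an assumption, since POSIX leaves those settings implementation-defined.

```javascript
// Hypothetical helpers, not part of sandcrawler's API.
// The regex follows the POSIX form language[_territory][.codeset][@modifier].
// Assumption: fields are limited to ASCII letters, digits, "-" and "_".
var LOCALE_RE = /^([A-Za-z]+)(?:_([A-Za-z0-9]+))?(?:\.([A-Za-z0-9_-]+))?(?:@([A-Za-z0-9_-]+))?$/;

// Split a locale string into its fields; returns null if it does not match.
function parsePosixLocale(value) {
  var m = LOCALE_RE.exec(String(value).trim());
  if (!m) return null;
  return {
    language: m[1],
    territory: m[2] || null,
    codeset: m[3] || null,
    modifier: m[4] || null
  };
}

// Assumption: a charset value is either a plain label ("utf-8") or a
// POSIX locale carrying a codeset ("en_US.iso885915").
function extractCharset(value) {
  var parsed = parsePosixLocale(value);
  if (parsed && parsed.codeset) return parsed.codeset.toLowerCase();
  // Anything else is passed through as a plain, lowercased label.
  return String(value).trim().toLowerCase();
}
```

With this sketch, `extractCharset('en_US.iso885915')` yields `'iso885915'`, while a plain label such as `'UTF-8'` is passed through as `'utf-8'`. Mapping `iso885915` to a canonical name such as `ISO-8859-15` would still need a separate alias table.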