Status: Closed (closed by simmhan 11 years ago)
GML explicitly does not support UTF8 encoding. Anything that cannot be encoded in ISO-8859-1 should be translated to an HTML character escape. Using this method, GML does support most (possibly all) of the Unicode space.
Hmm, this is going to be a problem if we’re working with Twitter data that has a lot of UTF-8 characters. Even Excel was having problems with it.
Apart from the GML specs (which we’re following loosely), is there anything else in our Java data loading/access pipeline that would break because of UTF8? I’m wondering if we can step around it in GML.
Yogesh Simmhan | simmhan@usc.edu | http://ceng.usc.edu/~simmhan | skype: simmhan | cell: +1 (540) 449 4770
From: Soonil Nagarkar [mailto:notifications@github.com]
Sent: Monday, July 15, 2013 12:04 PM
To: usc-cloud/goffish
Cc: Yogesh Simmhan
Subject: Re: [goffish] Verify support for UTF8 characters in GoFS (#74)
ISO-8859-1-safe output can be generated from UTF-8 (or any other Unicode encoding) using something along the lines of "StringEscapeUtils.ESCAPE_HTML4.with(NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));" from Apache Commons Lang 3. Java strings are already UTF-16 internally, so I don't foresee any problems there.
The above assumes that you want to escape both HTML4 entities and code points above ISO-8859-1. If you don't want the HTML4 escaping, you can leave that part out.
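For anyone who wants to see what the numeric-escape part does without pulling in Commons Lang, here is a rough stdlib-only sketch. The class and method names (GmlEscapeSketch, escapeAboveLatin1) are made up for illustration and are not part of GoFS or Commons Lang. It assumes we only want to escape code points above 0xFF and leave everything ISO-8859-1 can represent untouched; it does not do any HTML4 named-entity escaping.

```java
final class GmlEscapeSketch {

    // Hypothetical helper: replaces every code point above ISO-8859-1 (0xFF)
    // with a decimal numeric character reference such as &#287;.
    static String escapeAboveLatin1(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt handles surrogate pairs, so characters outside
            // the BMP become a single escape rather than two.
            int cp = input.codePointAt(i);
            if (cp > 0xFF) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 (e with acute) fits in ISO-8859-1 and is kept as-is;
        // \u011f (g with breve) does not and is escaped.
        System.out.println(escapeAboveLatin1("caf\u00e9 \u011f"));
    }
}
```

Note that the Commons Lang snippet quoted above starts its range at 0x7f, so it would also escape the upper half of ISO-8859-1; the sketch starts above 0xFF instead.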
Some of the Twitter datasets we will be working with have international characters, e.g. Turkish, Arabic, etc. We need to verify that our GoFS stack supports the UTF-8 character set end to end.
For now, change the GML format to support UTF-8 directly rather than escaping non-ISO-8859-1 characters.
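If we go the direct UTF-8 route, the main thing to watch on the Java side is that every read and write names the charset explicitly instead of relying on the platform default. Below is a minimal round-trip check, assuming a GML-like fragment is written to and read back from a temp file. The class name Utf8RoundTripSketch and the sample fragment are hypothetical, and this does not touch the actual GoFS reader/writer code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8RoundTripSketch {

    // Writes the text to a temp file as UTF-8 and reads it back as UTF-8.
    // The charset is passed explicitly on both sides.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("gml-utf8-", ".gml");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample label with a Turkish dotted capital I (\u0130) and an Arabic word.
        String gml = "node [ id 1 label \"\u0130stanbul \u0645\u0631\u062D\u0628\u0627\" ]";
        System.out.println(gml.equals(roundTrip(gml)) ? "round trip OK" : "round trip FAILED");
    }
}
```

The same kind of check could be repeated at each stage of the pipeline (load, store, access) to find where, if anywhere, a default charset sneaks in.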