usc-cloud / goffish

USC GoFFish Graph Analytics Framework

Verify support for UTF8 characters in GoFS #74

Closed by simmhan 11 years ago

simmhan commented 11 years ago

Some of the Twitter datasets we will be working with contain international characters, e.g. Turkish, Arabic, etc. We need to verify that our GoFS stack supports the UTF8 character set end to end.

For now, change GML format to support UTF8 rather than escaping UTF8.

sooniln commented 11 years ago

GML explicitly does not support UTF8 encoding. Anything that cannot be encoded in ISO-8859-1 should be translated to an HTML character escape. Using this method, GML does support most (possibly all) of the Unicode space.

simmhan commented 11 years ago

Hmmm, this is going to be a problem if we’re working with Twitter data that has a lot of UTF8 characters. Even Excel was having problems with it.

Apart from the GML specs (which we’re following loosely), is there anything else in our Java data loading/access pipeline that would break because of UTF8? I’m wondering if we can step around it in GML.
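
One quick way to probe the pipeline question is a round-trip check: encode a sample string to bytes with a given charset, decode it again, and compare. The sketch below uses only the JDK; the class name `CharsetRoundTrip`, the helper `survives`, and the sample text are made up for illustration and are not part of GoFS.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of GoFS: checks whether text survives
// an encode/decode round trip through a given charset.
final class CharsetRoundTrip {
    // Encodes the text to bytes with the charset, decodes it back,
    // and reports whether the result equals the original text.
    public static boolean survives(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset).equals(text);
    }

    public static void main(String[] args) {
        // Sample with a Turkish dotted capital I and an Arabic word.
        String sample = "\u0130stanbul \u0645\u0631\u062D\u0628\u0627";
        System.out.println("UTF-8: " + survives(sample, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + survives(sample, StandardCharsets.ISO_8859_1));
    }
}
```

Any reader or writer in the loading path that is opened without an explicit charset falls back to the platform default, so that is the kind of place where a check like this would likely be worth running.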

(Emailed in reply to Soonil Nagarkar's comment above, sent Monday, July 15, 2013, 12:04 PM: https://github.com/usc-cloud/goffish/issues/74#issuecomment-20995202)

sooniln commented 11 years ago

ISO-8859-1 can be generated from UTF8 (or UTFwhatever) using something along the lines of "StringEscapeUtils.ESCAPE_HTML4.with(NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));" Java strings are already UTF16 internally, so I don't foresee any problems there.

sooniln commented 11 years ago

The above assumes that you want to escape both HTML4 entities and code points above ISO-8859-1. If you didn't want the HTML4 escaping, you could leave that out.
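
For anyone who wants to see the numeric-escape idea without pulling in Commons Lang, here is a minimal JDK-only sketch. It is a stand-in written for illustration, not the library's implementation; the class name `NumericEntityEscape` and the method `escapeFrom` are hypothetical. It replaces every code point at or above a threshold with a decimal numeric character reference and deliberately leaves out the HTML4 part, so characters such as `&` and `<` pass through untouched.

```java
// Hypothetical JDK-only stand-in for a numeric entity escaper.
final class NumericEntityEscape {
    // Replaces every code point at or above `threshold` with a decimal
    // numeric character reference; everything below it is copied as-is.
    public static String escapeFrom(int threshold, String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() yields full code points, so surrogate pairs
        // become a single reference rather than two.
        text.codePoints().forEach(cp -> {
            if (cp >= threshold) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "caf\u00e9 \u0130stanbul";
        // Threshold 0x7f: everything past printable ASCII is escaped.
        System.out.println(escapeFrom(0x7f, sample));  // caf&#233; &#304;stanbul
        // Threshold 0x100: ISO-8859-1 characters are kept as-is.
        System.out.println(escapeFrom(0x100, sample));
    }
}
```

The threshold is the design choice here: 0x7f (as in the snippet quoted above) produces pure ASCII output, while 0x100 is the first code point outside ISO-8859-1 and would keep characters like é unescaped.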