Unicode fallbacks for Ruby 1.8

cpence commented 12 years ago

UnicodeUtils doesn't build on Ruby 1.8. This change uses ActiveSupport's upcase and downcase methods where available, and provides fallbacks otherwise.

Same sort of thing we were just hacking on over at latex-decode. This should get citeproc-ruby up and running on Ruby 1.8 -- unicode_utils won't even install, which keeps citeproc-ruby from installing.

inukshuk commented 12 years ago

Does this also work for non-latin characters (e.g., Greek or Cyrillic)? The citeproc test suite contains numerous test cases using non-latin alphabets, so it's really important for up/down casing to work for those. If they do work, that would be awesome because IIRC this was one of the biggest obstacles for 1.8 support.

Let me add that the current citeproc-ruby implementation is in the middle of a major rewrite. If you're interested, there's already a functional version which uses the JavaScript engine and should work on 1.8 (in fact, it's easier to get it to run on 1.8 at the moment). Going forward, the citeproc-ruby gem will become a native drop-in replacement for the JavaScript engine.

cpence commented 12 years ago

The ActiveSupport normalizer (implementation details here) claims to support all of Unicode 6.0. It's loading its tables from unicode_tables.dat here, which is a data file generated by this script, which does seem to be getting the proper UnicodeData.txt from here (on unicode.org).

Long and short of it, yes, ActiveSupport seems to support Everything. Of course, if you're getting the fallbacks, all bets are off. (If you'd like, the implementation could be made even more robust by adding the Unicode gem as a fallback, which does run on 1.8, and the Java to_upper/lower_case methods. Let me see if I can do that and add a few commits to this pull request this afternoon.)
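For readers following along, the fallback chain described above might look roughly like the sketch below. This is not the code from the pull request: the module name CaseFallback and the BACKEND constant are made up for illustration, and the JRuby branch is left out. It assumes ActiveSupport::Multibyte::Chars and the Unicode gem's Unicode.upcase / Unicode.downcase are available under those names when the respective gems are installed.

```ruby
# Hypothetical sketch of a case-conversion fallback chain (names are illustrative).
module CaseFallback
  # Pick the best backend that can actually be loaded.
  BACKEND =
    begin
      require 'active_support'
      require 'active_support/multibyte/chars'
      :active_support
    rescue LoadError
      begin
        require 'unicode' # the "unicode" gem, which also installs on 1.8
        :unicode_gem
      rescue LoadError
        :plain # last resort: ASCII-only on Ruby 1.8
      end
    end

  def self.upcase(string)
    case BACKEND
    when :active_support then ActiveSupport::Multibyte::Chars.new(string).upcase.to_s
    when :unicode_gem    then Unicode.upcase(string)
    else string.upcase
    end
  end

  def self.downcase(string)
    case BACKEND
    when :active_support then ActiveSupport::Multibyte::Chars.new(string).downcase.to_s
    when :unicode_gem    then Unicode.downcase(string)
    else string.downcase
    end
  end
end
```

With the :plain backend on Ruby 1.8, only ASCII letters change case, which is the "all bets are off" situation mentioned above.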

I've seen the rewrite on the horizon, but am pulling for citeproc-ruby, since I'd rather not spin up a JS runtime on my production machines. (I'm doing bibliographic formatting on the fly for a web application.)

inukshuk commented 12 years ago

That sounds great. It may even be worth adding ActiveSupport as a dependency outright to get rid of the 1.8 Unicode woes.

I only mentioned the JS engine as a caveat in case you spend time working on the current branch, since that branch is pretty much unmaintained. If you just want to work around a few simple problems, let me know when you're ready and I'll push a gem release for you. If you're thinking of something more complex or are missing features, we should definitely talk to see how your work could help expedite the rewrite.

In any case, your input is much appreciated. Cheers!

----- Reply message ----- From: "Charles Pence" reply@reply.github.com Date: Tue, Jan 31, 2012 9:59 pm Subject: [citeproc-ruby] Unicode fallbacks for Ruby 1.8 (#6) To: "Sylvester Keil" sk@semicolon.at


cpence commented 12 years ago

I'm certainly not missing features -- the current version does exactly what I need it to do (format my very specific, and very minimal, CSL snippets into bibliographic entries). Optimal would be to get the current version running "well enough" under Ruby 1.8, as the lack of citation formatting is the only big missing feature in my app when running under 1.8.

I'll bash on it for a little while and see if I can get anything useful together. If not, I'll certainly follow your advice and switch over to the JS edition. I'll ping this pull request again if I have a full patch for 1.8-1.9 compatibility together. (Maybe at least that would be useful for later versions -- I've already fixed, for example, the widespread use of #id and #type, which clash with methods that Object already defines under 1.8.)

cpence commented 12 years ago

Okay, with this set of changes, things run on Ruby 1.8 nearly as well as they do on Ruby 1.9. There are some funny issues that (I think) can be traced to the fact that Hash is unordered on 1.8, while it preserves insertion order on 1.9 -- a few dozen new rspec failures have things in a different order than expected. There's also some silliness with DatePart on 1.8 that I can't be bothered to track down. But this code does fairly well -- only 153 spec failures, which is around 25 more than on 1.9.
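One way to keep output independent of Hash iteration order is to walk the keys in sorted order. The helper below is only a sketch; sorted_pairs is a made-up name and is not part of citeproc-ruby.

```ruby
# Hypothetical helper: return [key, value] pairs in a stable order,
# regardless of how the Hash happens to iterate (Ruby 1.8 vs 1.9).
def sorted_pairs(hash)
  # Sort by the string form so Symbol and String keys can be mixed.
  hash.keys.sort_by { |key| key.to_s }.map { |key| [key, hash[key]] }
end

pairs = sorted_pairs(:volume => '12', :author => 'Doe', :issued => '2012')
```

Here pairs starts with the :author entry on both Ruby versions, because the order comes from the sort rather than from the Hash.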

Of all of what I've done, here's what I think would be most important for Ruby 1.8 compatibility on your rewrite:

- Make sure to default $KCODE to "UTF-8".
- You can't use "\uXXXX". The most portable way I've found to specify Unicode string literals is with a UTF-8 sequence of bytes ("\xNN\xNN\xNN") followed by str.force_encoding("UTF-8") on Ruby 1.9. There may be a better way to do this that I don't know, but the \u escapes definitely have to be changed.
- The spaceship operator (<=>) isn't defined on nil in Ruby 1.8, for no good reason.
- If you're going to define #id and #type on your own classes, undefine them out of Object on Ruby 1.8. (Object#id and Object#type were deprecated in favor of #object_id and #class, and were removed in Ruby 1.9.)
- And, of course, the Unicode upcase/downcase conversion fallback support should work really well.
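The first three points in the list above could be sketched as follows. The helper names utf8 and nil_safe_compare are invented for illustration and do not come from the patch; sorting nil before everything else is an assumption, not something the thread specifies.

```ruby
# Only Ruby 1.8 consults $KCODE, so set it there and nowhere else.
$KCODE = 'UTF-8' if RUBY_VERSION < '1.9'

# Hypothetical helper: tag raw UTF-8 bytes as UTF-8 where encodings exist (1.9+).
# On 1.8 strings carry no encoding, so the bytes are returned as they are.
def utf8(bytes)
  str = bytes.dup
  str.respond_to?(:force_encoding) ? str.force_encoding('UTF-8') : str
end

# U+00E9 (e with acute accent) written as its two UTF-8 bytes, not "\u00e9".
E_ACUTE = utf8("\xC3\xA9")

# Hypothetical helper: <=> that tolerates nil on either side.
# Assumption: nil sorts before any other value.
def nil_safe_compare(a, b)
  return 0 if a.nil? && b.nil?
  return -1 if a.nil?
  return 1 if b.nil?
  a <=> b
end

ordered = [3, nil, 1].sort { |a, b| nil_safe_compare(a, b) }
```

After this runs, ordered is [nil, 1, 3], and E_ACUTE holds the same two bytes on both Ruby versions.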

Bah. Everybody should just upgrade! =)

But since they haven't, I'd really appreciate a new gem version when you get the chance. This will make CSL-formatted citations work under Ruby 1.8 for the vast majority of use-cases.

inukshuk commented 12 years ago

About \uXXXX strings: you might have noticed that we're working around this in latex-decode by converting strings to Unicode like this (1.8 only):

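# Replace each "\uXXXX" (or bare "uXXXX", which is what Ruby 1.8 leaves
# behind after parsing the unknown \u escape) with the UTF-8 character.
def self.to_unicode(string)
  string.gsub(/\\?u([\da-f]{4})/i) { [$1.to_i(16)].pack('U') }
end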

Obviously, this only makes sense if you know exactly which strings contain the codes, as you probably wouldn't want to push every string through the method.
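To illustrate that caveat, here is the same conversion wrapped in a made-up module (Escapes is a hypothetical name, not the latex-decode API), applied to one string that really contains an escape and one ordinary word that merely looks like one.

```ruby
# Hypothetical wrapper around the conversion shown above.
module Escapes
  def self.to_unicode(string)
    string.gsub(/\\?u([\da-f]{4})/i) { [$1.to_i(16)].pack('U') }
  end
end

# Single quotes keep the backslash, so this string holds a literal \u00e9.
intended = Escapes.to_unicode('caf\u00e9')

# An ordinary word: "u" followed by four hex digits is also treated as an escape.
accident = Escapes.to_unicode('ubeef')
```

The first call yields the four characters c, a, f and U+00E9. The second collapses the whole word into the single character U+BEEF, which is why only strings known to contain escapes should be passed through.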

inukshuk commented 12 years ago

Landed in 0.0.4
