rtomayko / rocco

Rocco is Docco in Ruby
MIT License

Unit test for ISO-8859-1 breaks under Ruby 1.9.3 #73

Open lambda opened 12 years ago

lambda commented 12 years ago

When I run the unit tests under Ruby 1.9.3, test_issue10_utf8_processing fails:

  2) Error:
test_issue10_utf8_processing(RoccoIssueTests):
ArgumentError: invalid byte sequence in UTF-8
    /Users/lambda/src/rocco/lib/rocco.rb:235:in `split'
    /Users/lambda/src/rocco/lib/rocco.rb:235:in `parse'
    /Users/lambda/src/rocco/lib/rocco.rb:134:in `initialize'
    /Users/lambda/src/rocco/test/test_reported_issues.rb:31:in `new'
    /Users/lambda/src/rocco/test/test_reported_issues.rb:31:in `test_issue10_utf8_processing'

This is because Ruby 1.9 is much stricter about encodings. Instead of just reading bytes into a string and hoping you know how to interpret them, it keeps track of the encoding it expects the file and the string to be in, and transcodes between them. This means that we can no longer just read in the ISO-8859-1 file as binary and hope that everything later on ignores it or knows how to cope. Instead, we need to explicitly choose to read the file as ISO-8859-1, and explicitly choose the encoding that we want the string to be in (probably UTF-8, as that's what everyone else will be expecting).
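
A minimal sketch of that explicit choice, assuming Ruby 1.9 or later. The helper name read_as_utf8 and the temporary fixture are hypothetical and are not part of Rocco; the only library behaviour relied on is the "external:internal" form of the :encoding option to File.read.

```ruby
require "tempfile"

# Hypothetical helper (not part of Rocco): read `path` as `source_encoding`
# and return its contents transcoded to UTF-8.
def read_as_utf8(path, source_encoding = "ISO-8859-1")
  # "EXTERNAL:INTERNAL" names the encoding of the bytes on disk and the
  # encoding Ruby should transcode them into while reading.
  File.read(path, encoding: "#{source_encoding}:UTF-8")
end

# Build a small ISO-8859-1 fixture: byte 0xE9 is "é" in that encoding.
fixture = Tempfile.new("latin1")
fixture.binmode
fixture.write("# caf\xE9\n".b)
fixture.close

text = read_as_utf8(fixture.path)
puts text.encoding          # the encoding label of the returned string
puts text.valid_encoding?   # whether the bytes are valid for that label
puts text.split("\n").inspect
```

With the string already in valid UTF-8 by the time it reaches split, the later parsing code should not need to know what encoding the file was stored in.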

yumitsu commented 12 years ago

That's why you need to use a # encoding: utf-8 comment at the top of your file to explicitly force the encoding. With this comment at the top of fixtures/issue10.utf-8.rb, the test passes fine with a modified assertion check:

test_issue10_utf8_processing(RoccoIssueTests) [/Users/yumitsu/Work/github/rocco/test/test_reported_issues.rb:26]:
UTF-8 input files ought behave correctly..
<"<p>hello ąćęłńóśźż</p>\n"> expected but was
<"<p>encoding: utf-8\nhello ąćęłńóśźż</p>\n">.

Are you sure we cannot avoid this error with an encoding comment? I'm just not sure about redundant checks.

lambda commented 12 years ago

@yumitsu That's not the assertion which was failing. It was the ISO-8859-1 test below that was failing; the UTF-8 test worked, because the default external encoding is UTF-8, but when it tried to read the file encoded in ISO-8859-1 assuming that it was UTF-8, it failed due to an invalid byte sequence.
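
A small illustration of that failure mode, not taken from the Rocco test suite: the same bytes are invalid when labelled UTF-8 and valid when labelled ISO-8859-1. The sample string is an assumption chosen for brevity.

```ruby
# The same bytes under two encoding labels. Byte 0xE9 is "é" in
# ISO-8859-1, but on its own it is an incomplete sequence in UTF-8.
latin1_bytes = "caf\xE9".b

as_utf8   = latin1_bytes.dup.force_encoding("UTF-8")
as_latin1 = latin1_bytes.dup.force_encoding("ISO-8859-1")

puts as_utf8.valid_encoding?    # bytes checked against the UTF-8 label
puts as_latin1.valid_encoding?  # bytes checked against the ISO-8859-1 label

# Transcoding from the correct label yields a well-formed UTF-8 string.
transcoded = as_latin1.encode("UTF-8")
puts transcoded.valid_encoding?
```

String operations that need to interpret characters can raise on a string like as_utf8, which matches the "invalid byte sequence in UTF-8" error in the trace above.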

Encoding comments would be another way to specify this, rather than passing in a new option. However, we would need to add code to Rocco to parse those comments (and to support the equivalent comments in other languages, if they exist); they aren't automatically parsed when you read the file using File::read.

yumitsu commented 12 years ago

@lambda Yes, you're right. My bad.