Fix forward error: invalid byte sequence in UTF-8

sonots commented 11 years ago

Hi, @tamogoris -san.

I met with an error when fluent-plugin-parser received kinds of malformed unicode character.

2013-02-14 14:54:04 +0900: forward error: invalid byte sequence in UTF-8
  2013-02-14 14:54:04 +0900: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-parser-0.2.1/lib/fluent/plugin/fixed_parser.rb:21:in `match'
  2013-02-14 14:54:04 +0900: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-parser-0.2.1/lib/fluent/plugin/fixed_parser.rb:21:in `call'
  2013-02-14 14:54:04 +0900: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-parser-0.2.1/lib/fluent/plugin/fixed_parser.rb:235:in `parse'
  2013-02-14 14:54:04 +0900: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-parser-0.2.1/lib/fluent/plugin/out_parser.rb:75:in `block in emit'
  2013-02-14 14:54:04 +0900: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.31/lib/fluent/event.rb:127:in `feed_each'
  2013-02-14 14:54:04 +0900: /usr/lib/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.31/lib/fluent/event.rb:127:in `each'

One of our applications outputted an error log like malformed or illegal unicode character in string [? 遥《K], cannot convert to JSON, so, right, we can not convert the character to JSON.

This patch is the simplest way to fix the error. What do you think?

tagomoris commented 11 years ago

String#encode does nothing with same encoding for 1st and 2nd argument (even if :invalid option specified). You should write code like below:

  text.encode('UTF-8', 'UTF-16BE', :invalid => ...).encode('UTF-16BE', 'UTF-8')

And, no tests here (we cannot find what you fix).

sonots commented 11 years ago

@tagomoris Hi, I added test codes, and it is working! You can verify it with my test codes.

My ruby version is 1.9.3p286

tagomoris commented 11 years ago

You should add this asserts under assert_nothing_raised:

    emits = d.emits
    assert_equal 1, emits.length
    assert_equal invalid_utf8, emits[0][2]['message']

This code will success, and invalid utf8 bytes are reserved in field 'message', instead of '?' (what we expected).

sonots commented 11 years ago

@tagomoris done!

repeatedly commented 11 years ago

FYI, I tested following code.

https://gist.github.com/repeatedly/4961986

And I asked Ruby committer about this problem. He said "This is a bug. Probably, Ruby 2.0 or new 1.9 patch level will fix this problem. So latter case will throw an Exception". You can't use String#encode with same encoding for normalization.

tagomoris commented 11 years ago

@repeatedly :+1:

Thanks for your information, and then, i cannot merge this request.

sonots commented 11 years ago

Understand. So, now, how should we do? I actually have another branch which replaces the invalid utf8 character with ? here https://github.com/sonots/fluent-plugin-parser/commit/bf5f643ced2b165abcae82be8b68ee7b1e87b3c3 Do you like this?

sonots commented 11 years ago

Or, do you like we do the replacing just before Regexp#match with dup, so that we can remain the invalid byte codes for following fluentd nodes? Let me clarify. Then, I will write another patch and send a pull request.

tagomoris commented 11 years ago

In popular systems, you know, almost all of input strings are valid utf-8, and very small number of data has invalid utf-8 bytes. For these cases, we should write patch like this:

t, value = nil, nil
if value
  begin
    t, value = @parser.parse(value)
  rescue SomeError
    encodings = value.encoding
    replaced_value = value.encode(some_different_encoding, encodings, replace_options).encode(encodings)
    t, value = @parser.parse(replaced_value)
  end
end

You should consider about non-utf-8 byes and original data reservation for reserve_data option.

I'm waiting for another pullreq!

sonots commented 11 years ago

@tagomoris I sent another pull request #3. Could you check it out?

tagomoris / fluent-plugin-parser

Fix forward error: invalid byte sequence in UTF-8 #2