Closed sonots closed 11 years ago
String#encode
does nothing with same encoding for 1st and 2nd argument (even if :invalid
option specified).
You should write code like below:
text.encode('UTF-8', 'UTF-16BE', :invalid => ...).encode('UTF-16BE', 'UTF-8')
And, no tests here (we cannot find what you fix).
@tagomoris Hi, I added test codes, and it is working! You can verify it with my test codes.
My ruby version is 1.9.3p286
You should add this asserts under assert_nothing_raised
:
emits = d.emits
assert_equal 1, emits.length
assert_equal invalid_utf8, emits[0][2]['message']
This code will success, and invalid utf8 bytes are reserved in field 'message', instead of '?' (what we expected).
@tagomoris done!
FYI, I tested following code.
https://gist.github.com/repeatedly/4961986
And I asked Ruby committer about this problem. He said "This is a bug. Probably, Ruby 2.0 or new 1.9 patch level will fix this problem. So latter case will throw an Exception". You can't use String#encode with same encoding for normalization.
@repeatedly :+1:
Thanks for your information, and then, i cannot merge this request.
Understand. So, now, how should we do? I actually have another branch which replaces the invalid utf8 character with ? here https://github.com/sonots/fluent-plugin-parser/commit/bf5f643ced2b165abcae82be8b68ee7b1e87b3c3 Do you like this?
Or, do you like we do the replacing just before Regexp#match with dup, so that we can remain the invalid byte codes for following fluentd nodes? Let me clarify. Then, I will write another patch and send a pull request.
In popular systems, you know, almost all of input strings are valid utf-8, and very small number of data has invalid utf-8 bytes. For these cases, we should write patch like this:
t, value = nil, nil
if value
begin
t, value = @parser.parse(value)
rescue SomeError
encodings = value.encoding
replaced_value = value.encode(some_different_encoding, encodings, replace_options).encode(encodings)
t, value = @parser.parse(replaced_value)
end
end
You should consider about non-utf-8 byes and original data reservation for reserve_data option.
I'm waiting for another pullreq!
@tagomoris I sent another pull request #3. Could you check it out?
Hi, @tamogoris -san.
I met with an error when fluent-plugin-parser received kinds of malformed unicode character.
One of our applications outputted an error log like
malformed or illegal unicode character in string [? 遥《K], cannot convert to JSON
, so, right, we can not convert the character to JSON.This patch is the simplest way to fix the error. What do you think?