logstash-plugins / logstash-codec-avro

A logstash codec plugin for decoding and encoding Avro records
Apache License 2.0

#5 #12 Binary Codec #24

Closed original-brownbear closed 7 years ago

original-brownbear commented 7 years ago

Fixes #5 and #12

Long story short, the problem with using the Avro codec with Kafka is that the codec's interface uses String as both the input and the output type.

The Ruby implementation of Avro is set up to work with whatever characters Ruby strings may contain, but once these strings pass through Kafka's internal Java/Scala calls:

String.getBytes("UTF-8")
new String(someByteArray, "UTF-8")

Bytes may be stripped or replaced, since whatever the Ruby Avro library produces isn't necessarily valid UTF-8.
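
To make the failure mode concrete, here is a minimal, purely illustrative Ruby snippet (not taken from the plugin) showing how a binary Avro datum gets corrupted by the kind of UTF-8 round trip that a String-typed interface implies:

raw = [0x02, 0x80, 0xff, 0x10].pack("C*")   # arbitrary Avro-style binary, not valid UTF-8
as_utf8 = raw.dup.force_encoding("UTF-8")   # what a String-typed interface effectively assumes
scrubbed = as_utf8.scrub                    # malformed bytes become U+FFFD, the originals are gone
raw.bytes == scrubbed.bytes                 # => false, the datum can no longer be decoded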

=> Fixed by wrapping it all in base64 ... not pretty, but compatible with Strings in Kafka. => I suggest moving this to Java next, as discussed elsewhere; that should more than make up for the performance hit we take here.
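
For reference, the encode side of the fix amounts to something like the following sketch (illustrative only, not the plugin's actual code): serialize with Avro as usual, then Base64-wrap the binary output so it is plain ASCII and survives Kafka's String/byte[] conversions.

require "avro"
require "base64"
require "stringio"

# Illustrative sketch: Avro binary encoding followed by a Base64 wrap.
def encode_datum(datum, schema)
  buffer = StringIO.new
  writer = Avro::IO::DatumWriter.new(schema)
  writer.write(datum, Avro::IO::BinaryEncoder.new(buffer))
  Base64.strict_encode64(buffer.string)   # ASCII-only, safe to pass around as a UTF-8 String
end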

suyograo commented 7 years ago

@original-brownbear I like this approach, but I wonder how we can handle existing data which has not been base64 encoded. One possibility is to make it a major version bump and then ask users to drain their Kafka queue before upgrading. Another possibility is to detect if the data is plain string or base64 encoded and have a separate path for them until new data is written.

Thoughts?

original-brownbear commented 7 years ago

@suyograo

Let's see :)

One possibility is to make it a major version bump and then ask users to drain their Kafka queue before upgrading.

That's very easy for us, but it may be incredibly painful for users in a production system. Imagine having to basically stop all producers, drain the queue, and have your system be unusable until it's drained. I mean, it's somewhat straightforward for some, but I feel like this is just asking for support trouble, right?

Another possibility is to detect if the data is plain string or base64 encoded and have a separate path for them until new data is written.

I like this idea better (though I admit it's dirty), and it's in fact fairly trivial to recognize whether something is an Avro packet or base64 encoded. I'd have to look it up, but I think this is in the first 4 bytes or so already, so we'd be somewhat transparent here. So I guess we could simply make a minor version bump, have it recognize what kind of encoding the data is in, and deprecate that path eventually (which could be somewhat further in the future, since this is trivial and barely impacts performance ... likely not even noticeable).

=> I'm for staying backwards compatible here :)
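
For illustration only (the merged change ended up using a rescue-based fallback instead, discussed further down), the detection idea could be sketched roughly like this: Base64 text uses a restricted alphabet, so any payload containing bytes outside it must be raw Avro binary. Note this is only a heuristic, since a raw payload can in principle still happen to look like valid Base64.

# Heuristic sketch, not the plugin's actual code.
BASE64_RE = /\A[A-Za-z0-9+\/]*={0,2}\z/
def looks_like_base64?(data)
  (data.bytesize % 4).zero? && data.match?(BASE64_RE)
end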

suyograo commented 7 years ago

I like this idea better (though I admit it's dirty), and it's in fact fairly trivial to recognize whether something is an Avro packet or base64 encoded. I'd have to look it up, but I think this is in the first 4 bytes or so already, so we'd be somewhat transparent here.

=> I'm for staying backwards compatible here :)

Makes sense, I'm +1 on this as well to keep it backward compatible.

original-brownbear commented 7 years ago

alright, will be handled tomorrow :)

original-brownbear commented 7 years ago

@suyograo I feel legitimately bad for using rescue for logic here, but f1ae006f44aad782b4b4c5012a54018c684658e6 is probably by far the fastest option, given that the system would quickly converge to not containing any non-base64-encoded events, right?
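
For context, a hedged sketch of what such a rescue-based fallback could look like (names and structure are illustrative, not the actual commit): try the new Base64-wrapped path first, and fall back to decoding legacy raw-Avro events when that fails.

require "avro"
require "base64"
require "stringio"

# Illustrative sketch: prefer the new Base64-wrapped format, rescue into the legacy path.
def decode_datum(data, schema)
  reader = Avro::IO::DatumReader.new(schema)
  begin
    decoded = Base64.strict_decode64(data)
    reader.read(Avro::IO::BinaryDecoder.new(StringIO.new(decoded)))
  rescue ArgumentError, Avro::AvroError
    # Legacy events written before the Base64 change: decode the raw bytes directly.
    reader.read(Avro::IO::BinaryDecoder.new(StringIO.new(data)))
  end
end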

suyograo commented 7 years ago

Exception handling is really expensive in JRuby, but I see no other way here to reliably do a fallback, so I'm ok. The performance drag from this will be only until the old data is fully consumed, so we should be ok to live with this. When new data comes in encoded as Base64, all will be well with the world :)

LGTM

suyograo commented 7 years ago

@original-brownbear oh, the usual dance: bump the version (to a minor, not a patch), add a changelog entry, and we're good to merge and publish.

elasticsearch-bot commented 7 years ago

Armin Braun merged this into the following branches!

Branch: master
Commits: 91fab7b95a9803b3ee670b517e13b3478c5fe893, 9d94fe77b92452e2b219a2a18a2ef63bf7c8a5cc, 2217a39512a779f9fe11af6537012042fc08041a, e1c00c0876e99e4919bd06eee6094601c7bb6356

original-brownbear commented 7 years ago

@suyograo there we go https://rubygems.org/gems/logstash-codec-avro/versions/3.2.0-java :)