Closed: original-brownbear closed this 7 years ago
@original-brownbear I like this approach, but I wonder how we can handle existing data which has not been base64 encoded. One possibility is to make it a major version bump and then ask users to drain their Kafka queue before upgrading. Another possibility is to detect if the data is plain string or base64 encoded and have a separate path for them until new data is written.
Thoughts?
@suyograo
Let's see :)
One possibility is to make it a major version bump and then ask users to drain their Kafka queue before upgrading.
That's very easy for us, but may be incredibly painful for users in a production system. Imagine having to stop all producers, drain the queue, and live with an unusable system until draining finishes. It's somewhat straightforward for some setups, but I feel like this is just asking for support trouble, right?
Another possibility is to detect if the data is plain string or base64 encoded and have a separate path for them until new data is written.
I like this idea better (though I admit it's dirty), and it's in fact trivial to recognize whether something is an Avro packet or base64 encoded. I'd have to look it up, but I think the marker is in the first 4 bytes or so, so we'd be fairly transparent here. So I guess we could simply make a minor version bump, have the codec recognize which encoding the data is in, and deprecate that path eventually (which could be somewhat further in the future, since the check is trivial and barely impacts performance ... likely not even noticeably).
=> I'm for staying backwards compatible here :)
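For illustration, the magic-byte detection could look like the Ruby sketch below. It assumes payloads use the Avro Object Container File format, whose spec defines the 4-byte magic `Obj` + `0x01` (a datum-only binary encoding carries no such marker); the helper names are hypothetical, not from the actual plugin:

```ruby
require 'base64'

# 4-byte magic of the Avro Object Container File format: "Obj" followed by 0x01.
AVRO_OCF_MAGIC = "Obj\x01".b

# Hypothetical helper: true if the payload starts with the Avro magic,
# i.e. it is old data written before the codec Base64-encoded its output.
def raw_avro?(payload)
  payload.b.start_with?(AVRO_OCF_MAGIC)
end

# Hypothetical dispatch: legacy raw payloads pass through untouched,
# everything else is assumed to be the new Base64-wrapped format.
def decode_payload(payload)
  raw_avro?(payload) ? payload : Base64.strict_decode64(payload)
end
```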
Makes sense, I'm +1 on this as well to keep it backward compatible.
alright, will be handled tomorrow :)
@suyograo I feel legitimately bad for using rescue for logic here, but f1ae006f44aad782b4b4c5012a54018c684658e6 is probably by far the fastest approach, given that the system would quickly converge to not containing any non-base64-encoded events, right?
Exception handling is really expensive in JRuby, but I see no other way here to reliably do a fallback, so I'm ok. The performance drag from this will be only until the old data is fully consumed, so we should be ok to live with this. When new data comes in encoded as Base64, all will be well with the world :)
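As a minimal sketch of that rescue-based fallback (hypothetical helper name, not the actual commit): try the Base64 path first and treat a decode failure as a legacy, unencoded event. `Base64.strict_decode64` raises `ArgumentError` on invalid input, which is why the exception cost applies only while old data remains in the queue:

```ruby
require 'base64'

# Hypothetical fallback decoder: new events are Base64-wrapped; old events
# are raw Avro bytes that (almost always) fail strict Base64 decoding.
def decode_datum(data)
  Base64.strict_decode64(data)
rescue ArgumentError
  data  # legacy event written before the codec started Base64-encoding
end
```

One caveat: a raw payload could by coincidence be valid Base64, so this is only safe as a transition path, which matches the deprecation plan discussed above.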
LGTM
@original-brownbear oh, the usual dance: bump the version (minor, not patch), add a changelog entry, and we're good to merge and publish.
Armin Braun merged this into the following branches!
Branch | Commits |
---|---|
master | 91fab7b95a9803b3ee670b517e13b3478c5fe893, 9d94fe77b92452e2b219a2a18a2ef63bf7c8a5cc, 2217a39512a779f9fe11af6537012042fc08041a, e1c00c0876e99e4919bd06eee6094601c7bb6356 |
@suyograo there we go https://rubygems.org/gems/logstash-codec-avro/versions/3.2.0-java :)
Fixes #5 and #12
Long story short, the problem with using the Avro codec with Kafka is that the codec's interface is String for both the input and the output type. The Ruby implementation of Avro is set up to work with whatever chars Ruby strings may contain, but once these strings pass through Kafka's internal Java/Scala calls:
Bytes may be stripped, since whatever Ruby Avro produces isn't necessarily valid UTF-8.
=> Fixed by wrapping it all in base64 ... not pretty, but compatible with Strings in Kafka. => I suggest moving this to Java, as discussed elsewhere, to more than make up for the performance hit we take here.
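To illustrate why the fix works, here is a minimal sketch (helper names are hypothetical; the real codec serializes the event with the avro gem first): wrapping the binary Avro output in Base64 yields a pure-ASCII string that survives Kafka's String interface byte-for-byte:

```ruby
require 'base64'

# Hypothetical wrappers around the codec's encode/decode boundary.
def encode_for_kafka(avro_bytes)
  Base64.strict_encode64(avro_bytes)  # ASCII-only output, nothing to strip
end

def decode_from_kafka(payload)
  Base64.strict_decode64(payload)
end

# Demo: every possible byte value round-trips unchanged through the wrapper.
raw = (0..255).map(&:chr).join.force_encoding(Encoding::BINARY)
wrapped = encode_for_kafka(raw)
```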