reinh / statsd

A Ruby Statsd client that isn't a direct port of the Python example code. Because Ruby isn't Python.
MIT License

make batch_size count bytes, not messages #65

Closed ccutrer closed 7 years ago

ccutrer commented 7 years ago

Otherwise it's impossible to properly fill a UDP packet without risking going over, due to the varying lengths of messages.
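A quick illustration of the problem being described here: with a fixed message-count limit, the byte size of a batch depends entirely on metric name lengths, so the same count may or may not fit in one packet. The metric names and the 1432-byte payload figure below are illustrative assumptions (roughly a 1500-byte ethernet MTU minus IP/UDP headers), not values from the PR.

```ruby
# Ten short messages vs. ten long messages: same count, very different sizes.
short = Array.new(10) { "hits:1|c" }              # short metric names
long  = Array.new(10) { "#{'a' * 200}.hits:1|c" } # long metric names

short_bytes = short.join("\n").bytesize
long_bytes  = long.join("\n").bytesize

# Assumed safe UDP payload on ethernet: 1500 MTU minus IP/UDP headers.
MAX_PAYLOAD = 1432

puts "short batch: #{short_bytes} bytes (fits: #{short_bytes <= MAX_PAYLOAD})"
puts "long batch:  #{long_bytes} bytes (fits: #{long_bytes <= MAX_PAYLOAD})"
```

The short batch is well under the limit while the long batch overflows it, so a count limit alone cannot guarantee a batch fits in a single datagram.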

raggi commented 7 years ago

The underlying proposal herein is certainly a useful change.

I have a few concerns as proposed, which prevent me from immediately merging:

  1. Altering the public API in an incompatible way is not a good idea. It would be better to introduce a batch_size_bytes for this, and add some logic that appropriately coalesces the two values (for example, send on whichever comes first).
  2. The chosen size assuming ethernet seems reasonable, but I'm not sure how much value there is in worrying about fragmentation as a boundary. ipv4 will carry 64k without a problem on a reasonably reliable network, and most networks on which statsd is deployed are reasonably reliable.
  3. I don't think the method of backlog storage is ideal, that is the change to a string. I think we'd be better off keeping a count.
  4. The introduction of those dups is a separate concern from this change and deserves a separate discussion. For almost all users this is not necessary.
ccutrer commented 7 years ago
  1. Yeah, I wasn't sure what to do there. Maintaining two different batch size params and attempting to choose between them seems bug-prone and surprising to users. So I just left it using the same param, and assumed you would do a major version bump to point out the breaking API change. I can do whatever you would prefer, though.
  2. It seems far preferable to me to allow someone with a reliable network to increase the size, rather than default to something that will possibly break (mostly silently) for anyone whose network doesn't fit your ideal world.
  3. Sure, no problem, I can change that.
  4. The dup is only in spec code, not code that end users will run. And it will go away by switching back to keeping an array of backlogged messages that is combined/serialized at send time, rather than passing the actual backlog through to the fake socket, which is then cleared by the Batch object.
ccutrer commented 7 years ago

I've pushed up an amended commit that addresses 3 and 4.

raggi commented 7 years ago

Thank you :)

re.

  1. I should think `@backlog.size > @batch_size || @backlog_bytes > @batch_bytes_size` should be a fairly easy-to-maintain condition pair. I think I'd like for us to go this route, as it saves any user surprises. The other alternative is to introduce a separate batch class, but I don't think we're at sufficient complexity to warrant that yet. We can pick an arbitrarily large value for the default `@batch_bytes_size`.

  2. 👏 🉐 I parry that if we're not considering ideal worlds then we should do as DNS does, and pick 512 bytes, making space for swollen IP headers. ⚔️

I don't think we'll reach a perfect answer for 2, but let me know what you think.
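The coalesced condition pair described above can be sketched as a small batch class that flushes on whichever limit is hit first. All names here (`Batch`, `@batch_size`, `@batch_byte_size`) are illustrative, not the library's actual internals, and the byte default is an arbitrary ethernet-ish value:

```ruby
# Sketch of a dual-limit batch: flush when either the message-count
# limit or the byte-size limit is reached, whichever comes first.
class Batch
  def initialize(batch_size: 10, batch_byte_size: 1432)
    @batch_size      = batch_size
    @batch_byte_size = batch_byte_size
    @backlog         = []     # messages kept as an array, not a string
    @backlog_bytes   = 0      # running byte count of the backlog
  end

  def add(message)
    @backlog << message
    @backlog_bytes += message.bytesize + 1 # +1 for the joining newline
    flush if @backlog.size >= @batch_size || @backlog_bytes >= @batch_byte_size
  end

  def flush
    return if @backlog.empty?
    payload = @backlog.join("\n")
    # a real client would send payload over the UDP socket here
    @backlog.clear
    @backlog_bytes = 0
    payload
  end
end
```

Keeping the backlog as an array with a separate running byte count (rather than a growing string) matches the storage concern raised in point 3 earlier in the thread.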

ccutrer commented 7 years ago
  1. Okay, I went that way, but allowing either limiting param to be set to nil to disable it.
  2. I cleverly sidestep your parry by leaving the old defaults intact: 10 messages, and byte-size batching disabled :)
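The final behavior described above, where either limit can be nil to disable it and the defaults preserve the old behavior, boils down to a predicate like the following. The names and signature are illustrative, not necessarily the merged API:

```ruby
# Flush when any *enabled* limit is reached; a nil limit is ignored.
# Defaults keep the old behavior: 10 messages, byte batching off.
def should_flush?(count, bytes, batch_size: 10, batch_byte_size: nil)
  !!((batch_size && count >= batch_size) ||
     (batch_byte_size && bytes >= batch_byte_size))
end

should_flush?(10, 500)                                        # => true  (count limit hit)
should_flush?(5, 2000)                                        # => false (byte limit disabled)
should_flush?(5, 2000, batch_byte_size: 1432)                 # => true  (byte limit hit)
should_flush?(5, 500, batch_size: nil, batch_byte_size: 1432) # => false (count limit disabled)
```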
raggi commented 7 years ago

Thank you for your thoughtful responses, patience and contribution!

raggi commented 7 years ago

Released, too!