Closed: jszwedko closed this issue 2 years ago
I've been plugging away at this issue on nights and weekends and I've reached an impasse - maybe one of several 😅
Splitting the data by a byte length is easy, but it ignores unicode boundaries.
Splitting the data by byte length but backtracking to the nearest code point boundary generally preserves unicode boundaries, but does not respect grapheme cluster boundaries. Also, possibly only a problem in test code: if the chunk size is smaller than 4 bytes, it's ambiguous what should happen to a `char`, since a `char` can take up to 4 bytes when encoded as UTF-8. In short, respecting human-recognizable character boundaries is a tough nut to crack if the chunk size is given in bytes.
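For concreteness, a minimal demonstration of the first problem (a sketch, not Vector code): splitting at an arbitrary byte index can land in the middle of a code point, leaving invalid UTF-8 on both sides.

```rust
fn main() {
    let s = "é"; // 2 bytes in UTF-8: 0xC3 0xA9
    let bytes = s.as_bytes();
    assert_eq!(bytes.len(), 2);
    // A naive byte split at index 1 cuts the code point in half;
    // neither half is valid UTF-8 on its own.
    assert!(std::str::from_utf8(&bytes[..1]).is_err());
    assert!(std::str::from_utf8(&bytes[1..]).is_err());
}
```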
If we split the data on grapheme clusters, my initial reaction would be to pull in the unicode-segmentation crate (https://crates.io/crates/unicode-segmentation). On the other hand, I also don't think grapheme clusters have any limit on their length in bytes (see this "grapheme cluster" which is 69 bytes in length and yet somehow a single grapheme cluster: "ௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌ" which I believe is a case of "repeatedly adding ZERO WIDTH JOINER with a joined character"). Fun note, this "character" is also breaking VS Code's word wrapping algorithm.
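Even without pathological joiner chains, the three units (graphemes, code points, bytes) already diverge for ordinary text; a std-only sketch, no unicode-segmentation crate needed to see this much:

```rust
fn main() {
    // "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: one user-perceived
    // character (grapheme cluster), but two code points and three bytes.
    let s = "e\u{0301}";
    assert_eq!(s.chars().count(), 2); // code points
    assert_eq!(s.len(), 3);           // bytes: 'e' is 1, U+0301 is 2
}
```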
Obviously, a single grapheme cluster larger than 1 MB in length would be pathological, but it does make me wonder what can safely be assumed here. I will admit I still don't have much insight into Vector, the product, so I'm not sure what assumptions can be made going in.
Less importantly, I also have some implementation questions:
Hi @briankung!
Nice investigation on this. Some responses in-line below.
- Are clients primarily interested in specifying chunk size in terms of text boundaries or number of bytes?
In this case, to fit below API limits, bytes. In general the systems Vector sinks interact with limit the number of bytes, rather than characters, for incoming data.
- If they're interested in text boundaries, are they okay with assuming "characters" means unicode code points (at most 4 bytes in length) and not grapheme clusters? I think specifying a chunk size in grapheme clusters would be difficult to implement and likely not performant.
Yeah, that is a good observation for grapheme clusters. I think, to start, we can just focus on unicode code points, but that is a definite drawback that we should document.
- If they're interested in specifying chunk sizes in terms of number of bytes, is it okay to assume that we can disrespect unicode and grapheme boundaries? Or rather, is it okay to treat bytes as just bytes and rely on the clients to piece the data together after it has been split?
We could go that far for the initial implementation and just chunk on bytes, again documenting this restriction. My hesitation is that I think UTF-8 is fairly common; but still, it's better than not supporting it at all :)
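A minimal sketch of that bytes-only fallback (a hypothetical helper, not the eventual implementation): chunk on raw bytes and leave reassembly to the client.

```rust
// Hypothetical bytes-only chunker: splits every `max` bytes with no regard
// for UTF-8 boundaries.
fn byte_chunks(bytes: &[u8], max: usize) -> Vec<&[u8]> {
    bytes.chunks(max).collect()
}

fn main() {
    // "abécd" is 6 bytes; "é" straddles the 3-byte boundary.
    let parts = byte_chunks("abécd".as_bytes(), 3);
    assert_eq!(parts.len(), 2);
    // The second chunk starts with a UTF-8 continuation byte, so it cannot
    // be decoded on its own; the client must concatenate before decoding.
    assert!(std::str::from_utf8(parts[1]).is_err());
}
```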
- If we're interested in supporting chunk size as a number of bytes along with the requirement that the split respect unicode code points, can we enforce a minimum chunk size of 4 bytes such that chunks don't themselves split unicode code points?
I think `chunk` could return an error if the chunk size is smaller than 4 bytes and a character whose encoded length exceeds the chunk size is seen.
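A sketch of that error case under the code-point-respecting approach (a hypothetical `chunk`, not the shipped function): backtrack to the nearest char boundary, and fail if backtracking reaches zero because the next code point alone exceeds the chunk size.

```rust
// Hypothetical sketch: split `input` into chunks of at most `max_bytes`
// bytes without splitting a code point, erroring when that is impossible.
fn chunk(input: &str, max_bytes: usize) -> Result<Vec<&str>, String> {
    let mut chunks = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if rest.len() <= max_bytes {
            chunks.push(rest);
            break;
        }
        // Backtrack from `max_bytes` to the nearest code point boundary.
        let mut split = max_bytes;
        while split > 0 && !rest.is_char_boundary(split) {
            split -= 1;
        }
        if split == 0 {
            // The next code point alone is larger than `max_bytes`.
            return Err(format!(
                "chunk size {} is too small for a {}-byte character",
                max_bytes,
                rest.chars().next().map(char::len_utf8).unwrap_or(0)
            ));
        }
        let (head, tail) = rest.split_at(split);
        chunks.push(head);
        rest = tail;
    }
    Ok(chunks)
}

fn main() {
    // "é" is 2 bytes, so a 3-byte chunk boundary backtracks around it.
    assert_eq!(chunk("abécd", 3).unwrap(), vec!["ab", "éc", "d"]);
    // A 1-byte chunk size can never fit a 2-byte character.
    assert!(chunk("é", 1).is_err());
}
```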
- When I was attempting to split the data at the chunk size boundary while respecting code points, I would use `String::from_utf8(bytes)` to check if the bytes up to that point were a valid UTF-8 string. If they weren't, I would backtrack a certain number of bytes until I reached a byte that was a valid ASCII character or the leading byte of a UTF-8 sequence, but my method for fallibly validating that a byte array was valid UTF-8 was roughly the same. Is there a better way? I found this blog post, which was helpful, but was unsure whether to invest the time to pursue it: https://www.fpcomplete.com/blog/2018/07/streaming-utf8-haskell-rust/
That seems workable, but definitely slow. I think opening a PR with that implementation would be a good focus for discussion.
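One cheaper alternative (my suggestion, not something decided in this thread): `std::str::from_utf8` already reports how many leading bytes were valid via `Utf8Error::valid_up_to`, which yields the backtrack point in a single call instead of a byte-at-a-time loop. A sketch with a hypothetical helper:

```rust
use std::str;

// Sketch: find the largest prefix of `bytes` (at most `max` bytes long) that
// is valid UTF-8, using `Utf8Error::valid_up_to` instead of repeatedly
// re-validating while backtracking one byte at a time.
fn utf8_prefix(bytes: &[u8], max: usize) -> &str {
    let window = &bytes[..max.min(bytes.len())];
    match str::from_utf8(window) {
        Ok(s) => s,
        // `valid_up_to` is the index one past the last complete, valid
        // code point, so the slice up to it is guaranteed valid UTF-8.
        Err(e) => str::from_utf8(&window[..e.valid_up_to()]).unwrap(),
    }
}

fn main() {
    let s = "abé"; // 4 bytes: 'a', 'b', then "é" as 0xC3 0xA9
    assert_eq!(utf8_prefix(s.as_bytes(), 3), "ab"); // 0xC3 alone is dropped
    assert_eq!(utf8_prefix(s.as_bytes(), 4), "abé");
}
```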
- Is there an established pattern for testing larger amounts of data? I'd like a test case that specifically checks how some larger data is split. I was considering https://doc.rust-lang.org/std/macro.include_bytes.html, but I was also unsure where to put the bytes in question.
You can find a bunch of miscellaneous test data in `tests/data`. I think that'd be reasonable for this. You can just put it wherever in there for now; someday we might try to organize it better.
- Oh, I almost forgot! On Discord you mentioned possibly using a configuration setting `utf8 = true|false`. Is there an example of a pull request adding a config setting and using it in a stdlib function?
https://github.com/vectordotdev/vector/pull/13008/files popped to mind. This is the opposite -- removing a field rather than adding it -- but should hopefully get you pointed in the right direction.
Closed by #13794
Use Cases
A user wants to split up messages so that they fit within the 1 MB per message limitation of the API they are shipping to.
Attempted Solutions
We have `slice`, which could be used to do this once, but without a looping mechanism it wouldn't be possible to split the message more than once.

Proposal

A new `chunk` function that takes a string and a length (in bytes) and splits the message into multiple pieces so that they all fit within the limit.
Version
vector 0.22.2