vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Add `chunk` function to split up text into multiple pieces by size #13329

Closed: jszwedko closed this issue 2 years ago

jszwedko commented 2 years ago

Use Cases

A user wants to split up messages so that they fit within the 1 MB per message limitation of the API they are shipping to.

Attempted Solutions

We have `slice`, which could be used to do this once, but without a looping mechanism it wouldn't be possible to split the message more than once.

Proposal

A new `chunk` function that takes a string and a length (in bytes) and splits the message into multiple pieces so that each piece fits within the limit.

Version

vector 0.22.2

briankung commented 2 years ago

I've been plugging away at this issue on nights and weekends and I've reached an impasse - maybe one of several 😅

Requirements questions

...and/or trade-offs to consider

Splitting the data at a fixed byte length is easy, but it ignores Unicode boundaries.
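For concreteness, a minimal sketch of that naive byte-level split in Rust (function and parameter names here are illustrative, not actual Vector code):

```rust
// Naive split: fixed-size byte chunks, ignoring UTF-8 boundaries entirely.
// `max_len` must be non-zero, since `chunks` panics on a zero chunk size.
fn chunk_bytes(data: &[u8], max_len: usize) -> Vec<&[u8]> {
    data.chunks(max_len).collect()
}
```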

Splitting the data at a byte length but backtracking to the nearest code point boundary generally preserves Unicode code point boundaries, but it does not respect grapheme cluster boundaries. Also (possibly only a problem in test code), if the chunk size is smaller than 4 bytes, it's ambiguous what should happen, since a single code point can take up to 4 bytes when encoded as UTF-8. In short, respecting human-recognizable character boundaries is a tough nut to crack when the chunk size is given in bytes.
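A sketch of that backtracking variant, assuming the chunk size is at least 4 bytes so a code point boundary always exists at or before the limit (again, illustrative names only):

```rust
// Split on the nearest code point boundary at or before the byte limit.
// Requires max_len >= 4 so that backtracking never reaches index 0.
fn chunk_str(s: &str, max_len: usize) -> Vec<&str> {
    assert!(max_len >= 4, "a UTF-8 code point can be up to 4 bytes long");
    let mut chunks = Vec::new();
    let mut rest = s;
    while rest.len() > max_len {
        // Back up from the byte limit until we land on a char boundary.
        let mut split = max_len;
        while !rest.is_char_boundary(split) {
            split -= 1;
        }
        let (head, tail) = rest.split_at(split);
        chunks.push(head);
        rest = tail;
    }
    if !rest.is_empty() {
        chunks.push(rest);
    }
    chunks
}
```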

If we split the data on grapheme clusters, my initial reaction would be to pull in the unicode-segmentation crate (https://crates.io/crates/unicode-segmentation). On the other hand, grapheme clusters don't appear to have any limit on their length in bytes: this "grapheme cluster" is 69 bytes long and yet somehow a single cluster: "ௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌௌ" (I believe it's a case of repeatedly adding ZERO WIDTH JOINER with a joined character). Fun note: this "character" also breaks VS Code's word-wrapping algorithm.
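If we did go that route, a grapheme-aware chunker built on unicode-segmentation might look roughly like this sketch (my own illustration, not tested against Vector); note that a single cluster larger than the limit would still produce an oversized chunk:

```rust
use unicode_segmentation::UnicodeSegmentation;

// Accumulate grapheme clusters into chunks of at most `max_len` bytes.
// A single cluster longer than `max_len` still becomes its own oversized chunk.
fn chunk_graphemes(s: &str, max_len: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for grapheme in s.graphemes(true) {
        if !current.is_empty() && current.len() + grapheme.len() > max_len {
            chunks.push(std::mem::take(&mut current));
        }
        current.push_str(grapheme);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```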

Obviously a single grapheme cluster larger than 1MB in length would be pathological, but it does make me wonder:

  1. Are clients primarily interested in specifying chunk size in terms of text boundaries or number of bytes?
  2. If they're interested in text boundaries, are they okay with assuming "characters" means unicode code points (at most 4 bytes in length) and not grapheme clusters? I think specifying a chunk size in grapheme clusters would be difficult to implement and likely not performant.
  3. If they're interested in specifying chunk sizes in terms of number of bytes, is it okay to assume that we can disrespect unicode and grapheme boundaries? Or rather, is it okay to treat bytes as just bytes and rely on the clients to piece the data together after it has been split?
  4. If we're interested in supporting chunk size as a number of bytes along with the requirement that the split respect unicode code points, can we enforce a minimum chunk size of 4 bytes such that chunks don't themselves split unicode code points?

I will admit I still don't have much insight into Vector, the product, so I'm not sure what assumptions can be made going into it.

Implementation / general Vector questions

Less importantly, I also have some implementation questions:

  1. When I was attempting to split the data at the chunk size boundary while respecting code points, I would use `String::from_utf8(bytes)` to check whether the bytes up to that point were a valid UTF-8 string. If they weren't, I would backtrack a certain number of bytes until I reached a byte that was either a valid ASCII character or the leading byte of a UTF-8 sequence, but my method for fallibly validating that a byte array was valid UTF-8 was roughly the same. Is there a better way? I found this blog post, which was helpful, but was unsure whether to invest the time to pursue it: https://www.fpcomplete.com/blog/2018/07/streaming-utf8-haskell-rust/
  2. Is there an established pattern for testing larger amounts of data? I'd like a test case that specifically verifies how a larger piece of data is split into chunks. I was considering https://doc.rust-lang.org/std/macro.include_bytes.html, but I was also unsure where to put the bytes in question.
  3. Oh, I almost forgot! On Discord you mentioned possibly using a configuration setting `utf8 = true|false`. Is there an example of a pull request adding a config setting and using it in a stdlib function?

jszwedko commented 2 years ago

Hi @briankung !

Nice investigation on this. Some responses in-line below.

  • Are clients primarily interested in specifying chunk size in terms of text boundaries or number of bytes?

In this case, to fit below API limits, bytes. In general, the systems Vector's sinks interact with limit incoming data by number of bytes rather than by characters.

  • If they're interested in text boundaries, are they okay with assuming "characters" means unicode code points (at most 4 bytes in length) and not grapheme clusters? I think specifying a chunk size in grapheme clusters would be difficult to implement and likely not performant.

Yeah, that is a good observation about grapheme clusters. I think, to start, we can just focus on Unicode code points, but that is a definite drawback that we should document.

  • If they're interested in specifying chunk sizes in terms of number of bytes, is it okay to assume that we can disrespect unicode and grapheme boundaries? Or rather, is it okay to treat bytes as just bytes and rely on the clients to piece the data together after it has been split?

We could go that far for the initial implementation and just chunk on bytes, again documenting this restriction. My hesitation is that UTF-8 input is fairly common, but still, byte-only chunking is better than not supporting chunking at all :)

  • If we're interested in supporting chunk size as a number of bytes along with the requirement that the split respect unicode code points, can we enforce a minimum chunk size of 4 bytes such that chunks don't themselves split unicode code points?

I think `chunk` could return an error if the chunk size is smaller than 4 bytes and a character that encodes to more bytes than the chunk size is seen.

  • When I was attempting to split the data at the chunk size boundary while respecting code points, I would use `String::from_utf8(bytes)` to check whether the bytes up to that point were a valid UTF-8 string. If they weren't, I would backtrack a certain number of bytes until I reached a byte that was either a valid ASCII character or the leading byte of a UTF-8 sequence, but my method for fallibly validating that a byte array was valid UTF-8 was roughly the same. Is there a better way? I found this blog post, which was helpful, but was unsure whether to invest the time to pursue it: https://www.fpcomplete.com/blog/2018/07/streaming-utf8-haskell-rust/

That seems workable, but definitely slow. I think opening a PR with that implementation would be a good focus for discussion.
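For reference, one way to avoid re-validating the prefix on every backtracking step is to let the standard library report the valid prefix length via `Utf8Error::valid_up_to()`. A sketch, assuming the input is already a `&str` (illustrative only, not an agreed-upon implementation):

```rust
// Find a split point at or before `max_len` that does not cut a code point.
// Since `s` is already valid UTF-8, the only possible validation error on a
// truncated prefix is an incomplete final sequence, and `valid_up_to()`
// reports exactly where that sequence starts.
fn split_point(s: &str, max_len: usize) -> usize {
    let limit = max_len.min(s.len());
    match std::str::from_utf8(&s.as_bytes()[..limit]) {
        Ok(valid) => valid.len(),
        Err(e) => e.valid_up_to(),
    }
}
```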

  • Is there an established pattern for testing larger amounts of data? I'd like a test case that specifically verifies how a larger piece of data is split into chunks. I was considering https://doc.rust-lang.org/std/macro.include_bytes.html, but I was also unsure where to put the bytes in question.

You can find a bunch of miscellaneous test data in `tests/data`. I think that'd be reasonable for this. You can just put it wherever in there for now; someday we might try to organize it better.
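For example, a hypothetical test along these lines (the fixture path, the 1 KiB limit, and the `chunk_str` helper from the sketch earlier in this thread are all illustrative, not existing code):

```rust
// `include_bytes!` resolves the path relative to the source file it appears in.
#[test]
fn chunked_fixture_fits_within_limit() {
    let data = include_bytes!("../tests/data/chunk_fixture.log");
    let text = std::str::from_utf8(data).expect("fixture should be valid UTF-8");

    let pieces = chunk_str(text, 1024);
    // Every piece respects the byte limit, and nothing is lost or reordered.
    assert!(pieces.iter().all(|p| p.len() <= 1024));
    assert_eq!(pieces.concat(), text);
}
```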

  • Oh, I almost forgot! On Discord you mentioned possibly using a configuration setting `utf8 = true|false`. Is there an example of a pull request adding a config setting and using it in a stdlib function?

https://github.com/vectordotdev/vector/pull/13008/files popped to mind. This is the opposite -- removing a field rather than adding it -- but should hopefully get you pointed in the right direction.

jszwedko commented 2 years ago

Closed by #13794