Closed dbw9580 closed 1 year ago
Sounds reasonable, although I'd expect it to have max-bytes, min-bytes, and avg-bytes parameters as well.
This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your use case.
You can learn more in the FAQs for the Boxo repo copying/consolidation effort.
When working with some text-based data files, e.g. CSV records and JSON streams, I think it would be more meaningful to break on line boundaries: each resulting piece is then a complete dataset by itself and is easier to process, without the hassle of concatenating two parts to recover a record that was split across chunks. So I think a chunker that recognizes newlines and allows specifying a "min lines" and a "max lines" parameter (similar to the size parameters of the Rabin chunker) would be very helpful.