tenzir / public-roadmap

The public roadmap of Tenzir
https://docs.tenzir.com/roadmap

Slice Operator #70

Closed dominiklohmann closed 7 months ago

dominiklohmann commented 12 months ago

We want to implement a `slice begin:end` syntax that cuts events or bytes in the half-open interval $[begin, end)$. A negative index counts from the end rather than from the beginning, a convention familiar to users with experience in Python, jq, and many other languages.

Both `begin` and `end` can be omitted; they default to 0 and the size of the input, respectively.
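To make the intended semantics concrete, here is a minimal Python sketch of how the bounds could be resolved. It is not the operator's implementation: `resolve_bounds` and `slice_events` are hypothetical names, and clamping out-of-range indices the way Python does is an assumption, since the issue does not say how such indices should behave.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def resolve_bounds(begin: Optional[int], end: Optional[int], size: int) -> Tuple[int, int]:
    """Map omitted or negative bounds onto a concrete interval [begin, end)."""
    b = 0 if begin is None else begin    # omitted begin defaults to 0
    e = size if end is None else end     # omitted end defaults to the input size
    if b < 0:
        b += size                        # negative index counts from the end
    if e < 0:
        e += size
    # Assumption: out-of-range bounds are clamped rather than rejected.
    b = min(max(b, 0), size)
    e = min(max(e, 0), size)
    return b, max(b, e)                  # an inverted interval becomes empty


def slice_events(events: Sequence[T], begin: Optional[int] = None,
                 end: Optional[int] = None) -> List[T]:
    """Return the events in [begin, end) after resolving the bounds."""
    b, e = resolve_bounds(begin, end, len(events))
    return list(events[b:e])
```

Note that resolving a negative index requires knowing the size of the input, so a streaming implementation would presumably have to buffer in that case.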

### Definition of Done
- [x] Implement the `slice` operator.
- [x] Re-implement `head N` as `slice :N`.
- [x] Re-implement `tail N` as `slice -N:`.
- [x] Document the `slice` operator.
- [x] ~Consider adding strides (see discussion below).~ rejected for now
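
The `head` and `tail` equivalences in the checklist can be illustrated with Python's own slicing. This is only a sketch using the standard library; the `head` and `tail` helpers below are stand-ins for the operators, not their actual implementation.

```python
from collections import deque
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def head(events: Iterable[T], n: int) -> List[T]:
    """First n events; stops reading the input after n items."""
    return list(islice(events, n))


def tail(events: Iterable[T], n: int) -> List[T]:
    """Last n events; keeps at most n items buffered while reading."""
    return list(deque(events, maxlen=n))


data = list(range(10))
assert head(data, 3) == data[:3]    # head N  ==  slice :N
assert tail(data, 3) == data[-3:]   # tail N  ==  slice -N:
```

One edge case in the Python analogy: `data[-0:]` is the whole list because `-0` equals `0`, whereas `tail(data, 0)` is empty. Whether `slice -N:` has the same corner case for `N = 0` is not stated in this issue.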

jachris commented 12 months ago

Could also make this begin:end:step, while we are at it (e.g., ::2 or 0:-1:2 for every second event).
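
For reference, this is how the proposed `begin:end:step` form behaves under Python's slicing rules. The `parse_slice` helper is hypothetical and only exists to turn the textual form into a Python `slice`; the proposed operator syntax may well differ in details.

```python
def parse_slice(spec: str) -> slice:
    """Parse 'begin:end' or 'begin:end:step'; empty parts mean 'omitted'."""
    parts = spec.split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected begin:end[:step], got {spec!r}")
    return slice(*[int(part) if part else None for part in parts])


events = list(range(9))                      # nine events: 0..8
every_second = events[parse_slice("::2")]    # starts at 0, runs to the end
without_last = events[parse_slice("0:-1:2")] # same stride, but end=-1 drops the last event
```

With an odd number of events the two forms differ: `0:-1:2` excludes the final event because the interval is half-open.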

mavam commented 12 months ago

step is interesting, as it allows "strided" sampling. But that could also be a dedicated sample operator applied downstream, where striding with step size is just one of many ways to sample.

dominiklohmann commented 12 months ago

I think adding stride with start:end:stride à la Python is a natural evolution for this operator because it's a syntax that is familiar to most users already. We also already have an aggregation function named sample (choose one element of a series).

However, please note that there exist multiple ways to implement strides. The naïve implementation creates batches of size one. The less naïve implementation keeps the stride in the series implementation without actually modifying the data, and applies the stride lazily when the data is serialized. We should be mindful of that.
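
The difference between the two approaches can be sketched in Python with plain lists standing in for batches. Everything here is illustrative: `stride_naive`, `stride_lazy`, and `StridedView` are hypothetical names, and the real series implementation would look different.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

Batch = List[int]  # stand-in for a batch of events


def stride_naive(batches: Iterable[Batch], step: int) -> Iterator[Batch]:
    """Naive variant: every selected event becomes its own batch of size one."""
    index = 0
    for batch in batches:
        for event in batch:
            if index % step == 0:
                yield [event]
            index += 1


@dataclass
class StridedView:
    """Lazy variant: reference the original batch and remember the stride."""
    data: Batch   # shared with the input, not copied
    offset: int   # position of the first selected event in this batch
    step: int

    def serialize(self) -> Batch:
        # The stride is applied only here, when the data is written out.
        return self.data[self.offset::self.step]


def stride_lazy(batches: Iterable[Batch], step: int) -> Iterator[StridedView]:
    offset = 0
    for batch in batches:
        if offset < len(batch):
            yield StridedView(batch, offset, step)
        # Carry the stride across the batch boundary.
        offset = (offset - len(batch)) % step
```

In this sketch the naive variant emits one batch per selected event, while the lazy variant emits at most one view per input batch and defers the copy to `serialize`.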

For the scope of this roadmap item we should consider ignoring the stride for now. I've added a task to the roadmap item to consider adding strides.