Closed mavam closed 9 months ago
I've done some research on this and think that Arrow is our best bet here.
We should integrate this through the S3 file system abstraction bundled with Arrow. The API supports all major ways to connect to S3 out of the box, which we can abstract away in an S3 plugin, configured via a separate `.config/tenzir/plugin/s3.yaml` file.
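A hypothetical sketch of what such a plugin configuration could look like (every key here is made up for illustration; this is not a finalized schema):

```yaml
# .config/tenzir/plugin/s3.yaml (illustrative only)
# Authentication mode; "default" falls back to the standard AWS
# environment variables and configuration files.
auth: default
# region: eu-west-1     # optional region override (hypothetical key)
# access-key: ...       # only for access-key based auth (hypothetical key)
# secret-key: ...
```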
We already have a mechanism for converting between our `chunk` and Arrow's equivalent `arrow::Buffer` in both directions without copying, which makes integration with the input and output streams straightforward.
As far as options go, I suggest supporting the following:

- `arrow::S3Options::Defaults()`, which already makes use of standard AWS environment variables and/or configuration files.
- `arrow::S3Options::Anonymous()`.
- `arrow::S3Options::FromAccessKey()`, where an access key and a secret key must be provided and an optional session token may be provided.
- `arrow::S3Options::FromAssumeRole()` and `arrow::S3Options::FromAssumeRoleWithWebIdentity()`.
- An `--append` flag for writing.

+1 from a Discord user request:

> can tenzir read in from s3? I'd like to read in all my historical data
> - Provide an opt-in for access-key based authentication via `arrow::S3Options::FromAccessKey()`, where an access key and a secret key must be provided and an optional session token may be provided.
> - Provide an opt-in for assume-role based authentication via `arrow::S3Options::FromAssumeRole()` and `arrow::S3Options::FromAssumeRoleWithWebIdentity()`.

I think these two are not really necessary, as the default sourcing of credentials is already very flexible and AWS users are familiar with it. I would argue for starting simple and only adding these if users request them.
For arguments, we want to support `--anonymous` for both reading and writing, and (if easily possible) `--append` for writing.
The S3 Connector would make it possible to write to an S3 bucket, e.g.,
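A hypothetical invocation could look like the following (the connector syntax and bucket name are illustrative assumptions, not a finalized interface):

```
# write events to an S3 bucket (illustrative syntax)
tenzir 'export | to s3://my-bucket/events.json'
```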
There's also the dual of reading:
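Again purely as an illustrative sketch (syntax and bucket name are assumptions), reading historical data back could look like:

```
# read historical data from an S3 bucket (illustrative syntax)
tenzir 'from s3://my-bucket/events.json | import'
```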