S3 and GCS Connectors

mavam commented 12 months ago

The S3 Connector would make it possible to write to an S3 bucket, e.g.,

read zeek-tsv | write parquet to s3 <args>

There's also the dual of reading:

from s3 read json | ...

### Definition of Done
- [x] Identify desired `<args>`
- [x] Find out best way to integrate (e.g., Arrow comes with S3)
- [x] Implement

dominiklohmann commented 11 months ago

I've done some research on this and think that Arrow is our best bet here.

We should integrate this through the S3 file system abstraction bundled with Arrow. The API supports all major ways to connect to S3 out of the box, which we can abstract away in an S3 plugin in a separate .config/tenzir/plugin/s3.yaml configuration file.

We already have a mechanism for converting between our chunk and Arrow's equivalent arrow::Buffer in both directions without copying, which makes integration with the input and output streams straightforward.

As far as options go, I suggest supporting the following:

- A URI for the resource to access as a positional argument.
- Options for authenticating with S3:
  - If not provided, use arrow::S3Options::Defaults(), which already makes use of standard AWS environment variables and/or configuration files.
  - Provide an opt-in for anonymous access via arrow::S3Options::Anonymous().
  - Provide an opt-in for access-key based authentication via arrow::S3Options::FromAccessKey(), where an access key and a secret key must be provided and an optional session token may be provided.
  - Provide an opt-in for assume-role based authentication via arrow::S3Options::FromAssumeRole() and arrow::S3Options::FromAssumeRoleWithWebIdentity().
- For writing, an additional --append flag.
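To make the proposed precedence concrete, here is a minimal sketch of how the authentication options above could map onto the Arrow factory functions. It is plain Python that only returns the name of the factory to call and does not touch Arrow itself. The `S3Auth` container, its field names, and the rule that the modes are mutually exclusive are assumptions for illustration, not part of the proposal. In Arrow's C++ API the factories live on `arrow::fs::S3Options`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class S3Auth:
    """Hypothetical container for the authentication options listed above."""
    anonymous: bool = False
    access_key: Optional[str] = None
    secret_key: Optional[str] = None
    session_token: Optional[str] = None
    role_arn: Optional[str] = None
    web_identity: bool = False


def select_factory(auth: S3Auth) -> str:
    """Return the name of the S3Options factory that matches `auth`."""
    has_keys = auth.access_key is not None or auth.secret_key is not None
    # Assumption: at most one authentication mode may be requested at a time.
    modes = [auth.anonymous, has_keys, auth.role_arn is not None, auth.web_identity]
    if sum(modes) > 1:
        raise ValueError("authentication options are mutually exclusive")
    if auth.session_token is not None and not has_keys:
        raise ValueError("a session token requires an access key and a secret key")
    if auth.anonymous:
        return "Anonymous"
    if has_keys:
        # Both keys are mandatory; the session token is optional.
        if auth.access_key is None or auth.secret_key is None:
            raise ValueError("access key and secret key must be given together")
        return "FromAccessKey"
    if auth.role_arn is not None:
        return "FromAssumeRole"
    if auth.web_identity:
        return "FromAssumeRoleWithWebIdentity"
    # Nothing requested: fall back to the standard AWS credential chain.
    return "Defaults"
```

With no options set, `select_factory(S3Auth())` falls through to `"Defaults"`, which matches the "if not provided" case in the list.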

mavam commented 11 months ago

+1 from a Discord user request:

> can tenzir read in from s3? I'd like to read in all my historical data

rdettai commented 10 months ago

> Provide an opt-in for access-key based authentication via arrow::S3Options::FromAccessKey(), where an access key and a secret key must be provided and an optional session token may be provided.
>
> Provide an opt-in for assume-role based authentication via arrow::S3Options::FromAssumeRole() and arrow::S3Options::FromAssumeRoleWithWebIdentity().

I think these two are not really necessary as the default sourcing for credentials is already very flexible and AWS users are familiar with it. I would argue for starting simple and only adding these if requested by users.

dominiklohmann commented 10 months ago

For args, we want to support --anonymous for both reading and writing, and (iff easily possible) --append for writing.
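As a rough illustration of that reduced argument set, the following Python sketch parses a positional URI plus `--anonymous`, and accepts `--append` only when writing. The function name `parse_s3_args`, the `writing` parameter, the `s3://bucket/key` URI shape, and the returned dictionary are all assumptions made for this example. The actual connector would do this in C++ inside the plugin.

```python
import argparse
from urllib.parse import urlsplit


def parse_s3_args(argv, writing):
    """Parse hypothetical connector arguments: <uri> [--anonymous] [--append]."""
    parser = argparse.ArgumentParser(prog="s3", add_help=False)
    parser.add_argument("uri")
    parser.add_argument("--anonymous", action="store_true")
    if writing:
        # --append only makes sense for the write direction.
        parser.add_argument("--append", action="store_true")
    args = parser.parse_args(argv)

    # Assumption: resources are addressed as s3://<bucket>/<key>.
    parts = urlsplit(args.uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"expected an s3://bucket/key URI, got {args.uri!r}")
    return {
        "bucket": parts.netloc,
        "key": parts.path.lstrip("/"),
        "anonymous": args.anonymous,
        "append": getattr(args, "append", False),
    }
```

Passing `--append` with `writing=False` makes `argparse` reject the flag as unrecognized and exit, which is one simple way to keep the flag out of the read direction.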

tenzir / public-roadmap

S3 and GCS Connectors #69