redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Using Pub/Sub notifications for Google Cloud Storage #1034

Open mhite opened 2 years ago

mhite commented 2 years ago

I was briefly playing with the GCS input, which behaves like a batch input, exiting once there is no more data left to process. It would be nice to have a "monitor" mode where it runs continuously and detects new objects to download based on Pub/Sub notifications.
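For context, a minimal `gcp_cloud_storage` input along these lines (bucket name and prefix are placeholders; field names per the component docs) lists the matching objects once and then shuts down when they are exhausted, which is the batch behaviour described above:

```yaml
input:
  gcp_cloud_storage:
    bucket: my-log-bucket # placeholder bucket name
    prefix: logs/         # only consume objects under this prefix
    codec: lines          # emit one message per line of each object
```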

GCS can publish notifications to Pub/Sub that trigger on storage object events.

https://cloud.google.com/storage/docs/pubsub-notifications

Might there be a way to approximate for GCS the functionality we see in the SQS feature of the aws_s3 input?

https://www.benthos.dev/docs/components/inputs/aws_s3#sqs
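For comparison, the SQS mode of the aws_s3 input looks roughly like this (the queue URL is a placeholder; the key/bucket paths shown are the documented defaults for S3 event notifications):

```yaml
input:
  aws_s3:
    sqs:
      url: https://sqs.us-east-1.amazonaws.com/123456789012/bucket-events # placeholder queue URL
      key_path: Records.*.s3.object.key     # where each event stores the object key
      bucket_path: Records.*.s3.bucket.name # where each event stores the bucket name
```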

Short of adding new functionality to the GCS input, is there a clever way of parsing the Pub/Sub notifications, storing the object names encoded in them, and then firing off another pipeline to consume and download the objects? Or is this really better served by an enhancement to the existing GCS input? (I suspect it is, but I'm new to Benthos, so perhaps there is an awesome approach I'm missing!)
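To illustrate the kind of interim approach I have in mind, something like the sketch below would extract the object names, assuming the gcp_pubsub input exposes the notification attributes as metadata (project and subscription names are placeholders); it still leaves the download step unsolved:

```yaml
input:
  gcp_pubsub:
    project: my-project              # placeholder project ID
    subscription: gcs-notifications  # placeholder subscription name
pipeline:
  processors:
    - mapping: |
        # bucketId / objectId / eventType are attributes GCS sets on each notification
        root.bucket = meta("bucketId")
        root.name   = meta("objectId")
        root.event  = meta("eventType")
```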

Thanks for all your help,

-M

Jeffail commented 2 years ago

Hey @mhite, this is a totally reasonable addition. I've held off on implementing this myself in the past because I'm not too familiar with Google services, but we already have most of the work from the aws_s3 input ready to port over.
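Purely as a hypothetical sketch, the shape would probably mirror the sqs block of aws_s3, something like the following (none of these fields exist yet, and all values are placeholders):

```yaml
input:
  gcp_cloud_storage:
    bucket: my-log-bucket              # placeholder
    pubsub:                            # hypothetical field, does not exist today
      project: my-project              # placeholder project ID
      subscription: gcs-notifications  # placeholder subscription name
```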

mhite commented 2 years ago

@Jeffail - That's great! Google Cloud can route its logs not just to Pub/Sub (where the existing input seems to work great, by the way) but also to GCS as small chunk files in NDJSON (newline-delimited JSON) format. When you don't need the log data immediately and can tolerate a minor delay, GCS works well and should cost a lot less than Pub/Sub. Adding Pub/Sub notifications on object create/finalize to trigger downloads in Benthos creates a nice middle ground: not quite streaming, but not scheduled/cron batch either. (From a cost perspective you are still creating Pub/Sub messages, but far fewer, since each file contains a batch of log messages.)
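For completeness, per the GCS notification docs linked earlier, each object-create notification carries the object's identity in its message attributes, which is what either a workaround or a native feature would key off. Roughly (values are placeholders):

```yaml
eventType: OBJECT_FINALIZE
payloadFormat: JSON_API_V1
bucketId: my-log-bucket
objectId: logs/chunk-0001.json
```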