qubole / streamx

kafka-connect-s3 : Ingest data from Kafka to object stores (S3)
Apache License 2.0

S3 partition file per hourly batch #47

Open panda87 opened 7 years ago

panda87 commented 7 years ago

Hi

I'd like to know if there is an option to write one file per partition, meaning one file per hour. For example, if I have 5 workers with 5 tasks and I run an hourly batch, would this plugin know to aggregate the data into one file per batch run?

Thanks D.

OneCricketeer commented 6 years ago

You could use the TimeBasedPartitioner and a rotation interval configured for an hour.

However, this is not recommended for large-volume topics, since the connector needs to buffer an hour's worth of data before writing.

Also, why do you need this? Spark, Presto, Pig, Hive, etc. can all read multiple files from a top-level S3 path.
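To sketch what that suggestion looks like: below is a hypothetical sink-connector config using an hourly TimeBasedPartitioner. The property names follow Confluent's storage partitioner conventions (`partition.duration.ms`, `path.format`, `rotate.schedule.interval.ms`); streamx is a separate fork, so its exact class names and supported properties may differ — treat this as an illustrative assumption, not a verified streamx config.

```json
{
  "name": "s3-hourly-sink",
  "config": {
    "topics": "my_topic",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "3600000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "locale": "en-US",
    "timezone": "UTC",
    "timestamp.extractor": "Record",
    "rotate.schedule.interval.ms": "3600000"
  }
}
```

With `partition.duration.ms` and the rotation interval both set to one hour, each task closes and uploads its current file roughly once per hour per topic partition — note this still yields one file per task/partition, not a single aggregated file per hour.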