qubole / streamx

kafka-connect-s3: Ingest data from Kafka to Object Stores (S3)
Apache License 2.0

Strange problem with Parquet files in S3 #56

Open iskohl opened 6 years ago

iskohl commented 6 years ago

I use streamx to sink Kafka data to S3 as Parquet files. Everything looks fine: the logs below show the Parquet files being generated and committed as expected,

Dec 19, 2017 8:02:35 AM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 29B for [accessuri] BINARY: 1 values, 6B raw, 8B comp, 1 pages, encodings: [
[2017-12-19 08:02:35,933] INFO Committed s3.test/topics/colin-forecast/year=2017/month=12/day=19/colin-forecast+0+0000045248+0000045248.parquet for colin-forecast-0 (io.confluent.connect.hdfs.TopicPartitionWriter:638)
[2017-12-19 08:02:35,947] INFO Got brand-new compressor [.snappy] (org.apache.hadoop.io.compress.CodecPool:153)
[2017-12-19 08:02:35,948] INFO Starting commit and rotation for topic partition colin-forecast-0 with start offsets {year=2017/month=12/day=19=45249} and end offsets {year=2017/month=12/day=19=45249} (io.confluent.connect.hdfs.TopicPartitionWriter:302)
[2017-12-19 08:02:35,949] INFO Committed s3.test/topics/colin-forecast/year=2017/month=12/day=19/colin-forecast+0+0000045249+0000045249.parquet for colin-forecast-0 (io.confluent.connect.hdfs.TopicPartitionWriter:638)
[2017-12-19 08:02:35,961] INFO Got brand-new compressor [.snappy] (org.apache.hadoop.io.compress.CodecPool:153)
[2017-12-19 08:02:35,962] INFO Starting commit and rotation for topic partition colin-forecast-0 with start offsets {year=2017/month=12/day=19=45250} and end offsets {year=2017/month=12/day=19=45250} (io.confluent.connect.hdfs.TopicPartitionWriter:302)
[2017-12-19 08:02:35,963] INFO Committed s3.test/topics/colin-forecast/year=2017/month=12/day=19/colin-forecast+0+0000045250+0000045250.parquet for colin-forecast-0 (io.confluent.connect.hdfs.TopicPartitionWriter:638)
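One way to double-check whether the committed objects actually exist is to list the target prefix directly with the AWS Java SDK. This is a minimal sketch, not part of the connector itself; the bucket name `s3.test` and the `topics/colin-forecast/...` prefix are only inferred from the "Committed ..." log lines above and may need adjusting:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class VerifyParquetInS3 {
    public static void main(String[] args) {
        // Bucket and prefix are guesses based on the committed paths in the
        // logs above; adjust them to match the actual bucket layout.
        String bucket = "s3.test";
        String prefix = "topics/colin-forecast/year=2017/month=12/day=19/";

        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName(bucket)
                .withPrefix(prefix);
        ListObjectsV2Result result;
        do {
            // Page through all keys under the prefix and print each one.
            result = s3.listObjectsV2(req);
            for (S3ObjectSummary obj : result.getObjectSummaries()) {
                System.out.printf("%s (%d bytes)%n", obj.getKey(), obj.getSize());
            }
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
    }
}
```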

But I cannot find the Parquet files in S3; listing the bucket returns nothing. Why? Do I need some configuration on the S3 side? Thanks in advance.
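Since streamx writes through the Hadoop FileSystem layer (the `io.confluent.connect.hdfs.TopicPartitionWriter` in the logs comes from kafka-connect-hdfs), one sanity check is to open the same filesystem by hand and confirm the credentials and bucket are reachable. This is a minimal sketch, assuming the `s3a` scheme and the standard `fs.s3a.*` credential properties (the property names differ for `s3n`); in a real deployment these keys would normally live in the `core-site.xml` that the connector's Hadoop configuration points at:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3CredentialCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; normally configured in core-site.xml
        // rather than hard-coded.
        conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

        // "s3.test" is assumed from the committed paths in the logs above.
        FileSystem fs = FileSystem.get(new URI("s3a://s3.test/"), conf);
        for (FileStatus st : fs.listStatus(new Path("/topics/colin-forecast/"))) {
            System.out.println(st.getPath() + " (" + st.getLen() + " bytes)");
        }
    }
}
```

If this listing also comes back empty while the connector reports successful commits, that would suggest the connector is committing to a different filesystem (for example a local or HDFS path) than the bucket being inspected.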