qubole / streamx

kafka-connect-s3 : Ingest data from Kafka to Object Stores (S3)
Apache License 2.0

Do I have to set up HDFS in order to use streamX? #60

Open iShiBin opened 6 years ago

iShiBin commented 6 years ago

I noticed that I have to set up Hadoop config files like core-site.xml and hdfs-site.xml in order to configure S3, and I could not find the mentioned config/hadoop-conf directory in my installation (Kafka 0.10.2.0). So do I have to use HDFS in order to use streamX?

What I am trying to do is transform messages in JSON format to Parquet and then store them in S3.

Spark could achieve this, but it would require a long-running cluster; alternatively, I could use checkpointing to do a basic once-per-day ETL.

OneCricketeer commented 5 years ago

And I could not find the mentioned config/hadoop-conf directory in my installation (Kafka 0.10.2.0).

Kafka is not a Hadoop project, which is why you will not find that folder there. You have to create it yourself. An EMR instance, or another EC2 machine provisioned with Hadoop, would already have this folder.
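
For illustration, a minimal core-site.xml placed in a hand-made hadoop-conf directory might look like the sketch below. The fs.s3a.* keys assume the s3a filesystem connector; depending on your streamx and Hadoop versions you may need the older fs.s3 or fs.s3n properties instead.

```xml
<!-- hadoop-conf/core-site.xml: a minimal sketch, assuming the s3a connector -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_SECRET_KEY</value>
  </property>
</configuration>
```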

So do I have to use HDFS in order to use this streamX?

Not exactly, but you do need a Hadoop-compatible filesystem (which S3 is).

Since this project uses the Hadoop FileSystem API, you just need to point it at a configuration directory containing those XML files.
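
As a hedged sketch (the connector class and the s3.url / hadoop.conf.dir property names here are assumptions you should verify against your streamx version), a sink configuration pointing at that directory could look like:

```properties
# s3-sink.properties: illustrative only; verify property names for your version
name=s3-sink
connector.class=com.qubole.streamx.s3.S3SinkConnector
tasks.max=1
topics=my-topic                   # hypothetical topic name
flush.size=1000
s3.url=s3://my-bucket/streamx     # hypothetical bucket and prefix
hadoop.conf.dir=/etc/hadoop-conf  # directory holding core-site.xml etc.
```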

Spark could achieve this, but it would require a long-running cluster

Kafka Connect consumers are also typically long-running, as part of a cluster / consumer group.
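
For example, Connect workers are normally started as a long-lived distributed-mode service using the scripts shipped with Kafka, and connector configs are then submitted over the worker's REST API (the JSON file name below is hypothetical):

```sh
# Start a long-running Connect worker in distributed mode (stock Kafka script)
bin/connect-distributed.sh config/connect-distributed.properties

# Submit the S3 sink connector to the worker's REST API (default port 8083)
curl -X POST -H "Content-Type: application/json" \
     --data @s3-sink.json http://localhost:8083/connectors
```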