risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

Support HDFS sink #17035

Open zhanglistar opened 5 months ago

zhanglistar commented 5 months ago

Is your feature request related to a problem? Please describe.

No. It's a common scenario: we use Flink as a stream processor to run ETL jobs that save data to Hive.

Describe the solution you'd like

Support HDFS sink.

Describe alternatives you've considered

None.

Additional context

We want to try using RW instead of Flink to save cost, but RW currently lacks this feature.

fuyufjh commented 5 months ago

We are now working on a file sink, which supports multiple kinds of storage systems, including AWS S3 and HDFS. cc @wcy-fdu Can you help link this issue to that?

wcy-fdu commented 5 months ago

Hi @zhanglistar, glad to see you're interested in RisingWave. Sinking to the file system is on our roadmap. Could you please elaborate on your requirements for the HDFS sink? For example, the sink file format/type, whether the files need to be batched, etc.

zhanglistar commented 5 months ago

@wcy-fdu @fuyufjh Thanks for your reply. The background is that we want to replace Apache Flink with RisingWave to lower costs. Several types of jobs run on Flink:

1) ETL jobs that sink to Hive, with the data on HDFS. The file format is Parquet, and writes need to be batched.
2) Java DataStream API jobs. This part is hard; I talked with @yingjunwu, and there is no plan for it.
3) Flink SQL jobs. This part is the simplest.

We need to find out how much resource RW can save. Thanks. If you need more information, just tell me. We can also contribute to the community if it is worth doing.

lmatz commented 5 months ago

need to be batched.

What is the criterion for batching: the number of rows, or seconds?

zhanglistar commented 5 months ago

need to be batched.

What is the criterion for batching: the number of rows, or seconds?

By seconds.
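
To make "batched by seconds" concrete, here is a minimal Python sketch of time-based batching. It is only an illustration under assumptions: `TimeBatcher`, `interval_s`, and the `flush` callback are hypothetical names, not RisingWave or Flink APIs, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class TimeBatcher:
    """Hypothetical helper: groups rows into batches by elapsed seconds."""

    interval_s: float
    # Hypothetical writer callback, e.g. "write one Parquet file per batch".
    flush: Callable[[List[Any]], None]
    _rows: List[Any] = field(default_factory=list)
    _window_start: float = 0.0

    def add(self, row: Any, now: float) -> None:
        """Add a row observed at time `now` (in seconds)."""
        if not self._rows:
            # First row of a new batch opens the time window.
            self._window_start = now
        elif now - self._window_start >= self.interval_s:
            # Window elapsed: emit the current batch, then start a new one.
            self.flush(self._rows)
            self._rows = []
            self._window_start = now
        self._rows.append(row)

    def close(self) -> None:
        """Emit whatever is left when the stream ends."""
        if self._rows:
            self.flush(self._rows)
            self._rows = []
```

With a 10-second interval, rows arriving at 0 s and 5 s land in one batch, and a row arriving at 10 s starts the next one. A row-count criterion would instead compare `len(self._rows)` against a limit.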

wcy-fdu commented 5 months ago

We have received your request and will support HDFS sink in the next two to three releases.

zhanglistar commented 5 months ago

@wcy-fdu Do you plan to support Apache Hive sink in the next two to three releases?

wcy-fdu commented 5 months ago

We have no plans for a Hive sink now, but the HDFS sink will be there. Contributions welcome 🙌

zhanglistar commented 4 months ago

@wcy-fdu Looking forward to the HDFS sink. Thanks a lot. We may add a Hive sink later.

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

wcy-fdu commented 1 month ago

After discussion, we have not yet implemented the HDFS sink, because HDFS depends heavily on the Hadoop environment. You can use the webhdfs sink as a workaround to write to HDFS.
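
For readers unfamiliar with WebHDFS, the sketch below shows why it avoids a local Hadoop environment: files are written over plain HTTP. It illustrates the WebHDFS REST protocol only, not RisingWave's webhdfs sink configuration. The function names, host, and path are hypothetical, and an unsecured cluster is assumed (no authentication parameters).

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_create_url(host: str, port: int, hdfs_path: str,
                       overwrite: bool = False) -> str:
    """Build the NameNode URL for the WebHDFS CREATE operation."""
    query = urlencode({"op": "CREATE", "overwrite": str(overwrite).lower()})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def upload(host: str, port: int, hdfs_path: str, data: bytes) -> int:
    """Write `data` to HDFS using the two-step WebHDFS CREATE flow."""
    # Step 1: PUT to the NameNode without a body. It answers with a
    # 307 redirect whose Location header points at a DataNode.
    url = urlsplit(webhdfs_create_url(host, port, hdfs_path))
    namenode = http.client.HTTPConnection(url.hostname, url.port)
    namenode.request("PUT", f"{url.path}?{url.query}")
    resp = namenode.getresponse()
    location = resp.getheader("Location")
    resp.read()
    namenode.close()
    if resp.status != 307 or location is None:
        raise RuntimeError(f"unexpected NameNode response: {resp.status}")

    # Step 2: PUT the file content to the DataNode URL from step 1.
    target = urlsplit(location)
    datanode = http.client.HTTPConnection(target.hostname, target.port)
    datanode.request("PUT", f"{target.path}?{target.query}", body=data)
    status = datanode.getresponse().status
    datanode.close()
    return status  # WebHDFS reports 201 Created on success
```

For example, `webhdfs_create_url("namenode.example.com", 9870, "/warehouse/t/part-0.parquet")` builds the step-1 URL. Port 9870 is the default NameNode HTTP port in Hadoop 3; the actual host, port, and any authentication depend on the cluster.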