risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0

Feat Request: HTTP polling source #16025

Open stdrc opened 6 months ago

stdrc commented 6 months ago

Is your feature request related to a problem? Please describe.

A community user pointed out that in some simple use cases, users may already have a service that provides a web API for polling events, so it would be nice to have HTTP polling source support in RW. In such use cases, setting up CDC or Kafka can be overkill.

Also, many web apps provide polling APIs, e.g. instant messaging apps. It would be easier to integrate with these APIs if RW directly supported an HTTP polling source.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

tabVersion commented 6 months ago

I'm not sure if exposing RW directly as a service is desirable. The internet only provides best-effort guarantees, and we cannot recover previous HTTP requests from the network interface. This could result in data loss, particularly during cluster recovery processes where requests that have already reached the RW cluster may be lost, possibly even after they have been responded to. Additionally, we would need to implement some traffic balancing strategies before each CN receives HTTP requests, which could further complicate our deployment.

stdrc commented 6 months ago

No, it's not exposing RW as an HTTP service. It's having a task inside RW poll an outside HTTP service to get "updates" or "events". I think that's basically how we currently work with Kafka. The outside HTTP service may or may not have some mechanism to set the consuming offset, but that can be discussed. Maybe we can just treat such a polling source as a non-recoverable, append-only source: what we get is what we get, and whatever we don't get (due to a network issue or something else) simply doesn't exist as far as RW is concerned.
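To make the idea concrete, here is a minimal sketch of such a polling task, written in Python purely for illustration (it is not RW code). The names `poll_events`, `fetch`, and the cursor-based offset are all hypothetical; the sketch assumes the outside service may optionally hand back a cursor for the next poll, and otherwise behaves as a non-recoverable append-only stream.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical shape of one poll result: (events, next_cursor).
# next_cursor is None when the service has no offset mechanism.
PollResult = Tuple[List[Dict], Optional[str]]


def poll_events(
    fetch: Callable[[Optional[str]], PollResult],
    interval_s: float = 1.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict]:
    """Yield events from repeated polls as an append-only stream."""
    cursor: Optional[str] = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            events, next_cursor = fetch(cursor)
        except OSError:
            # Network failure (urllib.error.URLError is an OSError).
            # Nothing is replayed on our side: the next poll reuses the
            # last known cursor, and if the service has no cursor, the
            # events we missed are simply never seen.
            sleep(interval_s)
            continue
        # Append-only: events are emitted in arrival order, never retracted.
        yield from events
        if next_cursor is not None:
            cursor = next_cursor
        sleep(interval_s)
```

In this sketch `fetch` would wrap the actual HTTP GET; passing it in as a callable keeps the loop independent of any particular HTTP client.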

xxchan commented 6 months ago

We've already added an MQTT source (https://github.com/risingwavelabs/risingwave/pull/15388), which cannot be rewound and replayed either, so this shouldn't block an HTTP polling source.

Exposing RW as an HTTP service (webhook?) is push-based. Considering integrations with more systems, I feel webhooks are more widely used than polling sources (I'm not so sure though). But its implementation will differ more from other sources.

Another common issue is no standard schema (jsonb?)


But its implementation will differ more from other sources.

Edit: Maybe it's not that different. It just polls from socket. 🤔

Edit again:

Just realized MQTT is also push-based, but the client library provides a poll API on its EventLoop. So it's not a big deal whether the protocol is push or pull: adding an internal channel is enough to turn one into the other. 🤡
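As a rough illustration of the "internal channel" point, the sketch below (Python, not RW code) turns a push-style callback into a poll-style API with a bounded queue. `PushToPollAdapter`, `on_message`, and `poll` are hypothetical names, and dropping messages when the channel is full is just one possible policy chosen for the example.

```python
import queue
from typing import Optional


class PushToPollAdapter:
    """Bridge a push-based producer to a pull-based consumer."""

    def __init__(self, capacity: int = 1024) -> None:
        # The internal channel: a bounded, thread-safe FIFO queue.
        self._channel: "queue.Queue[bytes]" = queue.Queue(maxsize=capacity)

    def on_message(self, payload: bytes) -> bool:
        """Push side: called by a webhook handler or client callback.

        Returns False if the message was dropped because the channel is full.
        """
        try:
            self._channel.put_nowait(payload)
            return True
        except queue.Full:
            return False

    def poll(self, timeout_s: float = 0.0) -> Optional[bytes]:
        """Pull side: return the next message, or None if none is available."""
        try:
            if timeout_s > 0:
                return self._channel.get(timeout=timeout_s)
            return self._channel.get_nowait()
        except queue.Empty:
            return None
```

A webhook handler would call `on_message` for each incoming request body, while the source reader calls `poll` in its own loop, so the reader looks the same whether the upstream protocol is push or pull.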

stdrc commented 6 months ago

Another common issue is no standard schema (jsonb?)

I think the user can specify the schema in the source definition?

For event payload encoding, a benefit of an HTTP polling source is that we can determine the content encoding from the Content-Type header in the response.

Also, we may need to allow setting request headers in the WITH options.
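A small Python sketch of both points, for illustration only: `build_request` attaches user-supplied headers to the poll request, and `decode_events` picks a decoder from the response's Content-Type. Both function names are hypothetical, and the set of media types handled here (including the commonly used but unregistered `application/x-ndjson`) is an assumption for the example, not a proposed list.

```python
import json
import urllib.request
from email.message import Message
from typing import Any, Dict, List


def build_request(url: str, headers: Dict[str, str]) -> urllib.request.Request:
    """Build the poll request with user-supplied headers (e.g. auth tokens)."""
    return urllib.request.Request(url, headers=headers, method="GET")


def decode_events(body: bytes, content_type: str) -> List[Any]:
    """Decode a response body into events based on its Content-Type."""
    # Reuse the stdlib header parser to split media type and charset.
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    charset = msg.get_content_charset() or "utf-8"
    text = body.decode(charset)

    if media_type == "application/json":
        parsed = json.loads(text)
        # A top-level array is treated as a batch; anything else is one event.
        return parsed if isinstance(parsed, list) else [parsed]
    if media_type == "application/x-ndjson":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if media_type.startswith("text/"):
        # Plain text: one event per non-empty line.
        return [line for line in text.splitlines() if line]
    raise ValueError(f"unsupported Content-Type: {media_type}")
```

An unknown media type raises here rather than guessing; a real source might instead fall back to an encoding declared in the source definition.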

github-actions[bot] commented 4 months ago

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

BugenZhao commented 3 months ago

Just FYI: Below are the web connectors supported by a newly-emerged streaming system: