singer-io / getting-started

This repository is a getting started guide to Singer.
https://singer.io

Proposal: Webhooks #50

Open awm33 opened 5 years ago

awm33 commented 5 years ago

I'm proposing an additional use case for Singer taps: webhooks. Many APIs / SaaS products offer webhooks as a way to reduce load on their servers and overcome rate limits.

I'm proposing that a new Singer message type be added with the type "WEBHOOK". A WEBHOOK message is consumed by a tap via stdin. A tap would go into webhook mode using the --webhooks option, consuming WEBHOOK messages from stdin and producing SCHEMA, RECORD, and STATE messages on stdout.

Keeping this logic in the same tap as the RESTful API logic allows code to be shared and keeps a highly related set of code (webhook and REST/SOAP/etc.) in the same project.

Having the WEBHOOK message decouples processing webhook messages from receiving HTTP requests. A webserver implementation is responsible for handling requests, with or without authentication, and for optional queuing or dead-lettering. A message via stdin also allows for toolchaining, such as placing a queue or recorder in between.

The WEBHOOK message would contain the following fields:

The webserver implementation can vary and can easily be customized or incorporated into other systems. A basic reference webserver can be created to give the community some base code and functionality, while individuals can implement webservers of varying complexity. Advanced webservers may be backed by Kafka, dynamically load and cache taps, etc.

The webserver should handle basic authentication, usually checking some sort of token added to the URL or a header. The webserver can optionally implement a dead letter queue, which can pass failed messages along later if the authentication issue is transient. Passing the full headers, url, and body allows for tap-specific passive (always 2xx) authentication such as GitHub's webhook security method.
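For illustration only, here is a rough sketch of what a tap's --webhooks loop might look like. The field names used below (headers, url, body) are assumptions based on the description above, not a finalized spec:

import json
import sys

def run_webhook_mode(process_webhook):
    # Hypothetical --webhooks mode: read WEBHOOK messages from stdin and emit
    # SCHEMA / RECORD / STATE messages on stdout. An incoming line might look like:
    # {"type": "WEBHOOK", "url": "/orders", "headers": {...}, "body": "{...}"}
    for line in sys.stdin:
        message = json.loads(line)
        if message.get("type") != "WEBHOOK":
            continue
        for singer_message in process_webhook(
            headers=message.get("headers", {}),
            url=message.get("url"),
            body=message.get("body"),
        ):
            sys.stdout.write(json.dumps(singer_message) + "\n")
            sys.stdout.flush()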

The most basic implementation may be:

singer-webhook-server | tap-shopify -c config.json --webhooks

More advanced multi-tap webservers are also possible.
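To make that pipeline concrete, a development-only reference webserver could be as small as the sketch below (standard library only; the WEBHOOK field names are still just my proposal). It turns every POST into a WEBHOOK message on stdout so it can be piped straight into a tap; a production implementation would add authentication, queuing, dead-lettering, etc., as described above:

import json
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw request body and emit it as a WEBHOOK message on stdout.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        message = {
            "type": "WEBHOOK",
            "url": self.path,
            "headers": dict(self.headers),
            "body": body,
        }
        sys.stdout.write(json.dumps(message) + "\n")
        sys.stdout.flush()
        # Passive acknowledgement: always 2xx, leave auth to the tap or a real server.
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()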

dmosorast commented 5 years ago

To kick off the discussion, I have some quick first thoughts off the top of my head. I like the idea, since webhooks aren't quite solved by Singer as it stands, but I'm having trouble with where it fits into the Singer paradigm (Unix-style, one-responsibility, process pipes).

I'm not sure I understand the need to add a mode to the tap in order to support this. A webhook should be a relatively simple thing, with perhaps some data typing, config, etc., much like a tap would have. However, given the short-lived-process nature of the tap -> target relationship vs. the long-running nature of a web server, this sounds like it would be something other than a tap. I worry about muddying the concepts too much by mixing the two.

In order to keep it simple, I could see something like the structure being [tap|webhook-server] -> target, where maybe the webhook server's implementation has a kind of supervisor that makes sure the target's pipe remains open, or restarts if it dies, etc.
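For example, a rough sketch of that supervisor idea, assuming the webhook server writes Singer messages into the target's stdin and simply restarts the target if it exits:

import subprocess

class TargetSupervisor:
    """Keeps a long-running target process alive behind a webhook server (illustrative only)."""

    def __init__(self, target_args):
        self.target_args = target_args  # e.g. ["target-csv", "-c", "target_config.json"]
        self.target = None

    def _ensure_running(self):
        if self.target is None or self.target.poll() is not None:
            # (Re)start the target and keep its stdin pipe open for Singer messages.
            self.target = subprocess.Popen(self.target_args, stdin=subprocess.PIPE, text=True)

    def send(self, singer_message_line):
        self._ensure_running()
        self.target.stdin.write(singer_message_line + "\n")
        self.target.stdin.flush()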

awm33 commented 5 years ago

@dmosorast Great!

The key thing in my proposal is separating the concerns of handling requests and processing the webhooks by using a core Singer concept: passing messages via stdin/stdout.

I'm not sure I understand the need to add a mode to the tap in order to support this.

Somehow the tap needs to know to open up stdin rather than begin a sync cycle hitting APIs. I think of --discover and --catalog as trigger modes / commands. Really, I wish Singer used git-style commands, such as discover and sync, where webhooks or process-webhooks would be another command / mode.

However, given the short-lived-process nature of the tap -> target relationship vs. a long-running process nature of a web server, this sounds like it would be something other than a tap.

Again, I don't think the webserver should be part of the tap, and webserver specifics should be left up to webserver implementations. The tap is only expected to take in WEBHOOK messages from stdin and produce the common Singer messages on stdout. So there is something other than a tap: a third thing, reusable across taps, the webhook webserver.

Using the message also makes testing and replaying webhooks easier, since it provides a message format for recording requests, allowing tooling to be built around that.

A production-ready webserver may need to Popen the tap and targets, and potentially kill and recycle them in case they leak memory. It's also not that uncommon to kill and restart long-running Python processes. Or, if someone places a message queue (SQS, Kafka, Google Pub/Sub, etc.) in between, they may just run the tap every 10 minutes. I like that the decoupling allows for many options and different infrastructures, while everyone can easily share tap logic. A development webserver could be created that is as simple as possible and not intended for production use. Sort of like using Flask's built-in webserver for dev and gunicorn for prod.
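As a rough sketch of that queue-in-between variant (the --webhooks flag and WEBHOOK shape are, again, just this proposal; the queue client itself is left out):

import json
import subprocess

def drain_queue_into_pipeline(webhook_payloads, tap_args, target_args):
    # webhook_payloads: an iterable of dicts (url/headers/body) pulled from SQS, Kafka, etc.
    # Runs tap | target once per batch, e.g. from a cron job every 10 minutes.
    tap = subprocess.Popen(tap_args, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    target = subprocess.Popen(target_args, stdin=tap.stdout, text=True)
    for payload in webhook_payloads:
        tap.stdin.write(json.dumps({"type": "WEBHOOK", **payload}) + "\n")
    tap.stdin.close()   # end of batch: the tap finishes and closes its stdout
    tap.stdout.close()  # drop our copy of the pipe so the target sees EOF
    tap.wait()
    target.wait()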

awm33 commented 5 years ago

Another reason for decoupling: imagine using a Lambda function. Someone could write an adapter that turns Lambda API Gateway events into WEBHOOK messages and runs the tap and target. Lambda would keep it warm for a while and kill it if there's no activity. API targets like Stitch in particular would work well here, since the destination has an underlying queue (Kafka) that would prevent thrashing the destination (like Redshift) with many connections.
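A rough sketch of that adapter, assuming an API Gateway proxy-integration event and a tap/target bundled with the function (tap-shopify / target-stitch are just placeholders):

import json
import subprocess

def handler(event, context):
    # Convert the API Gateway proxy event into a WEBHOOK message (field names assumed).
    webhook_line = json.dumps({
        "type": "WEBHOOK",
        "url": event.get("path"),
        "headers": event.get("headers") or {},
        "body": event.get("body"),
    })
    # One-shot tap | target run; a warm Lambda could keep these processes alive instead.
    pipeline = subprocess.Popen(
        "tap-shopify -c config.json --webhooks | target-stitch -c target_config.json",
        shell=True, stdin=subprocess.PIPE, text=True)
    pipeline.communicate(webhook_line + "\n")
    return {"statusCode": 200, "body": "ok"}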

Point being, I think there are so many possibilities for handling the webhook requests themselves, but it all boils down to getting an HTTP request to a tap.

ghost commented 5 years ago

I think this violates the batch/ephemeral nature of the Singer tap -> target workflow. If the desire is to handle an HTTP request containing a payload and ultimately route it to a target, then my suggestion would be to do something like Snowplow's batch collector (a tracking pixel stored in an S3 bucket, served by a CloudFront distribution which writes its logs back to S3), then write a tap to parse the CloudFront log and feed Singer messages to stdout.
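Roughly, the log-parsing tap half of that could look like this sketch, assuming CloudFront's W3C-style access logs (a "#Fields:" header line followed by tab-separated rows):

import json
import sys

def parse_cloudfront_log(lines):
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header line names the columns for the rows that follow.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            record = dict(zip(fields, line.split("\t")))
            yield {"type": "RECORD", "stream": "cloudfront_requests", "record": record}

if __name__ == "__main__":
    for message in parse_cloudfront_log(sys.stdin):
        print(json.dumps(message))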

awm33 commented 5 years ago

@rsmichaeldunn

I think this violates the batch/ephemeral nature of the singer tap->target workflow

Singer is a stream-oriented protocol. It calls a series of records a "stream". The name tap even comes from thinking of data as streams: a tap is "a device consisting of a spout and valve attached to the end of a pipe to control the flow of a fluid". So I wouldn't say Singer has a batch nature; it just tends to be used in batches since it's most often used to pull from APIs.

As far as the ephemeral nature goes, I think that's just because of the limited use so far. I believe the protocol itself extends beyond small batches. I'm also saying that HTTP request handling and processing the request should be separated, which makes the webhook request just a data structure passed from stdin to the tap.

Your example is interesting, though I don't think it would work for webhooks since AFAIK CloudFront doesn't log POST bodies. But, say it did, the WEBHOOK message would allow you to use that infrastructure for any webhook-supporting tap, rather than requiring everyone using that tap to use AWS CloudFront.

mbilling commented 5 years ago

The paradigm shift from batch to stream to realtime stream is interesting.

Assuming the webserver endpoint relies on the sender of the webhook retrying if the tap -> target process fails, then it would make sense to do what you describe.

But assuming realtime delivery, where we almost always return status 200 and take full responsibility for a webhook request, then we need to add a buffer, hence the Snowplow/S3 proposal. I would suggest a set of tap protocol readers/writers for whatever intermediate target you prefer (disk/S3/etc.).

singer-io-(destination)-writer and singer-io-(source)-reader; we'd need these for Node.js, Python, Ruby, etc.

S3 could be the first set of source/destination
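For instance, an S3-backed pair could be as simple as this sketch (boto3 is just one possible backend; the function names are hypothetical):

import sys
import boto3

# Usage idea: `tap-x ... | writer`, then later `reader | target-y ...`

def singer_s3_writer(bucket, key):
    # Buffer Singer messages from stdin and persist them as one S3 object.
    body = sys.stdin.read()
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

def singer_s3_reader(bucket, key):
    # Replay previously persisted Singer messages to stdout for a target to consume.
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    sys.stdout.write(obj["Body"].read().decode("utf-8"))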

awm33 commented 5 years ago

@mbilling Without the sender retrying, I would not consider the webhook reliable. Network partitions in particular are an issue there, but there are also AWS systems being temporarily unavailable, errors in code, etc. It's pretty common for webhook implementations to retry, even with exponential backoff. I think they usually just use SQS. On the receiving side, high availability would probably require a queue in between.

This issue is only about protocol / spec changes, the open standard, not specific implementations. Most Singer taps (all at this point?) are in Python, but they don't have to be. The spec is basic enough that you could create, say, a Java tap pretty easily; it just needs to emit a few different JSON structures to stdout. Python can easily run in lots of different infrastructure, from cron on a VM to Docker. Webservers with high availability usually rely on cloud-specific services like ELB and SQS. I'd like this new message type, passed through stdin, to avoid locking users of an open standard into a specific cloud technology, or into only a very specific open source stack without the ability to utilize easy-to-use managed services.

mbilling commented 5 years ago

I agree that it has to be cloud provider/language agnostic. Protocol and format only. Collecting realtime data from IoT devices or browsers, etc., could be supported too.

awm33 commented 5 years ago

@mbilling Oh definitely, you could apply the Singer protocol to streaming. You could even do it now, say by creating a tap-salesforce-streaming which runs continuously connected to the Salesforce Streaming API, since in that scenario, similar to hitting the REST API, the tap is initiating communications with the source of data. A webhook is the other way around.

Webhooks require the source of data (usually a SaaS provider) to initiate communications on every new data event/message. I think this is unique enough technologically (inbound data events/messages via HTTP requests), but common enough in practice (many SaaS providers offer webhooks), that creating a new message type specifically for webhooks makes sense.

awm33 commented 5 years ago

Another thing to consider: some SaaS providers require webhooks to be registered programmatically. How would a tap handle that? Use another option like --register-webhooks? How would it know what the URL(s) would be? Pass them or a base URL as an option?
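Hypothetically (none of this is in the spec), a --register-webhooks mode might look something like this sketch, with the base URL passed as an option; the endpoint and payload shape vary per provider and are made up here:

import requests

def register_webhooks(config, base_url, streams):
    # Hypothetical --register-webhooks mode: tell the provider where to POST events.
    for stream in streams:
        requests.post(
            f"{config['api_url']}/webhooks",
            headers={"Authorization": f"Bearer {config['api_token']}"},
            json={"topic": stream, "address": f"{base_url}/{stream}"},
        ).raise_for_status()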

chaholl commented 4 years ago

Interesting idea. Dealing with sender retries would require some internal state to prevent duplicate output. I suspect that'll run into threading problems pretty quickly if the server is under any load. I guess you can deal with that internally within the web server, but it adds a whole load of complexity.

What about a message bus tap as a simple resilient solution? It can read from something like RabbitMQ and anything can push the messages in there in the first place. Maybe that's a whole other proposal :)
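A rough sketch of such a message bus tap, assuming RabbitMQ via the pika client and a queue that already holds JSON webhook payloads:

import json
import sys
import pika

def run(queue_name="webhooks"):
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    def on_message(ch, method, properties, body):
        # Emit each queued payload as a Singer RECORD and ack it only once written.
        record = {"type": "RECORD", "stream": queue_name, "record": json.loads(body)}
        sys.stdout.write(json.dumps(record) + "\n")
        sys.stdout.flush()
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=queue_name, on_message_callback=on_message)
    channel.start_consuming()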

awm33 commented 4 years ago

@chaholl I think those are implementation details outside of the spec. With the spec, I'm looking to separate the web server and the processing of messages. There may be some simplistic reference server created, but a prod implementation would likely use a webserver and a queue (SQS, Kafka, RabbitMQ, etc.). We could add it to the message spec, but yes, an ID could be used for deduping. I've typically either 1) used a per-webhook event ID if one is available from the webhook message, or 2) used a UUID assigned when the message is received, so idempotency can be maintained downstream.
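For example, that ID choice could be as simple as this sketch (the payload field names are assumed):

import uuid

def webhook_id(payload):
    # Prefer the provider's own event ID when present; otherwise mint a UUID on receipt
    # so downstream deduplication/idempotency has something stable to key on.
    return payload.get("id") or payload.get("event_id") or str(uuid.uuid4())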

This is all implementation specific though; there could be many different prod implementations. The spec should be about high-level workflow and message standards.

rafalkrupinski commented 2 years ago

Why does it need a new message type? What's wrong with a webhook listener emitting a schema configured by the user, and then the incoming messages, after validation, as records?