snowplow / snowbridge

For replicating streams across clouds, accounts and regions

High performance pluggable extension points #48

Open jrluis opened 3 years ago

jrluis commented 3 years ago

A cheap way to push logs or metrics to a centralized point is to pipe the logs/metrics to stream replicator with a command like:

tail -f /var/log/system.log | ./build/output/darwin/cli/stream-replicator

This example references logs and metrics, but it doesn't need to be just this kind of information; it can be any information.

For context, each metric in AWS costs $0.30 per month, which is quite expensive given that we already collect thousands of metrics and the direction is to collect even more.

Currently, adding new sources, targets or transformations requires changing the stream replicator code.

As a stream replicator operator, it would be cool to be able to add new sources, targets or transformations without having to generate a new stream replicator binary.

The log shipping issue could be solved by building two plugins: one would be an HTTP target, and the other a logs-to-Snowplow-event transformation.

The plugins could be built in any language and would communicate with stream replicator using operating system named pipes. Named pipes matter here because, for cross-process communication, they are almost as fast as in-process communication. A protocol would also have to be designed for stream replicator to communicate with the plugins.
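No such protocol exists yet, but as a rough illustration, the plugin side of a hypothetical newline-delimited JSON protocol over two named pipes could look something like the sketch below. The pipe paths and message fields are made up for the example.

```go
// Minimal sketch of a plugin reading requests from one named pipe and writing
// responses to another, using a hypothetical newline-delimited JSON envelope.
// Nothing here is part of stream replicator today.
package main

import (
	"bufio"
	"encoding/json"
	"os"
	"strings"
)

// message is a hypothetical envelope for data exchanged with the plugin.
type message struct {
	ID   string `json:"id"`
	Data string `json:"data"`
}

func main() {
	// The pipes would be created beforehand, e.g. with `mkfifo` or syscall.Mkfifo.
	in, err := os.Open("/tmp/sr-plugin-in") // requests from stream replicator
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.OpenFile("/tmp/sr-plugin-out", os.O_WRONLY, 0) // responses back
	if err != nil {
		panic(err)
	}
	defer out.Close()

	enc := json.NewEncoder(out) // Encode appends a newline after each message
	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		var msg message
		if err := json.Unmarshal(scanner.Bytes(), &msg); err != nil {
			continue // skip malformed lines in this sketch
		}
		// Example transformation: upper-case the payload before sending it back.
		msg.Data = strings.ToUpper(msg.Data)
		if err := enc.Encode(&msg); err != nil {
			panic(err)
		}
	}
}
```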

As part of a strategy to license/open-source stream replicator to customers, this feature would also allow customers to build extensions that integrate with their in-house technologies. It would work like a framework: Snowplow would advise and help customers as they build their own stream consumers, and maybe Snowplow could even manage the deployment of the customers' custom consumers.

jbeemster commented 3 years ago

I think the "source" part is technically already covered, as we have a stdin option which can listen to a named pipe. This would let someone bring their own HTTP server, as long as it could write to a named pipe itself. NGINX or other proxy systems might already solve this part of the equation quite neatly. A formal HTTP source would be nice from a self-containment point of view though.
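As a rough sketch of the "bring your own HTTP server" idea, a standalone process could append each request body as a line to a named pipe that the stdin source then reads (e.g. `cat /tmp/sr-http-in | ./stream-replicator`). The pipe path and port below are illustrative only.

```go
// Sketch of a tiny HTTP listener that forwards request bodies, one per line,
// into a named pipe for the stdin source to consume.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"sync"
)

func main() {
	// Assumes the FIFO was created beforehand, e.g. `mkfifo /tmp/sr-http-in`.
	pipe, err := os.OpenFile("/tmp/sr-http-in", os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer pipe.Close()

	var mu sync.Mutex // serialise writes from concurrent request handlers

	http.HandleFunc("/", func(rw http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(rw, "bad request", http.StatusBadRequest)
			return
		}
		mu.Lock()
		defer mu.Unlock()
		// One event per line, matching the tail -f style input above.
		pipe.Write(append(body, '\n'))
		rw.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```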

The transformations are really nice though, as they make it possible to plug any bit of code between the source and target. It would be possible to write quite complex filtering or transformation logic in whatever language is best suited to the job, and it would only need to implement reading from stdin and writing to stdout/err as the interface - which also makes testing these bits of code very simple.
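To show how small that interface could be, a transformation plugin might amount to little more than the following sketch (the filtering and rewriting rules are just illustrations):

```go
// Minimal stdin/stdout transformation sketch: read one event per line from
// stdin, filter or transform it, write the result to stdout, report errors
// on stderr.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	for scanner.Scan() {
		line := scanner.Text()
		// Example filter: drop debug-level log lines entirely.
		if strings.Contains(line, "DEBUG") {
			continue
		}
		// Example transformation: normalise whitespace before passing it on.
		fmt.Fprintln(out, strings.Join(strings.Fields(line), " "))
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

A program like this could sit directly in the same kind of shell pipeline as the tail -f example above, which is also what makes it so easy to test in isolation.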