hyperledger-archives / burrow

https://wiki.hyperledger.org/display/burrow
https://hyperledger.github.io/burrow

[Discuss] Log Events to segment.io #437

Closed compleatang closed 7 years ago

compleatang commented 7 years ago

Problem

Currently it is exorbitantly expensive to maintain the analytics surface area most blockchains are trying to manage. Blockchain clients maintain an entire (bespoke) RPC system, then a client library to that RPC.

On top of that consumers of blockchain clients currently expect to be able to interact with the data produced in that blockchain in a very large number of ways (due to the wide variety of use cases currently being built on eris:db).

This, in turn, requires that we maintain libraries like eris-sqlsol, an ETL layer that pipes event- and collection-based data from the chain into a queryable database.

In the future, as our users' systems increase in size and complexity, the ETL costs are going to rise dramatically. We need a cleaner way of scaling here without having to build all of it ourselves.

Solution

Stop trying to build an ETL layer on top of everything else; rather, bake in a stronger connection to an industry-leading analytics aggregator and use their very connected ETL layer. Such a change would enable customers to route data into systems they already use to analyze business data (or, at a bare minimum, a simple postgres database on premise or wherever they'd like it).

Segment is an advanced data warehousing and integration platform, used by many applications to log and coordinate the different touch points their business may have. By utilizing an API like segment's, we enable our users to do a whole range of interesting things.

Examples of what this would enable

Because segment.io takes much of the ETL trouble out of our hands, and warehouses data only to warehouses controlled by the customer, it would be much more useful to integrate with a service like them than to build our own ETL layer.

What to log?

By default: blocks and transactions.
By selection: a JSON file, similar in schema to what sqlsol uses, that defines:
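The fields of that selection file were not enumerated in this issue. Purely as a hypothetical illustration, such a watch list might be modelled like this in Go (every name below is invented for the sketch):

```go
package config

// WatchList is a hypothetical sketch of the selective-logging file
// described above; the actual fields were never enumerated in this issue.
type WatchList struct {
	Blocks       bool          `json:"blocks"`       // log all blocks (the proposed default)
	Transactions bool          `json:"transactions"` // log all transactions (the proposed default)
	Events       []EventFilter `json:"events"`       // contract events to forward, in the spirit of sqlsol
}

// EventFilter selects events emitted by a particular contract.
type EventFilter struct {
	ContractAddress string   `json:"contract_address"`
	EventNames      []string `json:"event_names"`
}
```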

Notes

In theory this would play nicely with how @silasdavis has designed and built the logging system.

Should we add the feature, a first run exposing all blocks and transactions would comport with what Ben and Tyler agreed would be needed to do sqlsol better anyway.

This would be a very low cost way to increase what users can do with their chains dramatically.

This would be a feature that (I suspect) would typically be turned on via flag or env var rather than written in a config file.

There are a variety of analytics systems that ETL & route, but segment is what we're using for our internal systems, and from all my research it is the most centrally placed, giving us the most benefit for the least cost.

VoR0220 commented 7 years ago

I think this is an excellent way to reduce workload, make our lives a lot easier, and focus our resources on the things we should be focusing on. Great suggestion. ++

AFDudley commented 7 years ago

This, in turn, requires that we maintain libraries like eris-sqlsol, an ETL layer that pipes event- and collection-based data from the chain into a queryable database.

I strongly disagree with this categorization. eris-sqlsol saves web clients from scraping blocks; how would segment.io do that? Do web clients connect to segment.io in this model? A web-client-facing cache isn't an ETL layer, but that same interface can be expanded upon to provide ETL access without having to provide or support an additional interface.

This also doesn't replace the needs of operators to "ship log files". This can be exposed by just using the standards Silas is already looking into. I don't see why this needs to change.

I didn't realize anyone was trying to develop ETL support.

compleatang commented 7 years ago

I didn't realize anyone was trying to develop ETL support.

You heard it on a call on Friday, for one :stuck_out_tongue:. And ETL is a massive need across a range of users, as you've told me multiple times.

I did not characterize this issue as solving the caching problem for all use cases (although in some it would, with a much heftier db behind it than sqlsol's in-memory sqlite provides). In the web client's scenario, they could connect into the postgres, redshift, or bigQuery db however the application maker has things set up, if they needed a direct connection to the DB; or the application maker could use sqlsol (which I'm not advocating that we do away with... yet) instead of this feature.

Of course, application makers can always build their own ETL and caching using solidity events if they have non-standard caching requirements.

A web-client-facing cache isn't an ETL layer, but that same interface can be expanded upon to provide ETL access without having to provide or support an additional interface.

I don't understand this argument. sqlsol is currently doing ETL plus DBMS operations. From eris:db's perspective, ETL features don't exist, so there is still "another interface" from the stack's perspective. I admit this would be a new interface from eris:db's perspective.

However, I don't see why we would want to expand the ETL interface sqlsol currently provides to be able to push to "bigger" DBs (which I assume we can all agree is a necessity for application makers) when we can utilize industry-leading ETL providers who do it every day.

Should we ship this feature and folks find it useful, at that point I would probably advocate that we deprecate support for sqlsol as a standalone library. So it wouldn't be an additional interface, but rather the movement of an optional interface from outside the service to inside it.

This feature is orthogonal to "shipping log files", you are right it does not replace the need to ship log files, nor was it intended to.

AFDudley commented 7 years ago

Yeah, so I'm generally disagreeing with this design philosophy. I suggested that the customer might need ETL support on the call... I think moving eris-sqlsol functionality into ErisDB (exposing blockchain data as relational database data) based on a client ask is the correct solution. Once we have a clear client ask, we can decide whether we want to embed a relational database, use unix sockets, expose a relational DB port, etc. Without a clear use case, it feels like fishing.

I think there is some semantic confusion here. I think we need an ETL from blockchain to SQL, nothing more. If a customer asks for it, we should be able to provide them read-only SQL endpoints that speak the read-only subset of SQL93.

compleatang commented 7 years ago

I think we're slightly talking past each other. And perhaps I need to refactor the problem statement to reflect this.

Segment does this:

event source -- segment's etl -- integrations (over 100 integrations to highly used systems)
                              \- warehousing (3 integrations: redshift, bigQuery, bring your own postgres)

I don't disagree with you that, if we treat it as part of a cache for web clients and other (similar) types of interaction, it's a sub-optimal design. I agree that it could be worthwhile, given a customer request, to explore exposing an SQL93 API/RPC or other ETL options for caching requirements.

The reason I think we're talking past each other is that you seem to be focusing on the primary value of the feature being event - etl - warehouse, whereas I think the primary value is event - etl - integrations. For me, warehousing is a nice-to-have.

To be clear, the primary first customer of this feature will be LEI Chain's Application. The use case we have is that we want LEI Chain Application to be able to do a few things such as:

My view is that plugging in to all of those systems means a lot of low-level code that we can very simply avoid by using an ETL integrator. At least in the first instance, a simple connection to an ETL integration platform like segment is a very cost-effective interface to maintain and enables a broad range of eris:db-built applications to interact with other business systems (in my view at least).

AFDudley commented 7 years ago

Okay.

AFDudley commented 7 years ago

I can't tell if you're trying to persuade me or not; I'm not persuaded. This seems like a sure-fire quagmire in terms of BFT features, RBAC, and more general separation of concerns. If you're just telling me what you're going to do, that's fine.

My view is that plugging in to all of those systems means a lot of low-level code that we can very simply avoid by using an ETL integrator. At least in the first instance, a simple connection to an ETL integration platform like segment is a very cost-effective interface to maintain and enables a broad range of eris:db-built applications to interact with other business systems (in my view at least).

  1. You're marrying eris:db to some proprietary system you have no control over.
  2. Just exposing SQL is easier, is already useful, and has orders of magnitude broader adoption; SQL93 isn't changing anytime soon, and once you expose it properly you're done: no upstream syncing.

Again, when I see a customer ask, from a customer, I'll be happy to explore with them what their needs are and develop a roadmap to those features. This sort of short-circuiting seems like an obvious architectural dead end to me.

VoR0220 commented 7 years ago

But what if their infrastructure isn't using SQL? What if it's NoSQL? Or MongoDB? Or Redis? What about these?

VoR0220 commented 7 years ago

If we were only aiming for SQL, I'd be 100% with you on this, Rick. But the question is: is that all we are after? Is that the smartest move? Is there value in exposing our platform to multiple traditional db systems?

VoR0220 commented 7 years ago

Although the proprietary control is definitely concerning... perhaps it would be better not to marry the two but to expose them as a service. Maybe we could do both: have SQL, which we maintain (because that's most popular), and then open this integration as our "all others" service.

compleatang commented 7 years ago

I'm not trying to convince you @AFDudley because I'm not convinced we're talking about the same value proposition.

I fully accept your first point that segment or any of its competitors have proprietary components.

  1. Just exposing SQL is easier

Here is where I'm confused. How would exposing SQL make it easier for users to trigger event streams they can route into other business systems that application users and builders utilize? (genuine question because I may be missing something fundamental to your argument)

Unless I'm missing something fundamental, exposing SQL from eris:db would still require middleware to listen for and trigger event streams. Users who wanted to connect those event streams to other business systems would still have to build their listening & routing mechanisms for those events individually, per API, for each of their other systems, whereas with an ETL aggregation platform they wouldn't.

I am not following your argument as to how BFT or RBAC have anything to do with users having a clean ability to pipe events into other systems.

If we were talking about using an ETL aggregator instead of a more complete caching mechanism that provides a queryable interface for datasets, then I can see, and do not disagree with, your concerns. However, I do not see a sink for event stream initiation as being on point with dataset caching. Maybe I'm missing something fundamental about your argument, though.

AFDudley commented 7 years ago

@VoR0220 SQL is everywhere; it's in every smartphone and some dumb phones, airplanes, cars, etc. Postgres can and does do all the things you've mentioned. But your core "what if" point here is valid, and it is why I think we need an ask from a customer to have a serious conversation about these types of features.

silasdavis commented 7 years ago

Does eris-db need to be aware of segment at all? Any ETL tool will be able to load data from a SQL store. If we are agreed that we want to expose that (I think we are) then it seems we would automatically be providing basic support for something like segment. With that in mind:

  1. What first-class support (i.e. beyond a SQL event log), if any, for segment are we proposing here?
  2. Where would this support live?
  3. Does any such support need to be in eris-db?
  4. Does any such support need to be in a service running alongside the eris-db node or is it just an orthogonal collection of tools to help importing into segment?

To frame the discussion a bit, and having discussed with Tyler, I am envisaging for our rawest (least derived) form of event log a telemetry log of point-like events with a timestamp, that spurts out of a hose to interested parties that can infer events with duration and system state by combining these simple events. I think initially we are talking about putting this raw log into a SQL database to make it queryable, and this would be part of eris-db. This simpler, rawer SQL schema would then define a pinhole interface into eris-db. Where possible it seems desirable for more derived analytics to be driven off that, with eris-db in blissful ignorance.
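As a minimal sketch of the kind of record described above (all names here are invented for illustration, not taken from eris-db):

```go
package telemetry

import "time"

// PointEvent sketches the "rawest" log record described above: a
// point-like event with a timestamp. Interested parties infer events
// with duration, and system state, by combining these simple events.
type PointEvent struct {
	Timestamp time.Time         // when the event occurred
	Kind      string            // e.g. "block_commit", "tx_execute" (hypothetical names)
	Fields    map[string]string // free-form key/value payload
}
```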

But I have little idea about specific use cases beyond certain basic contract-watching cases @dennismckinnon brought up.

AFDudley commented 7 years ago

Here is where I'm confused. How would exposing SQL make it easier for users to trigger event streams they can route into other business systems that application users and builders utilize?

It's sounding more and more like you folks want to integrate postgres. https://www.postgresql.org/docs/9.1/static/sql-notify.html
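For reference, NOTIFY pairs with LISTEN on the client side. A minimal Go sketch using the lib/pq driver (the connection string and channel name are placeholders, not anything eris:db defines) might look like:

```go
package main

import (
	"fmt"
	"time"

	"github.com/lib/pq"
)

func main() {
	// Placeholder connection string; requires a running Postgres server.
	conninfo := "dbname=eris user=eris sslmode=disable"

	listener := pq.NewListener(conninfo, 10*time.Second, time.Minute, nil)
	if err := listener.Listen("chain_events"); err != nil { // hypothetical channel name
		panic(err)
	}

	// Every `NOTIFY chain_events, 'payload'` on the server arrives here.
	for n := range listener.Notify {
		if n == nil {
			continue // lib/pq sends nil after a reconnect
		}
		fmt.Printf("event on %s: %s\n", n.Channel, n.Extra)
	}
}
```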

Unless I'm missing something fundamental, exposing SQL from eris:db would still require middleware to listen for and trigger event streams.

Agreed. I think this should be more tightly coupled to eris:db than eris-sqlsol currently is, but we are blind men describing an elephant if we don't have a client ask. This very question is what prompted me to start working on a BFT-DB in the first place.

I am not following your argument as to how BFT or RBAC have anything to do with users having a clean ability to pipe events into other systems.

Trust issues (BFT and RBAC) are always core considerations in any blockchain architecture. Can a malicious user break your integration point by spamming a contract? How are permissions on the blockchain side replicated on the "ETL" side?

However, I do not see a sink for event stream initiation as being on point with dataset caching.

Postgres has been handling both of these concerns in production environments for a very long time. Databases do a lot of enterprise business logic... nearly all of it. So whatever we are trying to do, various production databases already do and have been doing since before we were born.

compleatang commented 7 years ago

Good discussion here, but I still think we're all talking at cross-purposes. The starting point and need for this feature are listed above in the LEI Chain section. Functionally, I'm neither advocating that we try to rebuild postgres, nor would postgres solve what I'm talking about.

So, what am I talking about? I'm trying to find an easier way for users who want to pipe events that an eris:db node knows about into other business systems.

I'm not trying to do this via a database sync, nor am I talking about providing any database dynamics. I fear that my invocation of both sqlsol and ETL may be leading the discussion in a direction different than intended.

So before going any deeper, let's just decide whether the following questions are a yes; if the consensus is no, then we should stop the discussion, close the issue, and move on.

Again, this is about event notification. This is not about "what are all the contracts on the chain" or "what block was transaction hash X in".

If we have consensus that the above two questions are yes, then we can talk about the specifics of this issue and how we would implement it; if we do not, then we should not continue wasting time on this discussion.


@silasdavis in response to your points.

  1. segment and other ETL platforms don't really work on SQL event logs (although I assume they could be made to) but rather on arbitrary "track" events which are sent to a collection webhook. To be clear the feature discussion (for me) is less about the specifics of segment and more about the paradigm of enabling easier integration of outbound streams to an events aggregation platform such as segment, stitch, or zapier.
  2. It could either live natively in eris:db, or users could take eris-contracts, spin up a watching node, and route their events from there. If we did implement it in eris:db, functionally the feature would do the following: POST events A, B, C to webhook endpoint URL Z. Nothing more.
  3. No (see 2), it does not, strictly speaking, have to be in eris:db. I think it would be much cleaner to have event streams be pipe-able directly to arbitrary listening webhook endpoints, but given the discussion on this issue, perhaps I'm alone in that view.
  4. I'm not sure I follow your question.
silasdavis commented 7 years ago

@compleatang don't worry about 4, what you've said covers it.

I had also thought about streaming events to something like Kafka/Flume. It ought to be straightforward to write a sink that publishes to the segment tracking API (https://segment.com/docs/sources/server/http/) without eris-db needing to know specifically about Segment. The nice thing about this is it could support many different consumers. I think we ought to be able to support Amazon Kinesis (similar to Kafka) too.
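For instance, a sink that forwards a single chain event to that tracking API could be as small as the following sketch; the mapping of chain events onto track calls is an assumption for illustration, not a settled design:

```go
package sink

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// TrackChainEvent posts one event to Segment's HTTP tracking API
// (https://segment.com/docs/sources/server/http/). Identifying the
// stream by chain ID is a hypothetical choice made for this sketch.
func TrackChainEvent(writeKey, chainID, eventName string, props map[string]interface{}) error {
	payload, err := json.Marshal(map[string]interface{}{
		"userId":     chainID,
		"event":      eventName,
		"properties": props,
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequest("POST", "https://api.segment.io/v1/track", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	req.SetBasicAuth(writeKey, "") // Segment authenticates with the write key as the username
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("segment track failed: %s", resp.Status)
	}
	return nil
}
```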

compleatang commented 7 years ago

Right. So I will refactor the top comment if we all think the following is a viable, maintainable feature to add to eris:db.

Users should be able to provide erisdb with a webhook URL and a list of events to send to that webhook.

On boot, erisdb will read the watch list and, when one of the events occurs, will send the event (just as we currently do with subscribing websockets) via HTTP POST to the provided endpoint.
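A minimal sketch of that behaviour, assuming a generic event subscription channel (the types and names here are hypothetical, not erisdb's actual API):

```go
package webhook

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// Event stands in for whatever erisdb currently pushes to subscribing
// websockets; the concrete type here is hypothetical.
type Event struct {
	Name string          `json:"name"`
	Data json.RawMessage `json:"data"`
}

// Forward reads events from a subscription channel and POSTs the ones on
// the watch list to the user-provided webhook URL, as proposed above.
func Forward(url string, watch map[string]bool, events <-chan Event) {
	for ev := range events {
		if !watch[ev.Name] {
			continue
		}
		body, err := json.Marshal(ev)
		if err != nil {
			continue // skip events we cannot encode
		}
		resp, err := http.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			continue // a production version would retry/log here
		}
		resp.Body.Close()
	}
}
```

A production version would need retries, batching, and back-pressure, but nothing more than this would be required of erisdb itself.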

compleatang commented 7 years ago

Closing this issue and moving it into the RFC system. If the feature is approved via RFC then we can recreate more discrete issues for the work.