threefoldtecharchive / rivine

Blockchain technology for creating custom chains.
Apache License 2.0

provide GraphQL explorer module for rivine (shorthand: q) #612

Open GlenDC opened 5 years ago

GlenDC commented 5 years ago

As a first phase of working on the big R&D story #605, a nice first step would be to provide a new module that works completely independently and provides a full query/subscription (so no mutation) schema for all our queries.

This first phase would allow us to develop a complete schema to query our data, play with these queries, experiment with the subscription model (useful for light clients) and test its limits, all without breaking or changing any existing features.

Once completed, it would also allow light clients to already start supporting this new GraphQL endpoint (still served over HTTP(S)), as the phases that follow (also for #605) will change the internals of rivine and should not impact the API for users (see: light clients).

GlenDC commented 5 years ago

For the technology I think https://github.com/graphql-go/graphql will be the way to go.

A very useful resource to bootstrap myself into GraphQL was https://graphql.github.io/learn/

Over the course of this issue I think it would also be useful to try to apply some of the best practices from https://graphql.github.io/learn/best-practices/.

GlenDC commented 5 years ago

Added an initial, still purely theoretical, attempt at defining a schema, using static queries only for now:

https://github.com/threefoldtech/rivine/blob/3f48bd994ff106292a2c0354ec324ac2dc730e35/modules/explorergraphql/schema.graphql

I think it should be enough to start playing with a first quick implementation, with the goal of having, as soon as possible, a small web frontend that lets me play with queries and iterate on this initial attempt.

GlenDC commented 5 years ago

https://github.com/99designs/gqlgen looks like it has potential. It allows us to generate the server code directly from the GraphQL schema, letting us focus on the business logic and ensuring the code stays in sync with our API definition.
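As a hypothetical illustration of that workflow (the `Block` type, field names and `blockByID` helper are assumptions, not the actual rivine schema): gqlgen would generate a resolver interface from the schema, and only the method bodies are hand-written:

```go
// Sketch of the gqlgen schema-first workflow. QueryResolver stands in for
// what gqlgen would generate from the GraphQL schema; only the method body
// is business logic we write ourselves. All names here are illustrative.
package explorergraphql

import "context"

// Block is the Go model gqlgen would bind to the schema's Block type.
type Block struct {
	ID     string
	Height int
}

// QueryResolver mimics the interface gqlgen generates from the schema.
type QueryResolver interface {
	Block(ctx context.Context, id string) (*Block, error)
}

type queryResolver struct{}

var _ QueryResolver = queryResolver{} // compile-time interface check

// Block resolves a hypothetical `block(id: String!): Block` query.
func (queryResolver) Block(ctx context.Context, id string) (*Block, error) {
	return blockByID(ctx, id) // delegate to the explorer database
}

// blockByID is a stand-in for the actual database lookup.
func blockByID(_ context.Context, id string) (*Block, error) {
	return &Block{ID: id}, nil
}
```

The advantage over a code-first library is that the schema file remains the single source of truth for the API contract.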

GlenDC commented 5 years ago

Since https://github.com/threefoldtech/rivine/commit/70e12295de0119cd8f0b52ef0a0637ba99f1fdd0 the first phase of the GraphQL explorer is starting to be feature complete, at least at the database level.

Objects can be looked up (wallets, contracts, blocks, transactions and outputs), their information can be requested, and everything seems to work and resolves in a lazy manner.
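To illustrate what "resolves in a lazy manner" means here, a minimal sketch under assumed names (none of these types are the actual rivine code): linked objects are only fetched from the database when the client's query selects the corresponding field.

```go
// Sketch of lazy resolution: a transaction stores only the IDs of its
// outputs, and the Outputs field resolver loads the full objects from the
// database only when a query actually selects that field.
package explorergraphql

import "context"

type Output struct {
	ID    string
	Value string
}

type Transaction struct {
	ID        string
	OutputIDs []string // stored links only, no embedded objects
}

// OutputStore is an assumed interface over the explorer database.
type OutputStore interface {
	OutputByID(ctx context.Context, id string) (*Output, error)
}

type transactionResolver struct {
	db OutputStore
}

// Outputs is only invoked when the client's query selects the field,
// so unqueried links never touch the database.
func (r transactionResolver) Outputs(ctx context.Context, txn *Transaction) ([]*Output, error) {
	outputs := make([]*Output, 0, len(txn.OutputIDs))
	for _, id := range txn.OutputIDs {
		out, err := r.db.OutputByID(ctx, id)
		if err != nil {
			return nil, err
		}
		outputs = append(outputs, out)
	}
	return outputs, nil
}
```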

What are the next steps?

  1. We need to have some regular meetings with @robvanmieghem, @LeeSmet, @DylanVerstraete and anyone else who is interested (e.g. @zaibon), starting with a first meeting today, where I can show the current state, we can have a Q&A, and we can gather feedback on where to go from there.

  2. I need to iterate on the API. It is currently quite complete, but there is no pagination for big lists yet, no filtering, and none of the other advanced query parameters I could still add to fields. I will start on this today.

  3. @LeeSmet will help me set up a TFChain standard network explorer node that runs the current version of the GraphQL explorer, so we can also test it in a real environment. It will be redeployed each time I make substantial updates. This will be done in parallel with task (2).

  4. GraphQL also supports updates over (web) sockets, called subscriptions. They have been part of GraphQL since 2015, I think, so I would expect them to be quite established by now (or so I hope). It would be good to already provide 3 types of subscriptions: to a wallet, to a contract, and to the new (latest) block (see the subscription sketch after this list).

  5. There are still "important" improvements that can be made to the current MVP implementation, despite the work already done on it, and that is without even talking about supporting extensions yet:

    • The chain data is already stored as a network of data, linking objects together rather than embedding them (e.g. the outputs of a transaction are stored separately, and the transaction simply stores their IDs). The exception is the inputs, which only store the spending information as optional data of an input. This can be improved by linking via the already-used DataID, an incrementing 64-bit integer that I already use as the storage key for all objects (wallets, contracts, blocks, transactions, outputs). This DataID could also be used internally to link, and would shrink every ID link by a factor of 4 (and slightly more for wallet links, e.g. ms (multisig) wallet references in ss (single-signature) wallets); see the DataID sketch after this list. To achieve this we will need to make it slightly lazier at that level as well, where the identifiers (the real ones) are provided by a function instead of a flat structure. But in that case we might as well look into whether it isn't better to expose most of those models from the explorer db interface as resolver interfaces instead of data structures. Something to investigate.
      • Having data identifiers as links will already reduce the size of wallets (especially the heavily used ones) by a lot, roughly a factor of 4 as an upper bound. These wallets still pose a big burden on disk I/O, though, as each time a wallet is updated the entire wallet has to be read and written back. This could be resolved in 2 steps:
        • the wallet's primitive and aggregated (e.g. balance) information can be stored as one object, and the object links (coin outputs, blockstake outputs, transactions, blocks) could be stored as 4 different objects, all using that same data ID. This fits the proposal to turn the explorer db models into resolvers, so these links only need to be resolved when the user actually requests them. If on top of that (though this might be more fantasy than anything Storm DB could ever do) we could simply append to such an object directly on disk, instead of having to read and rewrite it completely, that would be great. If that is possible, and we handle our data processing smartly, we should be able to revert just as easily by trimming the last (x) identifiers from the lists directly on disk as well.
      • Since yesterday we use a single (RW) transaction per applied consensus change, which already made our (initial) syncing 3 times faster on my machine. Reading, however, is still done with a (R) transaction per call. I am not sure how much performance impact that has, but if we could somehow keep it to one transaction per (GraphQL) query, that might be nice too. How that would fit into the API provided by the generated GraphQL resolver code is a mystery to me, though. It may also matter less for reading than it does for writing.
  6. Currently the GraphQL explorer does not support unconfirmed data yet (transaction pool data). This can easily be resolved by providing an explorer.DB interface implementation that subscribes to the Transaction Pool and wraps another explorer DB. It could then be made so that, if the wrapped DB returns ErrNotFound for an object that could theoretically be in the pool, the wrapper checks its current pool state, and if the object is there, returns it with an unconfirmed flag set to true (see the wrapper sketch after this list). All objects that can be unconfirmed would from then on also have an optional Unconfirmed Bool field. This task is easy, trivial even, so to me it is less urgent; it only needs to be done once we want to take this module into production.
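For step (4), a sketch of what a subscription resolver could look like in the gqlgen style, where a subscription field maps to a resolver returning a receive-only channel. The BlockSource hook and all other names are assumptions, not existing rivine code:

```go
// Hypothetical "latest block" subscription resolver (gqlgen maps
// subscription fields to resolvers returning a channel). BlockSource is
// an assumed hook into the consensus set, not an existing rivine API.
package explorergraphql

import "context"

type Block struct{ ID string }

type BlockSource interface {
	// SubscribeBlocks invokes fn for every new block; the returned
	// function cancels the subscription (and is assumed to wait for
	// in-flight callbacks before returning).
	SubscribeBlocks(fn func(*Block)) (cancel func())
}

type subscriptionResolver struct {
	chain BlockSource
}

// LatestBlock streams every new block to the client until it disconnects.
func (r subscriptionResolver) LatestBlock(ctx context.Context) (<-chan *Block, error) {
	ch := make(chan *Block, 1)
	cancel := r.chain.SubscribeBlocks(func(b *Block) {
		select {
		case ch <- b: // push the new block to the subscriber
		case <-ctx.Done(): // client gone: drop the block
		}
	})
	go func() {
		<-ctx.Done() // clean up once the client disconnects
		cancel()
		close(ch)
	}()
	return ch, nil
}
```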
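For step (5), the DataID linking boils down to a layout like this (illustrative types, not the actual rivine storage code); the factor-4 reduction comes from replacing 32-byte hash identifiers with 8-byte integers in every link:

```go
// Illustrative sketch of DataID-based linking: cross-references shrink
// from a 32-byte hash to an 8-byte DataID.
package explorerdb

// DataID is the incrementing 64-bit identifier already used as the
// storage key for all objects (wallets, contracts, blocks, txns, outputs).
type DataID uint64

// StoredTransaction links its outputs by DataID; the real 32-byte
// identifiers would be resolved lazily, only when a query asks for them.
type StoredTransaction struct {
	DataID  DataID
	ID      [32]byte // the real (hash) identifier of the transaction itself
	Outputs []DataID // 8 bytes per link instead of 32
}
```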
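And for step (6), a sketch of the wrapping explorer DB; the interface and type names are assumptions, as the point is only the ErrNotFound fallback to the pool state:

```go
// Sketch of the wrapping explorer DB: it delegates to the inner DB and,
// on ErrNotFound, falls back to its view of the transaction pool, which
// it keeps up to date via its transaction-pool subscription.
package explorergraphql

import (
	"errors"
	"sync"
)

var ErrNotFound = errors.New("object not found")

type Transaction struct {
	ID          string
	Unconfirmed bool // the optional flag exposed through the API
}

type TransactionDB interface {
	Transaction(id string) (*Transaction, error)
}

type UnconfirmedDB struct {
	inner TransactionDB
	mu    sync.RWMutex
	pool  map[string]*Transaction // kept in sync by the pool subscription
}

func (db *UnconfirmedDB) Transaction(id string) (*Transaction, error) {
	txn, err := db.inner.Transaction(id)
	if !errors.Is(err, ErrNotFound) {
		return txn, err // found (or a real error): pass through as-is
	}
	db.mu.RLock()
	defer db.mu.RUnlock()
	if pooled, ok := db.pool[id]; ok {
		cp := *pooled         // copy so we do not mutate the pool entry
		cp.Unconfirmed = true // mark the object as unconfirmed
		return &cp, nil
	}
	return nil, ErrNotFound
}
```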

GlenDC commented 5 years ago

For task (5) I might have an idea. Currently a wallet contains some identification data, aggregated data (such as balance), and mostly linked data (block IDs, transaction IDs, coin output IDs, blockstake output IDs). In the most storage-cheap approach we could instead link all used wallets in each block (by data ID). This would make the updates and storage very cheap, but it would make querying for outputs, transactions and blocks a lot more expensive. Blocks would still be kind of OK, as those can be indexed directly, but for transactions we would need to resolve each block transaction ID, and for outputs we would on top of that need to resolve each output to know whether it is linked to that address. Perhaps that is acceptable, perhaps not. A balance that needs to be investigated, I guess.

I propose that once we make time for this task, we set up some quick stand-alone examples that we can benchmark, so we know what gives the best results in terms of storage as well as update speed and query speed; a possible harness is sketched below.
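A minimal sketch of what such a stand-alone benchmark could look like, assuming a hypothetical WalletStore interface with one (deliberately naive, in-memory) implementation per layout; the real versions would of course write through the actual database:

```go
// Benchmark harness comparing the two wallet storage layouts discussed
// above. All names are hypothetical; the stores are in-memory stand-ins.
package walletbench

import "testing"

// WalletStore abstracts the two candidate layouts.
type WalletStore interface {
	// ApplyBlock records one block's wallet updates.
	ApplyBlock(height uint64, walletIDs []uint64) error
}

// embeddedStore keeps a growing ID list per wallet (current layout),
// standing in for the read-modify-write of each touched wallet.
type embeddedStore struct{ wallets map[uint64][]uint64 }

func newEmbeddedStore() *embeddedStore {
	return &embeddedStore{wallets: map[uint64][]uint64{}}
}

func (s *embeddedStore) ApplyBlock(height uint64, walletIDs []uint64) error {
	for _, id := range walletIDs {
		s.wallets[id] = append(s.wallets[id], height)
	}
	return nil
}

// linkedStore links wallets per block instead (proposed layout): a single
// cheap write per block, at the cost of more expensive queries later.
type linkedStore struct{ blocks map[uint64][]uint64 }

func newLinkedStore() *linkedStore {
	return &linkedStore{blocks: map[uint64][]uint64{}}
}

func (s *linkedStore) ApplyBlock(height uint64, walletIDs []uint64) error {
	s.blocks[height] = walletIDs
	return nil
}

func benchmarkStore(b *testing.B, s WalletStore) {
	for i := 0; i < b.N; i++ {
		_ = s.ApplyBlock(uint64(i), []uint64{1, 2, 3})
	}
}

func BenchmarkEmbedded(b *testing.B) { benchmarkStore(b, newEmbeddedStore()) }
func BenchmarkLinked(b *testing.B)   { benchmarkStore(b, newLinkedStore()) }
```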

GlenDC commented 5 years ago

For task (3) @LeeSmet tried to set up a standard tfchain daemon, for which I provided a branch, but we seem to have an issue on the data storage side: it takes about 1 GB of disk space per 1000 blocks, which is not a very good trade-off.

So I am fairly sure that I'll need to look into optimising this heavily starting Monday next week.

GlenDC commented 5 years ago

Schema documentation has also been added; our playground supports it by default. If we want, it is also possible to use third-party tools to generate stand-alone documentation from it.

Example in our playground:

[screenshot: schema documentation shown in the GraphQL playground]
GlenDC commented 5 years ago

The last two days I have been trying to optimise some things, both in terms of speed and in terms of storage.

We currently require, with wallet data included, about 260 MB per 1000 blocks, a reduction by roughly a factor of 4 compared to the initial implementation. Still a lot, though.

We also already sync 12% faster (more or less).

Both optimisations have been mostly achieved by using msgpack instead of rivbin for the explorer graphql db storage.
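I won't detail the exact wiring here, but with StormDB such a codec switch is essentially a one-liner at open time; a sketch (the database file name is assumed, and the actual rivine change may differ):

```go
// Sketch of opening a Storm database with its msgpack codec instead of
// the default, so all stored explorer objects are msgpack-encoded.
package main

import (
	"log"

	"github.com/asdine/storm"
	"github.com/asdine/storm/codec/msgpack"
)

func main() {
	db, err := storm.Open("explorer.db", storm.Codec(msgpack.Codec))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// ... store and query explorer objects as before, now msgpack-encoded.
}
```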

Still a lot of room for optimisation though. The biggest win will be to stop relying on the query feature for the unlocking of outputs; that alone should already help a lot (or so my theory goes). Besides that, we should try to batch the applied/reverted blocks in bigger groups, which might also help (though I am not sure about that one).

We're still using StormDB for now. If there are other embedded DB suggestions that I should try to use instead, feel free to hit me up.

DylanVerstraete commented 5 years ago

My findings on GraphQL implementations for frontend applications:

Apollo GraphQL: https://github.com/apollographql

It can be used in React and Vue. For the explorer frontend we will be focusing on the Vue implementation.

Apollo GraphQL for Vue.js: https://github.com/vuejs/vue-apollo

It's straightforward to integrate into our existing explorer frontend project: https://github.com/threefoldtech/rivine-chain-explorer.

Apollo GraphQL for Vue.js uses Vue and TypeScript, which fits our needs. It also has local state and cache management functionality: https://apollo.vuejs.org/guide/local-state.html#local-state

For a preview of how this will work: https://apollo.vuejs.org/guide/apollo/#usage-in-vue-components

GlenDC commented 5 years ago

The GraphQL explorer DB now syncs pretty fast. On my laptop it can do about 20 blocks per second with wallet updates, which is a lot faster than what we started with. Disk space is still an issue; we will need to handle the aggregated wallet data differently, that is for sure.

Without wallet updates we can go up to 60 blocks per second on my laptop.

The API has also been cleaned up: I removed the reference-point thing and kept it at height only. Getting a block by height can be done using the blockAt query, and you can also count from the end using a negative index, where -1 is the last block (same as omitting the height), -2 the second-to-last block, etc.
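For example, a hypothetical client call over HTTP; the endpoint address and the selected id field are assumptions, not documented API:

```go
// Minimal client example of the blockAt query with a negative height.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
)

func main() {
	// -2 asks for the second-to-last block; -1 (or omitting the height)
	// would yield the last block.
	payload := `{"query":"{ blockAt(height: -2) { id } }"}`
	resp, err := http.Post(
		"http://localhost:23110/explorer/graphql", // assumed local address
		"application/json",
		strings.NewReader(payload),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // the response shape mirrors the query
}
```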

GlenDC commented 5 years ago

A fix to the previous commit was needed: we have to commit to disk regularly, as otherwise the pending data just fills up RAM. So now we commit to disk every 1000 blocks during the initial sync. This seems to resolve most of our RAM blow-up issues, except that the aggregated wallets are still a pain in the ***. We probably need to stop aggregating the identifier references in the wallet and instead do it differently, such that our wallets do not grow (more or less linearly, with some factor) with the blockchain height.
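The batched-commit pattern looks roughly like this (a sketch with illustrative names, using Storm's transaction API; the actual rivine code may differ):

```go
// Sketch of periodic commits during the initial sync: writes are buffered
// in one RW transaction and flushed to disk every commitInterval blocks,
// keeping memory usage bounded.
package chainsync

import "github.com/asdine/storm"

const commitInterval = 1000 // blocks per disk commit during initial sync

// Block is a minimal illustrative model; Storm uses the ID field as key.
type Block struct {
	ID     uint64 `storm:"id"`
	Height uint64
}

type blockSyncer struct {
	db      *storm.DB
	tx      storm.Node
	pending int
}

func (s *blockSyncer) applyBlock(b *Block) error {
	if s.tx == nil {
		tx, err := s.db.Begin(true) // one RW transaction per batch
		if err != nil {
			return err
		}
		s.tx = tx
	}
	if err := s.tx.Save(b); err != nil {
		return err
	}
	s.pending++
	if s.pending >= commitInterval {
		if err := s.tx.Commit(); err != nil { // flush the batch to disk
			return err
		}
		s.tx, s.pending = nil, 0
	}
	return nil
}
```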

GlenDC commented 5 years ago

You can find links to documentation and examples of the current feat/graphql-phase1 implementation at: https://gist.github.com/GlenDC/13e60383dd82a682a0af4d770f0873f5.

GlenDC commented 5 years ago

Multiple blocks can now be fetched at the same time; block query examples have been added as example files and are linked from the main file of the gist documentation linked above.

One can also use filters to filter on height and timestamp, as well as define their own limits. This implementation works well so far and uses cursor-based pagination. For the user the cursor is an opaque string, to make it as easy as possible to use.
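For illustration, one common way to implement such an opaque cursor, assuming (hypothetically) that the internal position is just a block height:

```go
// Sketch of an opaque pagination cursor: the internal position is encoded
// as base64, so the client treats it as an opaque string it passes back
// verbatim, and we remain free to change the encoding later.
package cursor

import (
	"encoding/base64"
	"fmt"
	"strconv"
)

// Encode turns an internal height into an opaque cursor string.
func Encode(height uint64) string {
	return base64.RawURLEncoding.EncodeToString(
		[]byte(strconv.FormatUint(height, 10)))
}

// Decode recovers the height from a client-supplied cursor.
func Decode(cursor string) (uint64, error) {
	raw, err := base64.RawURLEncoding.DecodeString(cursor)
	if err != nil {
		return 0, fmt.Errorf("invalid cursor: %w", err)
	}
	return strconv.ParseUint(string(raw), 10, 64)
}
```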

DylanVerstraete commented 4 years ago

Right now it's not possible to retrieve the transaction ID for an output's child input. The picture explains:

[screenshot: query result missing the transaction ID of an output's child input]
GlenDC commented 4 years ago

Right now it's not possible to retrieve the transaction ID for an output's child input.

The parent transaction of an input can now also be retrieved (thus also exposing its ID).

GlenDC commented 4 years ago

Going to round off this task for now. It is not complete, but there may be more urgent tasks that take priority over it, given the time it has already taken.

What we have:

What we do not have:


So far I am convinced of my choice of GraphQL; I think it is what we want/need. I certainly do not know of a better technology available to us at the moment that allows us to do what we want. More about this later.

As a database I currently use StormDB (v2), which uses BoltDB. While it does seem to work for now, I am not convinced it is the most ideal solution. v3 promises to be a lot better, even allowing the use of something other than BoltDB, so perhaps that resolves all my doubts about StormDB as a choice. Even so, if you know of a better database that is both efficient and allows us to do all the querying and indexing we need, feel free to suggest it. My only desire is that I can embed it, as I find that easier for the user. Besides that I really have no strong opinion on the choice of backend; it is in any case completely unrelated to my strong belief that GraphQL is the way to go for our chain web APIs.


This GraphQL proposal (and the MVP implementation of this first, unfinished phase) was made with good intentions and a real thought process, yet it was not communicated clearly enough. Thanks to @DylanVerstraete we also have, from a developer-as-a-user perspective, a clear comparison of what it is like to support the old API (REST, explorer module) versus the new API (GraphQL), as Dylan tried both in his modern explorer. With the old API you have to write an entire parsing module, understand how it all works, and do a lot of work as a developer to get it right. On top of that you get a lot of data (sure, you can add pagination, but then you are still paging through gigantic data you do not actually need). With the new API he has almost no work and no parsing to do, and ReactJS/VueJS support GraphQL out of the box, making it possible to hook it directly up to your frontend components. With the new API we offer the same features as the old API, and even more (much more), while for the user it is easy to use and requires no work.

So why my choice for GraphQL in this proposal?

What you want from an API is that it is defined by a schema, so it can serve as a contract; a contract that is not broken (APIs should not be broken once defined as your contract). On top of that, your documentation, server backends and clients should be generated from that contract, so that you do not lose time fixing bugs and manually keeping all these things in sync, as that is a battle you will always lose (or have to throw a lot of resources at, at a very big cost, to do right).

GraphQL goes further than these base requirements of mine. It is typed, which gives a lot of validation out of the box in the generated code, makes things much clearer to the user, and maps nicely onto generated client code. As GraphQL only gives you what you ask for, you also do not need to learn the responses: they look exactly like your queries, so learn your queries and you know your responses. And because users only receive what they ask for, you can safely add new properties to types without fear that existing users will suddenly receive extra data they do not need, as they simply will not get it.
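To make that concrete, a hypothetical query and its response (the field names are illustrative, not our actual schema):

```go
// Illustration of "responses look exactly like your queries": the JSON
// response below has the same shape as the query. All names hypothetical.
package main

import "fmt"

const query = `{
  block(height: 42) {
    id
    timestamp
  }
}`

// The response mirrors the query shape, field for field.
const response = `{
  "data": {
    "block": {
      "id": "78c160e7...",
      "timestamp": 1572523212
    }
  }
}`

func main() {
	fmt.Println(query, response)
}
```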

GlenDC commented 4 years ago

@robvanmieghem proposes that I use BCDB as a backend, as a test, to also resolve issue https://github.com/threefoldtech/jumpscaleX_threebot/issues/43.

Thus, I'll:

LeeSmet commented 4 years ago

The explorer for tfchain crashed while syncing, with the following output:

$ ./tfchain-graphql -Mcgq -v
Loading...
Binding API Address and serving the API...
Setting up root HTTP API handler...
Loading gateway (1/3)...
Loading consensus set (2/3)...
Loading graphql explorer (3/3)...
goroutine 60 [running]:
runtime/debug.Stack(0x4664f7, 0x0, 0xc000805ed8)
        /usr/lib/go/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
        /usr/lib/go/src/runtime/debug/stack.go:16 +0x22
github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/build.Critical(0xc000805fa0, 0x2, 0x2)
        /home/lee/go/src/github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/build/critical.go:15 +0xaa
github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/explorergraphql.(*Explorer).InitialProcessConsensusChanges(0xc0000b2fc0, 0xc00007f380)
        /home/lee/go/src/github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/explorergraphql/explorer.go:104 +0xc4
created by github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/consensus.(*ConsensusSet).initializeSubscribe.func1
        /home/lee/go/src/github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/consensus/subscribe.go:140 +0x606
Critical error: Explorer.ProcessConsensusChange failed failed to apply block: failed to get extension data from txn 58de24b753c9b4d1ec9e7994858a8ca848252542f658f39ef44d097fbf2a14fb (#2) of block 78c160e7d0bec959d416657511804c72f7f81bef2e833db0ee710db0f564a5d2: failed to unmarshal minter definition ext. data: unexpected EOF
Please submit a bug report here: https://github.com/threefoldtech/rivine/issues
panic: Critical error: Explorer.ProcessConsensusChange failed failed to apply block: failed to get extension data from txn 58de24b753c9b4d1ec9e7994858a8ca848252542f658f39ef44d097fbf2a14fb (#2) of block 78c160e7d0bec959d416657511804c72f7f81bef2e833db0ee710db0f564a5d2: failed to unmarshal minter definition ext. data: unexpected EOF
Please submit a bug report here: https://github.com/threefoldtech/rivine/issues

goroutine 60 [running]:
github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/build.Critical(0xc000805fa0, 0x2, 0x2)
        /home/lee/go/src/github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/build/critical.go:17 +0x136
github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/explorergraphql.(*Explorer).InitialProcessConsensusChanges(0xc0000b2fc0, 0xc00007f380)
        /home/lee/go/src/github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/explorergraphql/explorer.go:104 +0xc4
created by github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/consensus.(*ConsensusSet).initializeSubscribe.func1
        /home/lee/go/src/github.com/threefoldfoundation/tfchain/vendor/github.com/threefoldtech/rivine/modules/consensus/subscribe.go:140 +0x606

Also, syncing seems to be pretty slow.

GlenDC commented 4 years ago

Found a couple of minutes to check it out and I fixed it already. Please pull when you find some minutes yourself.

GlenDC commented 4 years ago

Forgot to mention, but https://explorer.rnd.threefoldtoken.com/explorer/graphql is live to play with. It is however clear by now that StormDB is an awful choice. I think an external SQL database (Postgres or SQLite) will be the way to go; it will also allow us to aggregate on the fly with complex queries. Until then, we'll see how BCDB does. If all goes well it should at the very least do better than StormDB.