kadena-io / chainweb-data

Data ingestion for Chainweb.
BSD 3-Clause "New" or "Revised" License
14 stars 8 forks source link
chainweb haskell kadena

+TITLE: Chainweb Data

+AUTHOR: Colin

[[https://github.com/kadena-io/chainweb-data/actions/workflows/ci.yaml/badge.svg]] [[https://github.com/kadena-io/chainweb-data/actions/workflows/nix.yml/badge.svg]]

~chainweb-data~ stores and serves data from the Kadena Public Blockchain in a form optimized for lookups and analysis by humans. With this reindexed data we can easily determine mining statistics and confirm transaction contents.

~chainweb-data~ requires [[https://www.postgresql.org/][Postgres]] configured with UTF8 encoding. If you plan to host a chainweb-data instance on a cloud machine (e.g. Amazon EC2), we recommend that you run the postgres instance on an instance attached storage unit. The chainweb-node instance should have good performance to avoid timeouts during fill operations, use ssd storage for the db.

~chainweb-data~ can be built with either ~cabal~, ~stack~, or [[https://nixos.org/download.html][Nix]]. Building with ~nix~ is the most predictable because it is a single build command once you've installed Nix. This process will go significantly faster if you set up the Kadena nix cache as described [[https://github.com/kadena-io/pact/wiki/Building-Kadena-Projects][here]]. Building with ~cabal~ or ~stack~ will probably require a little more knowledge of the Haskell ecosystem.

+begin_example

git clone https://github.com/kadena-io/chainweb-data cd chainweb-data nix-build

+end_example

** Connecting to the Database

By default, ~chainweb-data~ will attempt to connect to Postgres via the following values:

| Field | Value | |---------+-------------| | Host | ~localhost~ | | Port | ~5432~ | | User | ~postgres~ | | Pass | Empty | | DB Name | ~postgres~ |

You can alter these defaults via command line flags, or via a [[https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING][Postgres Connection String]].

*** via Flags

Assuming you had set up Postgres, done a ~createdb chainweb-data~, and had configured user permissions for a user named ~joe~, the following would connect to a local Postgres database at port 5432:

+begin_example

chainweb-data --service-host= --dbuser=joe --dbname=chainweb-data

+end_example

*** via a Postgres Connection String

+begin_example

chainweb-data --service-host= --dbstring="host=localhost port=5432..."

+end_example

** Connecting to a Node

~chainweb-data~ syncs its data from a running ~chainweb-node~. The node's Service address is specified with the ~--service-host~ command. If a custom service enpoint port is used, you can specify it with ~--service-port~

+begin_example

chainweb-data --service-host=foo.chainweb.com ...

+end_example

*** Configuring the Node

~chainweb-data~ also needs some special node configuration. The ~server~ command needs ~headerStream~ and some of the other stats and information made available requires ~allowReadsInLocal~. Increasing the throttling settings on the node also makes the ~backfill~ and ~gaps~ operations dramatically faster.

+begin_example

chainweb: allowReadsInLocal: true headerStream: true throttling: global: 1000

+end_example

You can find an example node config in this repository in node-config-for-chainweb-data.yaml.

** How to run chainweb-data

When running chainweb-data for the first time you should run ~chainweb-data server -m~ (with the necessary DB and node options of course). This will create the database and start filling the database with blocks. Wait a couple minutes, then run ~chainweb-data fill~ (again with the necessary options). After the ~fill~ operation finishes, you can run ~server~ again with the ~-f~ option and it will automatically run fill once a day to populate the DB with missing blocks.

** Commands

*** listen

~listen~ fetches live data from a ~chainweb-node~ whose ~headerStream~ configuration value is ~true~.

+begin_example

chainweb-data listen --service-host=foo.chainweb.com --dbuser=joe --dbname=chainweb-data DB Tables Initialized 28911337084492566901513774

+end_example

As a new block comes in, its chain number is printed as a single digit. ~listen~ will continue until you stop it.

*** server

~server~ is just like ~listen~, but also runs an HTTP server that serves a few endpoints for doing common queries. Additionally, it can serve an OpenAPI v3 spec of the API when the hidden ~--serve-swagger-ui~ option is enabled, offering a basic interface for interacting with the API. This feature, however, is kept unofficial for now due to its rudimentary documentation.

By specifying the optional ~--no-listen~ argument, the server can be made read-only, allowing multiple servers to serve from the same database.

**** Endpoints

For more detailed information, see the API definition [[https://github.com/kadena-io/chainweb-api/blob/master/lib/ChainwebData/Api.hs#L24][here]].

**** Note about partial search results

All of ~chainweb-data~'s search endpoints (~/txs/{events,search,account}~) support a common workflow for efficiently retrieving the results of a given search in non-overlapping batches.

A request to any one of these endpoints that match more rows than the number asked with the ~limit~ query parameter will respond with a ~Chainweb-Next~ response header containing a token. That token can be used to call the same endpoint with the same query parameters plus the token passed in via the ~next~ query parameter in order to retreive the next batch of results.

~chainweb-data~ supports a ~Chainweb-Execution-Strategy~ request header that can be used (probably by ~chainweb-data~ operators by setting it in the API gateway) to enable an upper bound on the amount of time the server will spend for searching results. Normally, the search endpoints will produce the given ~limit~-many results if the search matches at least that many entries. However, if ~Chainweb-Execution-Strategy: Bounded~ is passed in, the response can contain less than ~limit~ rows even though there are potentially more matches, if those matches aren't found quickly enough. In such a case, the returned ~Chainweb-Next~ token will act as a cursor for the search, so it's possible to keep searching by making successive calls with subsequent ~Chainweb-Next~ tokens.

*** fill

~fill~ fills in missing blocks. This command used to be called ~gaps~ but it has been improved to encompass all block filling operations.

+begin_example

chainweb-data fill --service-host=foo.chainweb.com --dbuser=joe --dbname=chainweb-data

+end_example

*** backfill

Deprecated: The backfill command is deprecated and will be removed in future releases. Use the ~fill~ command instead.

~backfill~ rapidly fills the database downward from the lowest block height it can find for each chain.

Note: If your database is empty, you must fetch at least one block for each chain first via ~listen~ before doing ~backfill~! If ~backfill~ detects any empty chains, it won't proceed.

+begin_example

chainweb-data backfill --service-host=foo.chainweb.com --dbuser=joe --dbname=chainweb-data DB Tables Initialized Backfilling... [INFO] Processed blocks: 1000. Progress sample: Chain 9, Height 361720 [INFO] Processed blocks: 2000. Progress sample: Chain 4, Height 361670

+end_example

~backfill~ will stop when it reaches height 0.

*** backfill-transfers

~backfill-transfers~ fills entries in the transfers table from the highest block height it can find for each chain up until the height that events for coinbase transfers began to exist.

Note: If the transfers table is empty, you must fetch at least one row for each chain first via ~listen~ before doing ~backfill-transfers~! If ~backfill-transfers~ detects any empty chains, it won't proceed.

*** gaps

Deprecated: The backfill command is deprecated and will be removed in future releases. Use the ~fill~ command instead.

~gaps~ fills in missing blocks that may have been missed during ~listen~ or ~backfill~. Such gaps will naturally occur if you turn ~listen~ off or use ~single~.

+begin_example

chainweb-data gaps --service-host=foo.chainweb.com --dbuser=joe --dbname=chainweb-data DB Tables Initialized [INFO] Processed blocks: 1000. Progress sample: Chain 9, Height 361624 [INFO] Processed blocks: 2000. Progress sample: Chain 9, Height 362938 [INFO] Filled in 2113 missing blocks.

+end_example

*** single

~single~ allows you to sync a block at any location in the blockchain.

+begin_example

chainweb-data single --chain=0 --height=200 --service-host=foo.chainweb.com --dbuser=joe --dbname=chainweb-data DB Tables Initialized [INFO] Filled in 1 blocks.

+end_example

Note: Even though you specified a single chain/height pair, you might see it report that it filled in more than one block. This is expected, and will occur when orphans/forks are present at that height.

*** migrate

~migrate~ allows you to migrate the database schema to the latest version and exit. This can be useful for separating the migration step from running the ETL and/or HTTP service.

+begin_example

chainweb-data migrate --dbuser=joe --dbname=chainweb-data

+end_example

*** check-schema

~check-schema~ is used to perform a check of the ORM definitions against the DB schema.

+begin_example

chainweb-data check-schema --service-host=foo.chainweb.com --dbuser=joe --dbname=chainweb-data

+end_example

** Specializing the Database Schema

A common use case for ~chainweb-data~ is to primarily run it as a worker process to populate a Postgres database with blockchain data. In this case, ~chainweb-data~ operators often want to run their schema migrations and modify the schema according to their needs. Obviously, by introducing arbitrary schema changes, we can not guarantee unimpeded operation of ~chainweb-data~. Any node operator that wishes to modify the database, takes on the responsibility of ensuring that their changes do not interfere with the current operation of ~chainweb-data~. A node operator is also responsible for considering their changes in the face of future releases of ~chainweb-data~.

~chainweb-data~ provides a way to help with this process. Any version of ~chainweb-data~ comes with a set of schema migrations included in the binary that are applied by default to the database at migration time. These migrations are defined in the ~haskell-src/db-schema/migrations~ directory. It is possible to override these migrations by calling ~chainweb-data~ with the optional ~--migrations-folder~ argument. However, in order to add migrations to the default set, an ~--extra-migrations-folder~ argument is provided.

The default migrations that come with ~chainweb-data~ have the following file name format: ~X.Y.Z.N_NAME.sql~. Here ~X.Y.Z~ is the version of ~chainweb-data~ after which the migration was introduced. ~N~ is the migration number. These migrations are executed in "alphabetical order" considering X,Y,Z and N to be the elements by which they are sorted. The migration procedure will fetch the already executed migrations from the database and check them against the migrations provided through the ~--extra-migrations-folder~ and the ~--migrations-folder~ arguments. If the already executed migrations are a prefix (i.e. they were run in the correct order and have no gaps or extras) of the expected migrations, then the rest of the migrations will be executed.

By taking advantage of this alphabetical sorting, ~chainweb-data~ operators can insert custom migrations to be executed at the moment they desire. For example, version 2.3.0 of ~chainweb-data~ will have migrations done on top of version 2.2.0, thus having migrations named ~2.2.0.N...~ (~N>=1~). Creating a custom migration named ~2.3.0.0.N...~ will guarantee that it'll be executed after the new migrations that come with version 2.3.0 and before the new migrations of future versions, which are guaranteed to have a name greater than ~2.3.0.1...~. Likewise, ~chainweb-data~ operators that run the latest commit from the ~master~ branch can also inject their migrations. For example, if the latest commit has the last migration named ~2.2.0.1...~, then their migrations can be named ~2.2.0.1.N_...~.

It's important to note, that running ~chainweb-data~ from an unreleased commit of the ~master~ branch is not officially supported and even though we aim to avoid it, we can change new migrations of the ~master~ branch without notice, so you may have to fix your database manually by undoing migrations and removing ~schema_migrations~ entries.

~chainweb-data~ operators that specialize their database schema are strongly advised to review the incoming migrations before they upgrade their ~chainweb-data~ versions. This will allow them to detect any potential conflicts and insert new schema migrations to be executed at the right moment, to accommodate the incoming changes.