harmony-one / bounties

The bounty program helps the community take part in the development of the Harmony blockchain. It covers everything from core features to validator tooling, from dApp development to DeFi integration.
MIT License

bigquery support on google cloud platform #49

Closed LeoHChen closed 2 years ago

LeoHChen commented 3 years ago

Description

Make Harmony blockchain data available on Google Cloud Platform (GCP) BigQuery.

Context

Google BigQuery provides a powerful interface for data queries and analyses. We would like to make Harmony blockchain data available on Google BigQuery as a public dataset.

For example, the following post lists the cryptocurrency public datasets currently available on the BigQuery platform: https://cloud.google.com/blog/products/data-analytics/introducing-six-new-cryptocurrencies-in-bigquery-public-datasets-and-how-to-analyze-them
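For a sense of what that enables for consumers, below is a minimal Go sketch of querying such a public dataset once it exists. The table path `harmony-bigquery.crypto_harmony.blocks` matches the dataset names that come up later in this thread, but the naming of a finally published public dataset, and the placeholder project ID, are assumptions:

package main

import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/bigquery"
    "google.golang.org/api/iterator"
)

func main() {
    ctx := context.Background()

    // Queries against a public dataset run in (and are billed to) your own project.
    client, err := bigquery.NewClient(ctx, "your-gcp-project") // placeholder project ID
    if err != nil {
        log.Fatalf("bigquery.NewClient: %v", err)
    }
    defer client.Close()

    // Hypothetical table path; the final public dataset name may differ.
    q := client.Query("SELECT COUNT(*) AS num_blocks FROM `harmony-bigquery.crypto_harmony.blocks`")

    it, err := q.Read(ctx)
    if err != nil {
        log.Fatalf("running query: %v", err)
    }
    for {
        var row []bigquery.Value
        if err := it.Next(&row); err == iterator.Done {
            break
        } else if err != nil {
            log.Fatalf("reading results: %v", err)
        }
        fmt.Println("blocks:", row[0])
    }
}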


Acceptance Criteria

Reward

US$2,000 equivalent amount of Harmony ONE token.

xenowits commented 3 years ago

I'm interested, @LeoHChen. Should I take this up?

LeoHChen commented 3 years ago

Sure, @xenowits, you may start. We are adding it to our Gitcoin bounty round two.

xenowits commented 3 years ago

Thanks! Started work

gitcoinbot commented 3 years ago

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


This issue now has a funding of 22245.3645 ONE (2003.33 USD @ $0.09/ONE) attached to it.

gitcoinbot commented 3 years ago

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work has been started.

These users each claimed they can complete the work by 1 week, 5 days from now. Please review their action plans below:

1) farque65 has started work.

Make harmony one blockchain data available on google bigquery.

2) cpurta has started work.

I am planning to pull harmony blockchain data and begin uploading that data into GCP BigQuery where the dataset can be shared publicly. Once that is done I plan to create an automation script to allow for continuous blockchain data to be available through BigQuery.

3) studybug has started work.

Export Harmony Blockchain data to the BigQuery Platform.

4) fuad-ardiono has started work.

Insert harmony blockchain data into GCP BigQuery using observable and reactive programming methods.

5) nifrali has started work.

Create an ETL for ONE data to GCP BigQuery.

Learn more on the Gitcoin Issue Details page.

studybug commented 3 years ago

I am happy to help too!

gitcoinbot commented 3 years ago

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


Work for 22245.3645 ONE (1647.18 USD @ $0.07/ONE) has been submitted by:


sophoah commented 2 years ago

@cpurta I looked at your entry, which is so far a really great submission, but I had multiple issues trying to run it:

Please help fix all the issues, including the addition of some improvement logic, like testing for the existence of the dataset and table and creating them if they do not exist.

Thanks

cpurta commented 2 years ago

@sophoah thank you for the feedback on the submission.

sophoah commented 2 years ago

@cpurta The compilation builds a binary that refers to dynamic libraries that need to be present in the system (or image, in our case). When I tested and built the image via your Dockerfile, glibc was missing; give it a try and you'll understand.

Let me know once the changes have been pushed and I can help test again.

cpurta commented 2 years ago

@sophoah I have pushed the changes that check for the needed tables and create them if needed. I also attempted to dig into the compilation issue you mentioned, but I am unable to recreate it without more context. Can you provide some of the commands you are using when compiling the binary?

sophoah commented 2 years ago

@cpurta Thanks, I am a bit busy right now but I'll come back to you by next week. The issue is not with compiling the binary, which works without issue; the issue is that the same compiled binary is then copied into the container, and the container is missing the dynamic library.

I fixed it by adding this to the Dockerfile:

# Alpine ships musl libc, so the dynamically linked binary also needs glibc;
# pull the compatibility packages from sgerrand/alpine-pkg-glibc.
ENV GLIBC_REPO=https://github.com/sgerrand/alpine-pkg-glibc
ENV GLIBC_VERSION=2.30-r0

# Download and install the glibc compatibility packages, then refresh the
# dynamic linker cache so the binary can resolve its libraries.
RUN set -ex && \
    apk --update add libstdc++ curl ca-certificates && \
    for pkg in glibc-${GLIBC_VERSION} glibc-bin-${GLIBC_VERSION}; \
        do curl -sSL ${GLIBC_REPO}/releases/download/${GLIBC_VERSION}/${pkg}.apk -o /tmp/${pkg}.apk; done && \
    apk add --allow-untrusted /tmp/*.apk && \
    rm -v /tmp/*.apk && \
    /usr/glibc-compat/sbin/ldconfig /lib /usr/glibc-compat/lib

thanks

cpurta commented 2 years ago

@sophoah do you have any other feedback on the project?

sophoah commented 2 years ago

@cpurta Sorry for the delay. Without the Dockerfile change above I get this error:

standard_init_linux.go:219: exec user process caused: no such file or directory

I think it's best for the binary to be built and then run as part of the Docker image creation directly.

With a working image, when I start it I get this error:

error running program: googleapi: Error 404: Not found: Dataset harmony-bigquery:crypto_harmony, notFound

So the dataset is not created automatically.

Manually creating the dataset and running the container again worked, and syncing is now ongoing. Stopping the container and restarting works: the process detects the last synced block and starts syncing again from there.

Can you please fix the Dockerfile so the binary is created during the build, and add the auto dataset creation?

Should a transactions table be created and filled at some point? So far I don't have any.

@LeoHChen this bounty is really close to finished.

cpurta commented 2 years ago

@sophoah I believe that I resolved both of the issues that you mentioned. The binary is now compiled within a golang Docker container and is then copied into the Alpine Linux image using a multi-stage build. I have tested that the docker build works and that the application container will run the binary. I updated the project README with a new docker build command that is now run from the project root directory instead of the docker directory, so make sure to use that command, as the previous one will no longer work.

Also, I added a check for the specified dataset_id that should create the dataset within the project if it does not exist. I was unable to smoke test this as my free tier for GCP has expired. Do you mind testing the changes to make sure that they are up to par with the specifications of the bounty?
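For reference, a dataset existence check of the kind described typically looks like this with the Go BigQuery client. This is a sketch assuming the project uses cloud.google.com/go/bigquery, not necessarily the submission's exact code, and the package name is hypothetical:

package hmybq // hypothetical package name

import (
    "context"
    "errors"
    "net/http"

    "cloud.google.com/go/bigquery"
    "google.golang.org/api/googleapi"
)

// ensureDataset checks whether the dataset exists and creates it when the
// metadata lookup returns the 404 seen in the earlier error message.
func ensureDataset(ctx context.Context, client *bigquery.Client, datasetID string) error {
    ds := client.Dataset(datasetID)
    if _, err := ds.Metadata(ctx); err != nil {
        var apiErr *googleapi.Error
        if errors.As(err, &apiErr) && apiErr.Code == http.StatusNotFound {
            return ds.Create(ctx, &bigquery.DatasetMetadata{})
        }
        return err
    }
    return nil // dataset already exists
}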

sophoah commented 2 years ago

@cpurta The build part is working fine now, thanks.

Now when I start the container (I didn't delete the data that was indexed so far), I get:

{"level":"error","ts":1632302303.588839,"caller":"runner/backfill.go:179","msg":"unable to insert block transactions into BigQuery","block_number":10158169,"error":"googleapi: Error 404: Not found: Table harmony-bigquery:crypto_harmony.transactions, notFound","stacktrace":"github.com/cpurta/harmony-one-to-bigquery/internal/runner.(*BackfillRunner).backfillBlocks\n\t/go/src/github.com/cpurta/harmony-one-to-bigquery/internal/runner/backfill.go:179"}

So on that I manually created the transactions table, and then I got:

{"level":"error","ts":1632302541.6222308,"caller":"runner/backfill.go:179","msg":"unable to insert block transactions into BigQuery","block_number":10160571,"error":"googleapi: Error 400: The destination table has no schema., invalid","stacktrace":"github.com/cpurta/harmony-one-to-bigquery/internal/runner.(*BackfillRunner).backfillBlocks\n\t/go/src/github.com/cpurta/harmony-one-to-bigquery/internal/runner/backfill.go:179"}

So please help auto-create the table, and make sure all blocks that were indexed before have their transactions added to the transactions table.

Thanks

cpurta commented 2 years ago

@sophoah I just got around to addressing the issues you presented: I updated the program to create the dataset and tables needed, even after deletion. Let me know when you have tested it out and whether there are any issues you encounter.
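The table-side equivalent of the earlier dataset check, creating the table with an explicit schema (which also avoids the "destination table has no schema" error above), would look roughly like this. Again a sketch, reusing the imports from the dataset sketch, with the schema parameter standing in for the project's expected schema definitions:

// ensureTable creates tableID with the given schema if it does not exist yet.
// Creating with an explicit schema avoids the "destination table has no
// schema" error that a bare manual table creation produces.
func ensureTable(ctx context.Context, ds *bigquery.Dataset, tableID string, schema bigquery.Schema) error {
    t := ds.Table(tableID)
    if _, err := t.Metadata(ctx); err != nil {
        var apiErr *googleapi.Error
        if errors.As(err, &apiErr) && apiErr.Code == http.StatusNotFound {
            return t.Create(ctx, &bigquery.TableMetadata{Schema: schema})
        }
        return err
    }
    return nil // table already exists
}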

sophoah commented 2 years ago

@cpurta

{"level":"error","ts":1633421894.5614786,"caller":"runner/backfill.go:186","msg":"unable to insert block transactions into BigQuery","block_number":16340108,"error":"googleapi: Error 400: The destination table has no schema., invalid","stacktrace":"github.com/cpurta/harmony-one-to-bigquery/internal/runner.(*BackfillRunner).backfillBlocks\n\t/go/src/github.com/cpurta/harmony-one-to-bigquery/internal/runner/backfill.go:186"}
cpurta commented 2 years ago

@sophoah I am unable to re-create that error when testing deletion of the transactions table and performing a manual create without a schema. But I have put a check in place that examines the schema of each table to make sure it aligns with what will be inserted. If the check fails, the program will end with a helpful error message.
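A schema check of that shape can be a simple field-by-field comparison; the later error message names schema.TransactionsTableSchema as the expected schema. This is a sketch (same imports as the earlier sketches, plus fmt), not the submission's code:

// checkTableSchema compares the live table schema against the expected one
// and fails with an actionable message, as described above.
func checkTableSchema(ctx context.Context, t *bigquery.Table, want bigquery.Schema) error {
    md, err := t.Metadata(ctx)
    if err != nil {
        return err
    }
    if len(md.Schema) != len(want) {
        return fmt.Errorf("%s table schema does not match expected schema, please fix schema in GCP console and re-run", t.TableID)
    }
    for i := range want {
        if md.Schema[i].Name != want[i].Name || md.Schema[i].Type != want[i].Type {
            return fmt.Errorf("%s table schema does not match expected schema, please fix schema in GCP console and re-run", t.TableID)
        }
    }
    return nil
}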

I believe that this is a pretty acceptable change, as the automatic table creation will create the tables with the correct schemas.

I have also tested a complete bootstrap of the dataset and tables, and if a table is dropped it will re-create the table needed.

Please try out the latest changes and let me know if there are any other issues encountered. 🙏

sophoah commented 2 years ago

@cpurta

2021-10-07T05:52:24.142Z        DEBUG   runner/backfill.go:94   checking if dataset exists
2021-10-07T05:52:24.452Z        DEBUG   runner/backfill.go:106  checking if transactions table exists
error running program: `harmony-bigquery.crypto_harmony.transactions` table schema does not match schema.TransactionsTableSchema, please fix schema in GCP console and re-run

I don't think we would hit such an error in normal operation, so for now we can skip auto-fixing it. However, I then re-ran the container (with the table deleted) and now get this:

2021-10-07T05:54:28.693Z        DEBUG   runner/backfill.go:94   checking if dataset exists
2021-10-07T05:54:29.052Z        DEBUG   runner/backfill.go:106  checking if transactions table exists
error running program: googleapi: Error 404: Not found: Table harmony-bigquery:crypto_harmony.transactions, notFound
cpurta commented 2 years ago

@sophoah good catch. I found the issue and resolved it. I just tested removing the transactions table, and also tested removal of the blocks table.

One thing to note with removal and re-creation of the tables is that GCP caches table metadata, which causes some errors to be thrown in the program when attempting to insert. To account for this I implemented a retry mechanism with a linear backoff. More on the issue of table re-creation and streaming inserts can be found here.
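A linear-backoff retry of that shape is sketched below. The attempt count and base delay are made-up values, and bigquery.Inserter.Put is the standard streaming-insert call (same imports as the earlier sketches, plus time):

// insertWithRetry retries streaming inserts with a linear backoff, to ride
// out stale cached table metadata after a table drop/re-create.
func insertWithRetry(ctx context.Context, ins *bigquery.Inserter, rows interface{}) error {
    const maxAttempts = 5          // sketch value
    const baseDelay = 2 * time.Second // sketch value
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = ins.Put(ctx, rows); err == nil {
            return nil
        }
        time.Sleep(time.Duration(attempt) * baseDelay) // linear, not exponential
    }
    return err
}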

sophoah commented 2 years ago

Hello @cpurta, it looks good now; the app can start without issue. So now I am doing random checks on the data itself, and the most obvious thing I am noticing is that the transactions table doesn't have any transactions for blocks before 15 million, and a lot are missing in the 16M range as well. I can see new transactions being added as the remaining blocks are synced, but I wondered: is there a backfill function to go back from block 0 to ensure all transactions have been added?

cpurta commented 2 years ago

@sophoah Just doing a spot check against the API, I am seeing quite a few blocks below the 10M range that do not contain any transactions. While that does not account for a lot of what is missing from the 10M-15M range, it seems like logging a warning to flag a block with no transactions could be useful.

There is currently no backfill function to sync missing transactions, but it could be implemented as a separate command. Are you thinking that we go from block 0 to the most recently synced block, check whether the block has transactions, and if so, check whether those transactions exist in BQ and insert them if they do not?
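In outline, that separate command would do something like the following. getBlock, blockTxnsInBQ, and insertTxns are hypothetical helpers standing in for the project's RPC client and BigQuery wrappers, so this is pseudocode-level Go rather than a compilable unit:

// backfillMissing walks blocks 0..latest and inserts any transactions that
// are missing from BigQuery. Outline only; the helpers are hypothetical.
func backfillMissing(ctx context.Context, latest int64) error {
    for n := int64(0); n <= latest; n++ {
        block, err := getBlock(ctx, n) // hypothetical: fetch block n over RPC
        if err != nil {
            return err
        }
        if len(block.Transactions) == 0 {
            continue // nothing to check for this block
        }
        exists, err := blockTxnsInBQ(ctx, n) // hypothetical: are this block's txns in BQ?
        if err != nil {
            return err
        }
        if !exists {
            if err := insertTxns(ctx, block.Transactions); err != nil { // hypothetical
                return err
            }
        }
    }
    return nil
}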

sophoah commented 2 years ago

@cpurta We have had transactions since block 0 :) but true, there weren't a lot. INFO/WARNING is fine for me; it's just to note that there were no txs in a given block.

Doing the backfill through a separate command to sync missing transactions is OK; if that can work at the same time as the current main command is running, that would be perfect. Let me know once you have it, since I believe, if there are no other errors, that would be the last item.

On a side question, how difficult would it be to include the other shards (s1, s2, s3) or even another network like testnet?

cpurta commented 2 years ago

@sophoah that makes sense. I did add an INFO message when there are not any transactions within a block. I also added a command flag to start the backfill from a specific block in order to better debug the issue. I am not able to replicate the issue: I performed an analysis of the number of transactions received from the API, and it has so far lined up with what is inserted. I did change how the transaction inserts are performed (i.e. made them more atomic, not batched), so that may have made the difference. There are also some unit tests to attempt to rule out any bugs that may arise from marshaling/unmarshalling.

To answer your side question, I don't think it would be difficult at all to include other shards if those are available via another RPC endpoint, as that is configurable through the CLI/env flags (i.e. --node-url). If you want to have all shards in one table, it would require multiple backfills, each pulling from one shard. In that case, to identify which shard a block/transaction came from, a new column would be needed, which could be populated from a CLI flag (or perhaps interpreted from the node-url flag).

If you want a table per shard, it would be as simple as running a backfill with the destination tables having either a prefix or suffix of the shard name (e.g. --gcp-blocks-table-id=block_s1, --gcp-txns-table-id=transactions_s1, etc.). The auto-creation of the tables should make it fairly automated to run multiple backfills, one for each shard.

sophoah commented 2 years ago

Hey @cpurta, thanks for answering the side question.

Can you advise what the command flag is if I want to restart the sync from a given block number? Right now the transactions table has 37,666,230 rows, while our explorer dashboard (even though it also includes shards 1/2/3) is displaying 184,783,217; a good number for shard 0 would be at least 150M.

Thanks, regards

cpurta commented 2 years ago

@sophoah you can now specify the --start-block flag to start the backfill from a specific block, as long as the specified block number is not greater than the most recent block on the Harmony blockchain.

sophoah commented 2 years ago

@cpurta

docker exec -it gcp-bigquery hmy-bq-import --start-block 1                                                                                               
Incorrect Usage. flag provided but not defined: -start-block

NAME:
   hmy-bq-import - command line tool to import Harmony One blockchain data into GCP BigQuery

USAGE:
   hmy-bq-import [global options] command [command options] [arguments...]

VERSION:
   v0.0.1

AUTHOR:
   Chris Purta <cpurta@gmail.com>

COMMANDS:
   backfill  pull historical block data from Harmony One blockchain and insert into GCP BigQuery
   help, h   Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --help, -h     show help (default: false)
   --version, -v  print the version (default: false)
error running program: flag provided but not defined: -start-block
[ec2-user@ip-172-31-19-248 harmony-one-to-bigquery]$ 
cpurta commented 2 years ago

@sophoah you will need to specify the backfill command:

docker exec -it gcp-bigquery hmy-bq-import backfill --start-block 1 
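The help output above has the shape of a urfave/cli application (an assumption; the "flag provided but not defined" error format matches that library). In urfave/cli, flags declared on a subcommand are only parsed after the subcommand name, which is why the flag works for backfill but not at the top level. A minimal sketch of that declaration, not the project's actual code:

package main

import (
    "log"
    "os"

    "github.com/urfave/cli/v2"
)

func main() {
    app := &cli.App{
        Name:  "hmy-bq-import",
        Usage: "command line tool to import Harmony One blockchain data into GCP BigQuery",
        Commands: []*cli.Command{
            {
                Name:  "backfill",
                Usage: "pull historical block data and insert into GCP BigQuery",
                Flags: []cli.Flag{
                    // Declared on the subcommand, so it must follow
                    // "backfill" on the command line.
                    &cli.Int64Flag{
                        Name:  "start-block",
                        Usage: "block number to start the backfill from",
                    },
                },
                Action: func(c *cli.Context) error {
                    log.Printf("starting backfill at block %d", c.Int64("start-block"))
                    return nil
                },
            },
        },
    }
    if err := app.Run(os.Args); err != nil {
        log.Fatal(err)
    }
}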
cpurta commented 2 years ago

@sophoah are there any other outstanding issues that are holding up this work from being accepted?

sophoah commented 2 years ago

Hey @cpurta, sorry for the delay.

Last week I started a catch-up job from block 1, and in my logs I can see insertion entries:

2021-10-28T07:22:10.495Z        DEBUG   runner/backfill.go:246  inserting transactions  {"block_number": 5379526, "num_txns": 1}

@LeoHChen all good for me; on our end we'll just need an instance that keeps running to do the indexing continuously, and for all shards.

LeoHChen commented 2 years ago

Thanks @sophoah, I'll close this bounty. Please create an issue for ourselves to keep updating the indexing. Also, please work with @cpurta to write a blog post on how to use BigQuery on GCP for Harmony blockchain data. Thanks.

gitcoinbot commented 2 years ago

Issue Status: 1. Open 2. Started 3. Submitted 4. Done


The funding of 22245.3645 ONE (7033.14 USD @ $0.3/ONE) attached to this issue has been cancelled by the bounty submitter