near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

NEAR Lake indexers support of GCS storage #9749

Open gmilescu opened 1 year ago

gmilescu commented 1 year ago

Context

Lake indexers are currently running for mainnet and testnet. Lake Indexer instances are writing data to AWS S3 buckets (separate for each chain).

We are aware of an existing limitation on the AWS side. The limit is 5000 requests per second for a single bucket, regardless of the client. Those limits are not flexible and cannot be increased.

Although we don’t have proper monitoring (and I am unsure whether we even can have it), we’ve built an approximate estimator in Databricks to count req/s for the current state of things.

The screenshot of the chart was created around Mar 30, 2023.

NEAR Lake Mainnet Usage Dashboard in Databricks (it is broken for some reason)

UPDATE:

  • The dashboard is working again, showing different figures now

The main information from this chart is that we have around 2000 req/s with spikes of around 2500 req/s on our mainnet S3 bucket. So we’re halfway to the limit.

TL;DR

  • AWS S3 req/s limits are per bucket, not per user/client
  • Limit is 5000 req/s and cannot be increased
  • Current usage is 2000 req/s with 2500 req/s spikes

GCS alternative

Google Cloud Platform is our company-wide infrastructure of choice. The Google team wonders why we don’t use GCS (Google’s S3-compatible storage) for our Lake indexer.

I haven’t talked to them yet, but they are ready to help us copy the data from S3 to GCS. Thus it’s not that complex for us to spin up Lake indexers that write to GCS (this kind of customization has been built into the Lake Indexer from the beginning; kudos to me and a community contributor for the PR that enabled it in the early days of Lake Indexer).

To be clear, I am not talking about deprecating AWS S3, as that could harm the community. We want to add GCS as an alternative.

The good news about GCS is that its per-bucket limits are the same as AWS’s. However, these limits can be increased if we provide good reasoning to the Google team.

TL;DR

  • GCP is our infrastructure of choice
  • GCS limits are the same as AWS’s but can be increased
  • GCS is going to be an additional alternative; we will continue using AWS S3, too

FAQs

Here are the questions I’ve been asked frequently over the last year.

Why do we use AWS in the first place?

At the time we were releasing Lake Indexer (+ Framework), GCS wasn’t S3-compatible. We wanted to use something the majority of developers in the community are familiar with, and everybody knows what S3 is.

Why did you choose JSON instead of ?

A couple of reasons:

  1. At that moment, our main focus was on Web 2.0 developers curious about Web 3.0
  2. No offense, but near-primitives serialization into JSON and back is more likely to happen deterministically and doesn’t frequently require updates from our side. Example: the Protocol team changed some fields that used to be String (base64-encoded bytes) to Vec<u8>, yet they continued to be serialized as base64-encoded strings in JSON (see the sketch right after this list).
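
To make point 2 concrete, here is a minimal serde sketch; the struct and field names are hypothetical, not the actual near-primitives types. It shows how a field whose Rust type changed from String to Vec<u8> can keep serializing to the same base64 string in JSON, so JSON consumers never notice the change.

```rust
// Illustrative only: `ReceiptView` and `data` are made-up names, not real
// near-primitives types. The Rust-side type is now `Vec<u8>`, but the JSON
// representation stays a base64-encoded string, matching the old `String` field.
use base64::Engine;
use serde::{Serialize, Serializer};

fn as_base64<S: Serializer>(bytes: &Vec<u8>, serializer: S) -> Result<S::Ok, S::Error> {
    serializer.serialize_str(&base64::engine::general_purpose::STANDARD.encode(bytes))
}

#[derive(Serialize)]
struct ReceiptView {
    #[serde(serialize_with = "as_base64")]
    data: Vec<u8>,
}

fn main() {
    let view = ReceiptView { data: b"hello".to_vec() };
    // Prints: {"data":"aGVsbG8="}, the same JSON shape a `String` field produced.
    println!("{}", serde_json::to_string(&view).unwrap());
}
```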

How to achieve it

  1. We need to set up GCS buckets for the NEAR Lake Indexer instances to write to (a separate bucket for every chain we want)
  2. We need to configure the buckets with the Requester Pays feature enabled.
    1. We want Pagoda to be charged for the requests for data from within the company (our indexers)
    2. We want clients to be charged for access to the buckets for their indexers (that’s how it works with AWS S3 now)
  3. Make the necessary changes to the Lake Frameworks (Rust, TypeScript) so developers can specify where to read data from (see the sketch after this list)
    1. Code changes might not even be required, but shortcuts would be good to have so that we provide a good DevEx
    2. We need to update READMEs and perhaps some tutorials to educate developers about it
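
For step 3, a minimal sketch of what the Rust-side shortcut might look like, assuming we point the official AWS S3 Rust client at GCS’s S3-compatible endpoint. The s3_config/s3_bucket_name builder methods, the bucket name, and the block height are assumptions for illustration, not the final Lake Framework API.

```rust
// A sketch only: builder method names and the GCS bucket name are assumed, not
// an agreed-upon API. The idea is to reuse the official AWS S3 Rust client with
// a custom endpoint so the Lake Framework itself barely changes.
use near_lake_framework::LakeConfigBuilder;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let shared_config = aws_config::load_from_env().await;
    let gcs_s3_config = aws_sdk_s3::config::Builder::from(&shared_config)
        // GCS exposes an S3-compatible (XML) API at this endpoint.
        .endpoint_url("https://storage.googleapis.com")
        .build();

    let config = LakeConfigBuilder::default()
        .s3_config(gcs_s3_config)                     // assumed builder method
        .s3_bucket_name("near-lake-data-mainnet-gcs") // hypothetical bucket name
        .start_block_height(80_000_000)
        .build()?;

    // Stream indexed blocks exactly as with the AWS S3 buckets today.
    let (_handle, mut stream) = near_lake_framework::streamer(config);
    while let Some(streamer_message) = stream.recv().await {
        eprintln!("Block height: {}", streamer_message.block.header.height);
    }
    Ok(())
}
```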

ND-518 created by None

gmilescu commented 1 year ago

I don’t want to overload the initial description with updates on what has already been done, so I am posting them in the comment section.


TL;DR

  • testnet Lake indexer and bucket were set up
  • Requester Pays feature doesn’t work as expected
  • GCS API with the Requester Pays feature requires PROJECT_ID to be provided (to know what project to bill)
  • The GCS API is not 100% S3-compatible
  • I am going to talk to the Google team about it today

Honestly, I don’t expect the Google team can or would change the API because of us. I can accept the reasoning behind why they require PROJECT_ID. I hope they can share info about some workarounds I haven’t found (I didn’t even look, to be honest).

We’re not doomed because of it, but it does mean we need to explicitly add support for GCS, which I was hoping to avoid.

by 61c24882bce5e000697cf541

gmilescu commented 1 year ago

Update after call with Google team

We had a call with the Google team on Friday, 14 Jun

I’ve shared the details of the NEAR Lake Framework architecture and how we use AWS S3, along with the requirements we expect from any S3-compatible API.

I predicted the Google guy would bring a gang with him, so I showed up with my backup, George Milescu and Eduardo Ohe (thanks again for joining).

We figured out that the Google folks didn’t know about the Requester Pays feature; they even started to guess what this feature could be about. In the end, we agreed they would talk to other teammates familiar with the feature and come back to us with new information.

Meanwhile, they said they are ready to assist us with copying the data from the AWS S3 buckets to the GCS ones, but they warned us it might come with an extra charge (cc Ernesto Cejas Padilla, Rob Tsai).


The follow-up email we received from them after the call

Hi everyone,

Thanks again for your time today.  I appreciate you explaining your workflow in detail.  I have reached out to some folks to help with your questions.  In the meantime, I was able to find this Crate for Rust to interact with GCS.  I am not sure it covers all your needs, but it might be a place to start if you're not using it already.  We did not see an official GCS client for Rust, but that's part of what I plan to ask more about internally.

As soon as I have more information, I'll be in touch.  Have a great weekend.

by 61c24882bce5e000697cf541

gmilescu commented 11 months ago

Update after call with Google team Jul 26

We had another call with the Google team today. Our squad was Eduardo Ohe, Andrei Mustuc, and Bohdan Khorolets.

The Google team brought new people into the loop who knew about the Requester Pays feature.

They tried to convince us that the way GCS was designed is good and that they didn’t want to repeat AWS’s mistakes, which is why it is not 100% compatible.

Quick recap: the only blocker for us to use GCS is that we want to keep the Lake Framework API from changing, and we don’t want to write new features or change the S3 client the Lake Framework uses under the hood (the official AWS S3 Rust client, by the way).

The main difference in the GCS API is that the Requester Pays feature requires a PROJECT_ID to be provided so it knows which project to bill for the usage of the service, while AWS reads this information from the credentials provided (just the access key and secret access key).

Matthew from Google said similar behavior should be achievable on GCS using service accounts. However, he said it does not work like that today, but they promised to figure out whether they can add this task to their work queue.

From our side, we’re taking a few action items:

  1. I want to check using service accounts because it seems logical for them to work. I’d like to try it out myself.
  2. I need to have a closer look at the S3 client we use and whether it is possible to add custom query parameters to the requests the library makes. The PROJECT_ID needs to be passed like ?userProject=PROJECT_IDENTIFIER; if we can hack that in without forking and additional maintenance, that would solve our blocker (see the sketch right after this list for the request shape).
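
For reference, here is a minimal sketch (using reqwest directly rather than the S3 client, just to show the request shape) of what GCS expects for a Requester Pays bucket; the bucket, object key, and project ID are placeholders.

```rust
// Placeholders throughout: bucket, object key, and project ID are illustrative.
// The important part is the `userProject` query parameter; without it, a
// Requester Pays bucket rejects the request because GCS doesn't know whom to bill.
use reqwest::Client;

async fn fetch_block_json(access_token: &str) -> Result<String, reqwest::Error> {
    // JSON API object download; "%2F" is the URL-encoded "/" in the object key.
    let url = "https://storage.googleapis.com/storage/v1/b/near-lake-data-mainnet/o/000080000000%2Fblock.json";
    Client::new()
        .get(url)
        .query(&[("alt", "media"), ("userProject", "my-gcp-project-id")])
        .bearer_auth(access_token)
        .send()
        .await?
        .text()
        .await
}
```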

Eduardo Ohe, Andrei Mustuc: feel free to add anything I could have missed that you think is relevant to keep for the record.

by 61c24882bce5e000697cf541

gmilescu commented 11 months ago

Some context: in AWS, an IAM user is tied to a master account ID (a single billing account only); GCP, on the other hand, has all projects owned by an organisation, with possibly multiple billing accounts (in case separate billing is enabled).
GCP service account key files (JSON) contain all the relevant data, such as the project_id, the private key, and the client email.

I’m mentioning this because GCP needs the account’s project ID even for features other than Requester Pays.

by 63b74c4d7cde7bff9d7ac618

gmilescu commented 11 months ago

Hi Bohdan Khorolets and Andrei Mustuc, some additional points:

  1. I understand that we want to keep using the same AWS S3 Rust library with GCS so we have no/minimal changes in the indexer code (to write and read the files).
  2. I think we might need a cloud-agnostic solution for storing the files. So far it has been AWS, now GCS, soon Alibaba OSS (https://www.alibabacloud.com/help/en/oss/developer-reference/pay-by-requester-3), and potentially https://www.arweave.org/ and others. Not all of these options will necessarily be 100% compatible with the same Rust client library, so maybe we could have a storage abstraction layer for more flexibility (a rough sketch follows this list).
  3. In the short term, maybe we could have the GCS bucket with Requester Pays disabled and start using the files for internal purposes only (QueryAPI, Databricks load, BigQuery load, etc.)
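
A rough sketch of the storage-abstraction idea from point 2; the trait and method names are illustrative, not an agreed design.

```rust
// Illustrative names only. The Lake Framework would read data through a narrow
// trait, and each provider (AWS S3, GCS, Alibaba OSS, ...) would implement it,
// keeping authentication and billing quirks out of the framework itself.
use async_trait::async_trait;

#[async_trait]
pub trait LakeStorage: Send + Sync {
    /// Fetch the raw JSON object for one block or shard,
    /// e.g. "000080000000/block.json".
    async fn get_object(&self, key: &str) -> anyhow::Result<Vec<u8>>;

    /// List the block heights available starting from `start_from`,
    /// so the framework can discover what to stream next.
    async fn list_blocks(&self, start_from: u64, limit: usize) -> anyhow::Result<Vec<u64>>;
}

// Per-provider implementations would only differ in client setup and billing:
// struct S3Storage { client: aws_sdk_s3::Client, bucket: String }
// struct GcsStorage { client: reqwest::Client, bucket: String, user_project: String }
// struct OssStorage { /* Alibaba OSS client and bucket */ }
```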

by 63b75612741248746bf8a243

gmilescu commented 11 months ago

Alibaba support will be needed soon: NEAR Foundation announced a deal with them, and there is a lot of effort to bring developers and web3 users from China to NEAR. Given that they can’t access Google or AWS services, they will need their own lake. So we need to create an internal abstraction that can write to any cloud provider and configure it with a provider-specific set of config variables, which is not defined right now.

by 62322ab11c09d200701471ac