Open gmilescu opened 1 year ago
I don’t want to overload the initial description with the update of what has already been done, so I am doing it in the comment section.
TL;DR
Honestly, I don’t expect the Google team can or will change the API because of us. I can accept their reasoning for requiring PROJECT_ID, and I hope they can share workarounds I haven’t found (I didn’t even look, to be honest).
This doesn’t doom us, but it means we need to explicitly add support for GCS, which I was hoping to avoid.
by 61c24882bce5e000697cf541
We had a call with the Google team on Friday, 14 Jun
I’ve shared the details of the architecture of NEAR Lake Framework and how we use AWS S3, along with the requirements for any S3-compatible API we expect:
I expected the Google representative would bring colleagues along, so I showed up with my backup, George Milescu and Eduardo Ohe (thanks again for joining).
It turned out the Google folks didn’t know about the Requester Pays feature; they even started guessing at what it might be. In the end, we agreed they would talk to teammates familiar with the feature and come back to us with new information.
Meanwhile, they said they are ready to assist us with copying the data from our AWS S3 buckets to GCS ones, though they warned it might incur an extra charge (cc Ernesto Cejas Padilla, Rob Tsai).
The follow-up email we received from them after the call
Hi everyone,
Thanks again for your time today. I appreciate you explaining your workflow in detail. I have reached out to some folks to help with your questions. In the meantime, I was able to find this Crate for Rust to interact with GCS. I am not sure it covers all your needs, but it might be a place to start if you're not using it already. We did not see an official GCS client for Rust, but that's part of what I plan to ask more about internally.
As soon as I have more information, I'll be in touch. Have a great weekend.
by 61c24882bce5e000697cf541
We had another call with the Google team today. Our squad was: Eduardo Ohe Andrei Mustuc and Bohdan Khorolets
The Google team brought new people into the loop who knew about the Requester Pays feature.
They argued that GCS was designed deliberately: they didn’t want to repeat AWS’s mistakes, which is why the API is not 100% S3-compatible.
Quick recap: the only blocker to adopting GCS is that we want to keep the Lake Framework API unchanged, and we don’t want to write new features or swap out the S3 client the Lake Framework uses under the hood (the official AWS S3 Rust client, by the way).
The main difference in the GCS API is that the Requester Pays feature requires a PROJECT_ID to be provided so GCS knows which project to bill for usage of the service, while AWS derives this information from the credentials provided (just the access key and secret access key).
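The difference can be sketched at the HTTP-header level. This is an illustrative sketch, not Lake Framework code: the helper functions and the project ID are invented, though the header names (`x-amz-request-payer`, `x-goog-user-project`) are the documented ones for the two Requester Pays implementations.

```rust
use std::collections::HashMap;

// AWS only needs this flag; the account to bill is derived from the
// signing credentials (access key + secret access key).
fn s3_requester_pays_headers() -> HashMap<&'static str, String> {
    let mut h = HashMap::new();
    h.insert("x-amz-request-payer", "requester".to_string());
    h
}

// GCS additionally requires the billing project to be named explicitly,
// which is exactly the extra input the Lake Framework API never had to ask for.
fn gcs_requester_pays_headers(project_id: &str) -> HashMap<&'static str, String> {
    let mut h = HashMap::new();
    h.insert("x-goog-user-project", project_id.to_string());
    h
}

fn main() {
    let aws = s3_requester_pays_headers();
    let gcs = gcs_requester_pays_headers("my-gcp-project");
    assert_eq!(aws.get("x-amz-request-payer").unwrap(), "requester");
    assert_eq!(gcs.get("x-goog-user-project").unwrap(), "my-gcp-project");
    println!("AWS: {:?}\nGCS: {:?}", aws, gcs);
}
```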
Matthew from Google said similar behavior should be achievable on GCS using service accounts; however, it does not work that way today. They promised to find out whether adding this task to their working loop is possible.
From our side, we’re taking a few action items:
Eduardo Ohe Andrei Mustuc feel free to add anything I could’ve missed and you think is relevant to be stored for history.
by 61c24882bce5e000697cf541
Some context: in AWS, an IAM user is tied to a single master account ID (one billing account only); GCP, on the other hand, has all projects owned by an organisation, possibly with multiple billing accounts (if separate billing is enabled).
GCP service accounts key files (json) contain all the relevant data like:
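A typical service account key file has roughly this shape (all values below are placeholders, shown only to illustrate which fields are present):

```json
{
  "type": "service_account",
  "project_id": "my-gcp-project",
  "private_key_id": "PLACEHOLDER",
  "private_key": "-----BEGIN PRIVATE KEY-----\nPLACEHOLDER\n-----END PRIVATE KEY-----\n",
  "client_email": "lake-indexer@my-gcp-project.iam.gserviceaccount.com",
  "client_id": "PLACEHOLDER",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}
```

Note that `project_id` is baked into the key file itself, which is why GCP clients can usually resolve the billing project from the credentials alone.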
I’m mentioning this because GCP needs the project ID even for features other than Requester Pays.
by 63b74c4d7cde7bff9d7ac618
Hi Bohdan Khorolets and Andrei Mustuc , some additional points:
by 63b75612741248746bf8a243
Alibaba Cloud support will be needed soon: Near Foundation announced a deal with them, and there is a lot of effort to bring developers and web3 users from China to NEAR. Given that they can’t access Google or AWS services, they will need their own lake. So we need to create an internal abstraction for writing to any cloud provider, configured through a unique set of config variables, which is not known right now.
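Such an abstraction could be sketched as a trait over object storage. This is a hypothetical design (the trait and type names are invented, not existing Lake Framework code), assuming each provider only needs to put and get serialized block data; real implementations would wrap the AWS, GCS, or Alibaba OSS clients and carry provider-specific config.

```rust
use std::collections::HashMap;

// Hypothetical provider-agnostic storage interface.
trait ObjectStorage {
    fn put(&mut self, key: &str, data: Vec<u8>) -> Result<(), String>;
    fn get(&self, key: &str) -> Result<Vec<u8>, String>;
}

// In-memory stand-in used here only to demonstrate the interface.
struct InMemoryStorage {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStorage for InMemoryStorage {
    fn put(&mut self, key: &str, data: Vec<u8>) -> Result<(), String> {
        self.objects.insert(key.to_string(), data);
        Ok(())
    }
    fn get(&self, key: &str) -> Result<Vec<u8>, String> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| format!("object not found: {key}"))
    }
}

fn main() {
    let mut storage = InMemoryStorage { objects: HashMap::new() };
    // Keys could mirror the Lake bucket layout: <block_height>/block.json.
    storage.put("000000001/block.json", b"{}".to_vec()).unwrap();
    assert_eq!(storage.get("000000001/block.json").unwrap(), b"{}".to_vec());
    println!("ok");
}
```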
by 62322ab11c09d200701471ac
See https://pagodaplatform.atlassian.net/wiki/spaces/EAP/pages/363298818/Indexers to address naming confusion if any
Context
Lake indexers are currently running for mainnet and testnet. Lake Indexer instances are writing data to AWS S3 buckets (separate for each chain).
We are aware of an existing limitation on the AWS side: S3 supports roughly 5,500 GET/HEAD requests per second per prefix in a single bucket, regardless of the client. Those limits are not flexible and cannot be increased.
Although we don’t have proper monitoring (and I am not sure we even can have it), we’ve built an approximate estimator in Databricks to count req/s for the current state of things.
The screenshot of the chart was created around Mar 30, 2023.
NEAR Lake Mainnet Usage Dashboard in Databricks
(the chart is broken for some reason)
UPDATE:
The main takeaway from this chart is that we see around 2,000 req/s, with spikes of around 2,500 req/s, on our mainnet S3 bucket. So we’re roughly halfway to the limit.
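As a quick sanity check on "halfway", assuming the per-prefix limit of 5,500 req/s mentioned above (a back-of-the-envelope calculation, not part of the Databricks estimator):

```rust
fn main() {
    // Figures from the Databricks estimate and the AWS S3 per-prefix limit.
    let limit: f64 = 5_500.0;
    let steady: f64 = 2_000.0;
    let spike: f64 = 2_500.0;

    let steady_util = steady / limit * 100.0; // ~36%
    let spike_util = spike / limit * 100.0; // ~45%
    assert!(steady_util > 36.0 && steady_util < 37.0);
    assert!(spike_util > 45.0 && spike_util < 46.0);
    println!("steady: {steady_util:.0}%, spikes: {spike_util:.0}%");
}
```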
TL;DR
GCS alternative
Google Cloud Platform is our company-wide infrastructure of choice. The Google team wonders why we don’t use GCS (Google’s S3-compatible storage) for our Lake indexer.
I haven’t talked to them yet, but they are ready to help copy the data from S3 to GCS for us. So it’s not that complex for us to spin up Lake indexers that write to GCS (this kind of customization has been built into the Lake Indexer from the beginning; kudos to me and a community contributor for the PR enabling it in the early days of Lake Indexer).
To be clear, I am not talking about deprecating AWS S3; that could harm the community. We want to add GCS as an alternative.
The good news about GCS is that its bucket limits are the same as AWS’s. However, these limits can be raised if we give the Google team good reasoning.
TL;DR
FAQs
Here are the questions I’ve been asked frequently over the last year
Why do we use AWS in the first place?
When we released Lake Indexer (+ Framework), GCS wasn’t S3-compatible. We wanted to use something the majority of developers in the community are familiar with, and everybody knows what S3 is.
Why did you choose JSON instead of another format?
A couple of reasons:
How to achieve it
ND-518