+cc @jsindy so he can follow discussions here.
Did some minor reading.
It should be fine to just add .requestPayer("requester") to ListObjectsRequest.builder() in Downloader.java.
See this example of a JavaScript SDK request: https://aws.amazon.com/blogs/developer/the-aws-sdk-for-javascript-now-supports-amazon-s3-requester-pays-buckets/
This way, anyone running a mirror node would have acknowledged they are a requester. Then, when we run, we'd essentially be the equivalent of a subscriber to our own bucket and would be billed appropriately, as we are now. S3 has logic to handle this based on account details.
It might be necessary, or at least a good idea, to create a new account in S3 with the appropriate permissions that would match a non-Hedera client requester.
It's easy to make an existing bucket RP or to switch back and forth (non-RP <--> RP): https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-requester-pays-console.html
In my tests, I found that the effect of toggling RP on/off was immediately visible to clients.
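For reference, the same toggle can be done programmatically. A minimal sketch with the AWS SDK for Java v2 follows; the bucket name and the default credential/region resolution are assumptions for illustration, not anything from the actual setup:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.Payer;
import software.amazon.awssdk.services.s3.model.PutBucketRequestPaymentRequest;
import software.amazon.awssdk.services.s3.model.RequestPaymentConfiguration;

public class ToggleRequesterPays {

    public static void main(String[] args) {
        // Assumes default credentials/region resolution; the bucket name is illustrative.
        try (S3Client s3 = S3Client.create()) {
            s3.putBucketRequestPayment(PutBucketRequestPaymentRequest.builder()
                    .bucket("appy1-requester-pays")
                    .requestPaymentConfiguration(RequestPaymentConfiguration.builder()
                            .payer(Payer.REQUESTER) // Payer.BUCKET_OWNER switches RP back off
                            .build())
                    .build());
        }
    }
}
```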
Setup:
Used my personal AWS account to set up two buckets: appy1 and appy1-requester-pays.
Ran following tests: https://github.com/hashgraph/hedera-mirror-node/blob/s3_requester_pays/hedera-mirror-importer/src/test/java/com/hedera/mirror/importer/downloader/RequesterPayBucketTest.java
Summary:
+-------------------+---------------+-----------+
|                   | Free bucket   | RP bucket |
+-------------------+---------------+-----------+
| anonymous request | ✓ (current)   | ✗         |
| RP request        | ✓ (migrating) | ✓ (final) |
+-------------------+---------------+-----------+
It would be great if someone else could run the tests too, using an access/secret key not belonging to the bucket owner (me).
Key takeaway: Since setting requestPayer works for both non-RP and RP buckets, we'll set it for all requests. In fact, this is critical for migration.
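To make the table concrete, here is a rough illustration of the anonymous vs. RP request cases with the AWS SDK for Java v2. This is not the checked-in RequesterPayBucketTest, just a sketch with an assumed bucket name, region, and environment-provided credentials:

```java
import software.amazon.awssdk.auth.credentials.AnonymousCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsRequest;
import software.amazon.awssdk.services.s3.model.RequestPayer;
import software.amazon.awssdk.services.s3.model.S3Exception;

public class RequesterPaysMatrix {

    public static void main(String[] args) {
        String rpBucket = "appy1-requester-pays"; // illustrative bucket name

        // Anonymous request against an RP bucket: rejected (the ✗ cell in the table).
        try (S3Client anonymous = S3Client.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(AnonymousCredentialsProvider.create())
                .build()) {
            anonymous.listObjects(ListObjectsRequest.builder().bucket(rpBucket).build());
        } catch (S3Exception e) {
            System.out.println("Anonymous request denied: " + e.statusCode()); // expect 403
        }

        // RP request with real credentials (resolved from the environment): accepted
        // against both free and RP buckets (the "migrating"/"final" cells in the table).
        try (S3Client requester = S3Client.builder().region(Region.US_EAST_1).build()) {
            requester.listObjects(ListObjectsRequest.builder()
                    .bucket(rpBucket)
                    .requestPayer(RequestPayer.REQUESTER)
                    .build());
        }
    }
}
```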
Very minimal.
We already have configs to set access/secret keys for the S3 client.
In Downloader.java, when making the ListObjectsRequest, we just need to add requestPayer(RequestPayer.REQUESTER). It's exactly as Nana pointed out above.
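Roughly, the change is a single extra call on the builder. The bucket/prefix/marker parameters below are placeholders for illustration, not the actual Downloader fields:

```java
import software.amazon.awssdk.services.s3.model.ListObjectsRequest;
import software.amazon.awssdk.services.s3.model.RequestPayer;

class DownloaderSketch {

    // bucketName/prefix/marker/batchSize are placeholders, not the actual Downloader fields.
    ListObjectsRequest buildListRequest(String bucketName, String prefix, String marker, int batchSize) {
        return ListObjectsRequest.builder()
                .bucket(bucketName)
                .prefix(prefix)
                .delimiter("/")
                .marker(marker)
                .maxKeys(batchSize)
                .requestPayer(RequestPayer.REQUESTER) // the only new line
                .build();
    }
}
```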
Testing: Not sure how we can have good checked-in tests for this. We currently use S3Mock for testing the Downloader, but it doesn't support authentication (which means it can't be used to test RP). I'm sure we can cook up some basic testing though.
Just configuring the access/secret key would be enough. Hedera's importer, like any other external importer, will need to be configured with IAM credentials too.
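For reference, a minimal sketch of wiring an explicit access/secret key into an SDK v2 client; the factory class, region, and parameter names are assumptions, not the importer's actual configuration:

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

class S3ClientFactory {

    // accessKey/secretKey would come from the importer's existing S3 config properties.
    static S3Client create(String accessKey, String secretKey) {
        return S3Client.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create(accessKey, secretKey)))
                .build();
    }
}
```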
There are a few ways we can do the migration:
Option 1: Toggle the existing free bucket to RP.
a. Make the necessary code changes and release them.
b. Announce to partners a date (X months later) as the deadline to update their importers (current --> migrating state in the table above).
c. On that date, during a migration window, toggle our free bucket to RP (migrating --> final).
Option 2: Stand up a duplicate bucket that is RP from the start.
a. Set up the duplicate RP bucket.
b. Announce to partners a date (X months later) as the deadline to update their importers and switch to the RP bucket (current --> final state in the table above).
c. On that date, during a migration window, delete the free bucket.
Option 3: Migrate via the GCP bucket first.
a. Verify the GCP bucket is not used by anyone (otherwise this strategy doesn't work).
b. Test running a mirror node against the GCP bucket (to build trust). Convert it to RP. Build more trust.
c. Announce to partners a date (X months later) as the deadline to update their importers and use the GCP RP bucket (current --> final state in the table above).
d. On that date, without a migration window, make the AWS bucket RP.
Pros/Cons:
S3 cost for a single mirror node: ~$350 per month (record stream: ~$340; balance stream: ~$5). Costs depend significantly on polling frequency; the above assumes a polling interval of 500ms for the record stream and 30s for the balance stream. For example, increasing the record stream poll interval from 500ms to 1s would bring the cost down to ~$175.
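The scaling behind that estimate is just proportionality of request cost to request rate, so doubling the record stream poll interval roughly halves its bill:

$$
C_{\text{total}}(1\,\mathrm{s}) \;\approx\; \frac{C_{\text{record}}(0.5\,\mathrm{s})}{2} + C_{\text{balance}} \;\approx\; \frac{\$340}{2} + \$5 \;\approx\; \$175 \text{ per month}
$$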
Cost savings for Hedera: ~$350 per month × number of external nodes.
Confirm that GCP buckets are not used by anyone right now (Brad/Josh)
We are seeing ListObject and WriteObject requests at 8-9/sec and no GetObject requests, which very likely means no one is using that bucket. If a standard-configuration mirror node were using it, we'd see ListObject requests at 40+/sec (20 per stream) plus some GetObject requests. To gain more certainty, Josh will try to find out whether there's a way to get the IPs of those clients.
Migrate GCP buckets to RP first
We'll be testing uploader compatibility using one of the pre-prod envs on Monday. We'll switch the bucket to RP and see if the uploader still works. If so, we'll call it a day and go home. JK, we'll already be at home. If not, we'll test option 2: use user access/secret keys. The user will need a default project id set in GCP; this is a tricky bit of GCP interoperability mode where service accounts don't cut it. That's how I made the mirror node work. If that also fails, option 3 would be major changes to the uploader (~weeks).
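For context, a self-contained sketch of how an S3-API client can point at GCS in interoperability mode; the endpoint is the standard GCS XML API host, while the class name and HMAC key parameters are assumptions for illustration, not the actual importer config:

```java
import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

class GcsInteropClientFactory {

    // hmacAccessKey/hmacSecret are GCS interoperability (HMAC) keys for a user
    // whose default project is set, per the comment above.
    static S3Client create(String hmacAccessKey, String hmacSecret) {
        return S3Client.builder()
                .endpointOverride(URI.create("https://storage.googleapis.com"))
                .region(Region.US_EAST_1) // required by the SDK, ignored by GCS
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create(hmacAccessKey, hmacSecret)))
                .build();
    }
}
```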
Measure latency of GCP buckets
The metric hedera_mirror_transaction_latency measures the time between a transaction achieving consensus (consensusTimestamp) and it being processed by the mirror node parser.
The flow between the two events looks like:
tx achieves consensus --> nodes write stream files --> uploaded to S3/GCP --> downloaded by mirror node --> consensus is verified --> transactions are parsed one at a time
While an individual value of this metric is not useful (since a tx may be at the start or end of a stream file), the aggregate across many files is perfect for our case. Just watching this metric for changes is enough to measure the latency impact of S3 vs GCS.
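Conceptually the metric is just a timer recording the gap between consensusTimestamp and parse time. A minimal Micrometer sketch, assuming the class and registry wiring shown here (not the importer's actual code):

```java
import java.time.Duration;
import java.time.Instant;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

class TransactionLatencyRecorder {

    private final Timer latencyTimer;

    TransactionLatencyRecorder(MeterRegistry registry) {
        // Exported to Prometheus/Kibana as hedera_mirror_transaction_latency.
        this.latencyTimer = Timer.builder("hedera.mirror.transaction.latency")
                .description("Time from consensus to parsing by the importer")
                .register(registry);
    }

    void onTransactionParsed(Instant consensusTimestamp) {
        latencyTimer.record(Duration.between(consensusTimestamp, Instant.now()));
    }

    public static void main(String[] args) {
        TransactionLatencyRecorder recorder =
                new TransactionLatencyRecorder(new SimpleMeterRegistry());
        recorder.onTransactionParsed(Instant.now().minusSeconds(3));
    }
}
```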
In Kibana, the metric looks like the chart below. An important thing to note is that Kibana doesn't show individual values, only the 30-second aggregate. This is fine for our current case.
Transaction latency before and after the switch. The odd peak around 4/6 00:00 is when I stopped the importer to switch from S3 to GCP. Latencies are in the same range, so I believe we are good here.
Setup:
GCP bucket: appy-demo-streams (exact copy of hedera-demo-streams).
Initially, bucket has RP off.
Started mirror node importer with appropriate access/secret keys set.
Test: Toggled RP on --> off --> on --> off --> on. Importer kept working smooth as butter.
All next steps/outstanding questions are devops tasks. Updated description to mention the same. Since there's no remaining task to be done by product eng, closing this ticket.
Problem
Currently the mirror node bucket is not public, so the community can't run the software or contribute to the project. We should have some way to make it easier for the community to participate without incurring a large S3 cost.
Solution
Investigate requester pays buckets. Main things are:
Followups from discussion on 3/23:
dev env is using the testnet GCP bucket. (4/6)
Next steps:
Outstanding questions:
Good to have:
Alternatives
Set up a public example bucket with a small amount of data (done)
Additional Context