fix: publish to IPNI async

alanshaw commented 2 months ago

This PR fixes an issue where a call to space/index/add does not have enough time to complete with the timeout forced by AWS API Gateway (29s).

When a CAR file has many thousands of blocks, it can take >29s to write all the blocks to the old E-IPFS dynamo table and to the E-IPFS multihashes queue (which feeds IPNI advertisements).

When I previously looked at the execution times for the lambdas I considered this possible, but clearly I did not encounter an upload with thousands of blocks in the time window I observed.

The PR introduces an IPNI stack. It currently consists of 2 queues and 2 lambda consumers: one queue is for publishing to IPNI and the other is for writing to dynamo.

Before

graph TD;
  subgraph w3infra
  upload-api[upload-api index/add handler]
  end
  subgraph E-IPFS
  upload-api-->|batch of up to 10x multihashes| multihashes-queue[SQS multihashes-queue];
  upload-api-->|batch of up to 25x entries| blocks-cars-position[DynamoDB blocks-cars-position table];
  end
  multihashes-queue-->|...| IPNI;

The issue is:

- The existing multihashes SQS queue expects 1 multihash per message, and an SQS batch send accepts at most 10 messages per request.
- Writing to the DynamoDB table allows at most 25 puts per batch write.

This times out for indexes of many thousands of entries and causes the API Gateway service to respond with 503.
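To make the scale concrete, here is a rough back-of-the-envelope sketch of how many requests the handler needs for a given number of blocks under the two batch limits above. It assumes the requests run one after another, which may not match the actual handler; the function names are made up for illustration.

```python
import math

SQS_BATCH_MAX = 10          # messages per SQS batch send
DYNAMO_BATCH_MAX = 25       # puts per DynamoDB batch write
GATEWAY_TIMEOUT_MS = 29_000 # API Gateway timeout from the description above


def requests_needed(num_blocks: int) -> dict:
    """Count the batch requests needed to index `num_blocks` blocks."""
    return {
        "sqs_batches": math.ceil(num_blocks / SQS_BATCH_MAX),
        "dynamo_batches": math.ceil(num_blocks / DYNAMO_BATCH_MAX),
    }


def budget_per_request_ms(num_blocks: int) -> float:
    """Average time available per request if all requests run sequentially
    (an assumption for illustration, not a measurement)."""
    counts = requests_needed(num_blocks)
    total = counts["sqs_batches"] + counts["dynamo_batches"]
    return GATEWAY_TIMEOUT_MS / total


if __name__ == "__main__":
    print(requests_needed(10_000))
    print(round(budget_per_request_ms(10_000), 1))
```

For 10,000 blocks that is 1,000 SQS batches plus 400 DynamoDB batches, leaving roughly 20.7 ms per request under the sequential assumption.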

After

graph TD;
  subgraph w3infra
  upload-api[upload-api index/add handler]-->|1 message with up to 3000x multihashes| block-advert-publisher-queue[SQS block-advert-publisher-queue];
  block-advert-publisher-queue-->block-advert-publisher-consumer(ʎ-consumer);
  upload-api-->|1 message with up to 2000x index entries| block-index-writer-queue[SQS block-index-writer-queue];
  block-index-writer-queue-->block-index-writer-consumer(ʎ-consumer);
  end
  subgraph E-IPFS
  block-advert-publisher-consumer-->|300x batch of 10 multihashes| multihashes-queue[SQS multihashes-queue];
  block-index-writer-consumer-->|80x batch of 25 entries| blocks-cars-position[DynamoDB blocks-cars-position table];
  end
  multihashes-queue-->|...| IPNI;

The solution is to create queues that accept messages containing many items each. The queue consumers then use batch sends to write to the existing E-IPFS multihashes queue and dynamo table.
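The two-level split can be sketched roughly as follows. This is only an illustration of the chunking, not the code in this PR: the function names are hypothetical, the limits are the ones from the diagram above, and no AWS calls are made.

```python
from itertools import islice
from typing import Iterable, Iterator, List

MAX_MULTIHASHES_PER_MESSAGE = 3000  # per message on block-advert-publisher-queue
SQS_BATCH_MAX = 10                  # per batch send to the E-IPFS multihashes queue


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def producer_messages(multihashes: Iterable[str]) -> List[List[str]]:
    """What the index/add handler would enqueue: a few large messages."""
    return list(chunked(multihashes, MAX_MULTIHASHES_PER_MESSAGE))


def consumer_batches(message: List[str]) -> List[List[str]]:
    """What a lambda consumer would do with one message: many small batches."""
    return list(chunked(message, SQS_BATCH_MAX))


if __name__ == "__main__":
    hashes = [f"mh-{i}" for i in range(7000)]  # placeholder multihashes
    messages = producer_messages(hashes)
    print([len(m) for m in messages])
    print(len(consumer_batches(messages[0])))
```

With 7,000 placeholder multihashes the handler enqueues 3 messages (3000, 3000 and 1000 items), and the consumer turns a full message into 300 batches of 10. The same shape applies to the index writer with 2000 entries per message and batches of 25.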

Each lambda consumer gets 15 minutes to execute, which should be more than enough time to write to its target.

Note that this essentially re-introduces an unknown async delay between index/add and the data becoming available on the gateway, due to the queue. I had hoped to avoid that with the transition to the blob protocol.

seed-deploy[bot] commented 2 months ago

View stack outputs

- **pr395-w3infra-BillingDbStack**

  | Name | Value |
  | -- | -- |
  | customerTableName | pr395-w3infra-customer |
  | spaceDiffTableName | pr395-w3infra-space-diff |
  | spaceSnapshotTableName | pr395-w3infra-space-snapshot |
  | usageTable | pr395-w3infra-usage |

- **pr395-w3infra-BillingStack**

  | Name | Value |
  | -- | -- |
  | ApiEndpoint | https://8c7gepavq0.execute-api.us-east-2.amazonaws.com |
  | billingCronHandlerURL | https://d2pfz3iqdnyrk5hz57pxqbnftq0fkkhl.lambda-url.us-east-2.on.aws/ |
  | CustomDomain | https://pr395.billing.web3.storage |

- **pr395-w3infra-CarparkStack**

  | Name | Value |
  | -- | -- |
  | BucketName | carpark-pr395-0 |
  | Region | us-east-2 |

- **pr395-w3infra-RoundaboutStack**

  | Name | Value |
  | -- | -- |
  | ApiEndpoint | https://538675qdwl.execute-api.us-east-2.amazonaws.com |
  | CustomDomain | https://pr395.roundabout.web3.storage |

- **pr395-w3infra-UcanInvocationStack**

  | Name | Value |
  | -- | -- |
  | invocationBucketName | invocation-store-pr395-0 |
  | taskBucketName | task-store-pr395-0 |
  | workflowBucketName | workflow-store-pr395-0 |

- **pr395-w3infra-UploadApiStack**

  | Name | Value |
  | -- | -- |
  | ApiEndpoint | https://zifxqr2ji0.execute-api.us-east-2.amazonaws.com |
  | CustomDomain | https://pr395.up.web3.storage |

- **pr395-w3infra-BusStack**
- **pr395-w3infra-ElasticIpfsStack**
- **pr395-w3infra-FilecoinStack**
- **pr395-w3infra-IpniStack**
- **pr395-w3infra-ReplicatorStack**
- **pr395-w3infra-UcanFirehoseStack**
- **pr395-w3infra-UploadDbStack**

hannahhoward commented 2 months ago

@hannah will review today!

alanshaw commented 2 months ago

> ... and I don't fully understand the diff between EIPFSService & IPNIService

The IPNI stack was consuming the E-IPFS stack. I just separated them out in case we wanted to use any E-IPFS stack components in other stacks. In hindsight I think that's unlikely.

In general we separate out stacks to avoid circular dependencies. For example, we have an upload-api stack and an upload-db stack. The billing stack uses the upload-db stack. The upload-api stack uses the billing stack AND the upload-db stack. This wouldn't be possible if the DB was not a separate stack.
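A small sketch of why the split matters, modelling the stacks as a plain dependency graph. This is not how the deployment tooling resolves stacks; it only uses Python's standard `graphlib` to show that the separated layout has a valid deploy order while the merged layout is circular. The `deploy_order` helper and the graph literals are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional, Set


def deploy_order(deps: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return a valid deploy order, or None if the dependencies are circular.

    `deps` maps each stack to the set of stacks it uses.
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Layout described above: the DB lives in its own stack.
separate = {
    "upload-db": set(),
    "billing": {"upload-db"},
    "upload-api": {"billing", "upload-db"},
}

# Hypothetical layout where the DB is part of the upload-api stack:
# billing needs upload-api (for the DB) and upload-api needs billing.
merged = {
    "billing": {"upload-api"},
    "upload-api": {"billing"},
}

if __name__ == "__main__":
    print(deploy_order(separate))  # ['upload-db', 'billing', 'upload-api']
    print(deploy_order(merged))    # None
```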

seed-deploy[bot] commented 2 months ago

Stack outputs updated

- **pr395-w3infra-BillingDbStack**

  | Name | Value |
  | -- | -- |
  | customerTableName | pr395-w3infra-customer |
  | spaceDiffTableName | pr395-w3infra-space-diff |
  | spaceSnapshotTableName | pr395-w3infra-space-snapshot |
  | usageTable | pr395-w3infra-usage |

- **pr395-w3infra-BillingStack**

  | Name | Value |
  | -- | -- |
  | ApiEndpoint | https://8c7gepavq0.execute-api.us-east-2.amazonaws.com |
  | billingCronHandlerURL | https://d2pfz3iqdnyrk5hz57pxqbnftq0fkkhl.lambda-url.us-east-2.on.aws/ |
  | CustomDomain | https://pr395.billing.web3.storage |

- **pr395-w3infra-CarparkStack**

  | Name | Value |
  | -- | -- |
  | BucketName | carpark-pr395-0 |
  | Region | us-east-2 |

- **pr395-w3infra-RoundaboutStack**

  | Name | Value |
  | -- | -- |
  | ApiEndpoint | https://538675qdwl.execute-api.us-east-2.amazonaws.com |
  | CustomDomain | https://pr395.roundabout.web3.storage |

- **pr395-w3infra-UcanInvocationStack**

  | Name | Value |
  | -- | -- |
  | invocationBucketName | invocation-store-pr395-0 |
  | taskBucketName | task-store-pr395-0 |
  | workflowBucketName | workflow-store-pr395-0 |

- **pr395-w3infra-UploadApiStack**

  | Name | Value |
  | -- | -- |
  | ApiEndpoint | https://zifxqr2ji0.execute-api.us-east-2.amazonaws.com |
  | CustomDomain | https://pr395.up.web3.storage |

- **pr395-w3infra-BusStack**
- **pr395-w3infra-FilecoinStack**
- **pr395-w3infra-IndexerStack**
- **pr395-w3infra-ReplicatorStack**
- **pr395-w3infra-UcanFirehoseStack**
- **pr395-w3infra-UploadDbStack**

storacha / w3infra

fix: publish to IPNI async #395