storacha-network / w3infra

🏗️ Infra for the w3up UCAN protocol implementation

fix: type error on JSON.stringify in Agent Store #396

Closed hannahhoward closed 1 month ago

hannahhoward commented 1 month ago

Goals

I believe this to be the root cause of https://github.com/storacha-network/project-tracking/issues/75

What I can see is that our invocations are frequently throwing exceptions on the API:

2024-06-22T02:16:16.257Z    b1e35d62-a193-4192-8b2f-e30484c63378    WARN    WriteError: Write to store has failed: TypeError: Do not know how to serialize a BigInt
    at AgentMessageStore.write (/upload-api/stores/agent.js:57:16)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at handle (/node_modules/@web3-storage/upload-api/src/lib.js:111:23)
    at ucanInvocationRouter (/upload-api/functions/ucan-invocation-router.js:298:20)
    at processResult (/node_modules/src/awslambda.ts:326:50) {
  cause: TypeError: Do not know how to serialize a BigInt
      at JSON.stringify (<anonymous>)
      at assert (/upload-api/stores/agent/stream.js:136:16)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at write (/upload-api/stores/agent/stream.js:73:18)
      at AgentMessageStore.write (/upload-api/stores/agent.js:53:47)
      at handle (/node_modules/@web3-storage/upload-api/src/lib.js:111:23)
      at ucanInvocationRouter (/upload-api/functions/ucan-invocation-router.js:298:20)
      at processResult (/node_modules/src/awslambda.ts:326:50),
  payload: {
    data: Message {
      root: [Object],
      store: [Map],
      _invocations: [],
      _receipts: [Map]
    },
    source: { headers: [Object], body: [Uint8Array] },
    index: [ [Object], [Object] ]
  },
  writer: AgentMessageStore {
    connection: { store: [Object], stream: [Object] },
    invocations: InvocationsIndex { connection: [Object] },
    receipts: ReceiptsIndex { connection: [Object] }
  }
}

This happens on maybe 1 in 10 invocations. Moreover, it happens AFTER the actual invocation completes, when we're publishing the receipt to the store and then to the Kinesis Stream.

The write to the store works, but this error occurs when writing to the Kinesis Stream... which is what gets picked up by the billing API and used to calculate space diffs.

I can see on my own account that when I upload, an exception gets triggered, and my space DID never shows up in the UCAN stream handler in the billing API.

The only bit I can't figure out is why the size values are coming in as BigInts on the receipts. There are a couple of things that could be happening here -- the size comes directly from the invocation (the receipt is built from the invocation), which in turn is decoded from CBOR in the incoming CAR. There is a scenario in which CBOR can decode an integer as a BigInt, though I genuinely am not sure why that would happen with the sizes in question.
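As a loose illustration of that scenario (a sketch only -- I haven't verified this against the exact codec versions we run): @ipld/dag-cbor can decode integers outside the JavaScript safe-integer range as BigInt, and any such value then breaks JSON.stringify:

```js
import * as dagCBOR from '@ipld/dag-cbor'

// An integer above Number.MAX_SAFE_INTEGER round-trips through dag-cbor as a
// BigInt (exact behaviour may vary by codec version).
const bytes = dagCBOR.encode({ size: 2n ** 60n })
const decoded = dagCBOR.decode(bytes)
console.log(typeof decoded.size) // 'bigint'

// JSON.stringify has no default serialization for BigInt, so this throws the
// exact error seen in the log above.
JSON.stringify(decoded) // TypeError: Do not know how to serialize a BigInt
```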

Implementation

So a simple fix here is to convert BigInts to Numbers in the call to JSON.stringify. Theoretically we really shouldn't have values over Number.MAX_SAFE_INTEGER, though there is probably a better fix.
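A rough sketch of that idea (illustrative only -- `payload` and the replacer name are stand-ins, not the actual code in the patch):

```js
// A JSON.stringify replacer that downcasts BigInt to Number. The sizes we
// expect are well under Number.MAX_SAFE_INTEGER, so this is lossless for our
// data, but it would silently lose precision for anything larger.
const bigIntToNumber = (_key, value) =>
  typeof value === 'bigint' ? Number(value) : value

// e.g. wherever upload-api/stores/agent/stream.js currently serializes the
// outgoing record:
const body = JSON.stringify(payload, bigIntToNumber)
```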

It might be useful to actually try to create the conditions in which a test would fail... a bit awkward because the agent store tests live over in w3up. Or try to replicate the exact sequence in which this happens.

Then again, the simplest thing may be to merge this, see if it fixes the space usage issue, and proceed from there.

Note that this does NOT solve the historical issue that space diffs may be out of date due to missed calls to store space usage deltas. We probably need to figure out a way to replay these values or just recalc space usage from current.

seed-deploy[bot] commented 1 month ago
View stack outputs

- **pr396-w3infra-BillingDbStack**

  Name | Value
  -- | --
  customerTableName | pr396-w3infra-customer
  spaceDiffTableName | pr396-w3infra-space-diff
  spaceSnapshotTableName | pr396-w3infra-space-snapshot
  usageTable | pr396-w3infra-usage

- **pr396-w3infra-BillingStack**

  Name | Value
  -- | --
  ApiEndpoint | https://rcq8kq47j3.execute-api.us-east-2.amazonaws.com
  billingCronHandlerURL | https://ebg5dnzypcw4avxgdxv3ymlkbu0srbrj.lambda-url.us-east-2.on.aws/
  CustomDomain | https://pr396.billing.web3.storage

- **pr396-w3infra-CarparkStack**

  Name | Value
  -- | --
  BucketName | carpark-pr396-0
  Region | us-east-2

- **pr396-w3infra-RoundaboutStack**

  Name | Value
  -- | --
  ApiEndpoint | https://bgp56js104.execute-api.us-east-2.amazonaws.com
  CustomDomain | https://pr396.roundabout.web3.storage

- **pr396-w3infra-UcanInvocationStack**

  Name | Value
  -- | --
  invocationBucketName | invocation-store-pr396-0
  taskBucketName | task-store-pr396-0
  workflowBucketName | workflow-store-pr396-0

- **pr396-w3infra-UploadApiStack**

  Name | Value
  -- | --
  ApiEndpoint | https://cdihb6oa7e.execute-api.us-east-2.amazonaws.com
  CustomDomain | https://pr396.up.web3.storage

- **pr396-w3infra-BusStack**
- **pr396-w3infra-FilecoinStack**
- **pr396-w3infra-ReplicatorStack**
- **pr396-w3infra-UcanFirehoseStack**
- **pr396-w3infra-UploadDbStack**

alanshaw commented 1 month ago

I saw this also! See https://github.com/storacha-network/w3infra/pull/393 - I was convinced it was an "error" receipt, i.e. a receipt where an error had been caught and we'd tried to return it: `...} catch (err) { return { error: err } }`. Caught errors can come from anywhere, so I thought it was the AWS library attaching a BigInt to the error object...but perhaps this is not right if the error is still occurring.

That said, I'm pretty sure it's the receipt out that's causing this. There's been a refactor here, but essentially nothing has changed in terms of encoding the data to be put to the stream (at least as far as I can tell). However, we do have new capabilities (the blob protocol and index/add) that may be triggering it.

If you can log out the receipt, I think that will help with debugging further.

hannahhoward commented 1 month ago

@alanshaw I added some code to log the receipts for now -- note this isn't really a final fix but will at least get the items written and give us insight on what's going wrong.
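For reference, the logging is roughly along these lines (a sketch only -- `receipt` is a stand-in, not the actual variable in the patch):

```js
// Log the outgoing receipt with a BigInt-tolerant replacer so the offending
// field shows up in the logs instead of crashing the log call itself.
console.log(
  'receipt out',
  JSON.stringify(receipt, (_key, value) =>
    typeof value === 'bigint' ? `${value}n` : value
  )
)
```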