storacha-network / w3infra

🏗️ Infra for the w3up UCAN protocol implementation
Other
12 stars 4 forks source link

Decide on DB schema + implement #25

Closed olizilla closed 1 year ago

olizilla commented 1 year ago

Currently, uploaderDID tracks the resource that an upload is invoked with. When we are using spaces, the with is the space DID. so UploaderDID is the space DID which feels wrong.

see: https://github.com/web3-storage/upload-api/blob/75f0b5327d3dc2cface553bf60f9e9ebc17598c9/api/test/service/store.test.js#L85-L86

olizilla commented 1 year ago

If we focus the schema change to just include the properties we need to implement the capabilities, and updating names to be consistent with our new terminology then we end up with:

store table

ucan terminology flavour - deliberately stick to the language of the invocation we are capturing

issuer resource link size origin? ucanCID insertedAt
did:key:space did:key:space bagy...1 101 bagy...0 baf...ucan 2022-11-23T...
string string string number string string string (ISO 8601)
agent DID space DID stored CAR CID CAR size bytes prev shard CAR CID invocation CID wen db record

domain mapped flavour - name columns after how we expect them to be commonly used.

agentDID spaceDID carCID size origin? ucanCID insertedAt
did:key:space did:key:space bagy...1 101 bagy...0 baf...ucan 2022-11-23T...
string string string number string string string (ISO 8601)
agent DID space DID stored CAR CID CAR size bytes prev shard CAR CID invocation CID wen db record

notes

ucanCID is the CID for the complete, validated UCAN invocation that was sent to us and caused this row to be inserted. The current proposal is to store the bytes of that UCAN in s3 keyed by it's (raw) CID.

open questions

olizilla commented 1 year ago

References

alanshaw commented 1 year ago
gobengo commented 1 year ago

wrt the thing I mentioned about billing. From what I grok about the way the store protocol works, I think the aggregate we'll probably need is

FWIW here is the 'usage reports' api stripe has, which we'll want to to update if we use stripe as the usage-based billing provider.

olizilla commented 1 year ago

The proposed dynamodb schema:

/** @type import('@serverless-stack/resources').TableProps */
export const storeTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    car: 'string',          // `bagy...1`
    size: 'number',         // `101`
    origin: 'string',       // `bagy...0` (prev CAR CID. optional)
    agent: 'string',        // `did:key:agent` (issuer of ucan)
    ucan: 'string',         // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + car must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'car' },
}

/** @type import('@serverless-stack/resources').TableProps */
export const uploadTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    sk: 'string',           // `root#shard` + space must be unique for dynamo index constraint
    root: 'string',         // `baf...x`
    shard: 'string',        // `bagy...1
    agent: 'string',        // `did:key:agent` (issuer of ucan)
    ucan: 'string',         // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + sk must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'sk' },
}

References

Gozala commented 1 year ago

Few notes:

  1. I think we need at least one more field for provider DID. Initially we don’t plan on allowing more than one storage provider, but we do want that long term. So we’d want to track it.
  2. I’m not sure agentDID is better than issuer because there’s no way to enforce that invariant and users can issue invocations with spaceDID instead.
    • I’m also not sure why do we even want to store this outside of UCAN itself. Are there any queries that would require it ?
  3. I think ucanCID / ucan shod be called invocation instead, because that would describe semantics, while UCAN describes formatting.
  4. Reason I chose not to use car as field in capabilities is because it again describes format and not semantics. We may also add support for different formats in the future so calling things by format isn’t ideal IMO. I would use same name as in capability, if that name is bad we probably should rename that in capability as well.
  5. Do we store invocations that were proved invalid ? Or denied because storage limits?
  6. We should also store response CIDs somewhere.
  7. We may get valid requests to store, but user may never actually upload to S3 don’t know how and where to track this, but we should somewhere. Also pre-signed URLs may expire and probably we should note when / if it does
olizilla commented 1 year ago

I think we need at least one more field for provider DID...

Yep! I'm gonna leave that out from this iteration, and add it in once we tackle providers. We can add later without needing a migration.

I’m not sure agentDID is better than issuer because there’s no way to enforce that invariant

I agree. I'm happy to call it issuer

I’m also not sure why do we even want to store this (the issuer) outside of UCAN itself

I was assuming some need to show "which identity actually invoked the thing" but we could drop it for now.

I think ucanCID / ucan should be called invocation instead

I can deal with calling it invocation. ucan is so short and neat tho.

Reason I chose not to use car as field in capabilities is because it again describes format and not semantics. We may also add support for different formats in the future so calling things by format isn’t ideal IMO. I would use same name as in capability, if that name is bad we probably should rename that in capability as well.

I think folks are struggling with link as the term for a CIDlike thing. It's very very generic, where CID was clear to folks what they were dealing with. A case could be made for renaming it, but I don't know if there is appetite for more renames.

However, I'm generally in favour of having the db labels match the labels in the invocation wherever sensible. I'll rename it link here and we can see if folks want to discuss if link should change in the capability, and I'll adopt the outcome of that, if/when it happens.

Do we store invocations that were proved invalid ? Or denied because storage limits?

not currently but i think we should track that somewhere. Do you have a use-case in mind? My assumption we'd want this for at least tracking metrics, and debugging. Would an aggregated count here be sufficient or would we want dump them in a bucket, or track them in a full on db table?

We should also store response CIDs somewhere.

Can do! Is the use-case here analytics and debugging? or something more?

We may get valid requests to store, but user may never actually upload to S3 don’t know how and where to track this, but we should somewhere. Also pre-signed URLs may expire and probably we should note when / if it does

ACK. Yes, tracking "when does the ObjectCreated event occur in s3" is on my radar. Tracking the expire time for the URL is a good idea!

olizilla commented 1 year ago

That gives us an updated dynamodb schema like so:

/** @type TableProps */
export const storeTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    link: 'string',         // `bagy...1`
    size: 'number',         // `101`
    origin: 'string',       // `bagy...0` (prev CAR CID. optional)
    issuer: 'string',       // `did:key:agent` (issuer of ucan)
    invocation: 'string',   // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + car must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'link' },
}

/** @type TableProps */
export const uploadTableProps = {
  fields: {
    space: 'string',        // `did:key:space`
    sk: 'string',           // `root#shard` + space must be unique for dynamo index constraint
    root: 'string',         // `baf...x`
    shard: 'string',        // `bagy...1
    issuer: 'string',       // `did:key:agent` (issuer of ucan)
    invocation: 'string',   // `baf...ucan` (CID of invcation UCAN)
    insertedAt: 'string',   // `2022-12-24T...`
  },
  // space + sk must be unique to satisfy index constraint
  primaryIndex: { partitionKey: 'space', sortKey: 'sk' },
}
Gozala commented 1 year ago

I think folks are struggling with link as the term for a CIDlike thing. It's very very generic, where CID was clear to folks what they were dealing with. A case could be made for renaming it, but I don't know if there is appetite for more renames.

However, I'm generally in favour of having the db labels match the labels in the invocation wherever sensible. I'll rename it link here and we can see if folks want to discuss if link should change in the capability, and I'll adopt the outcome of that, if/when it happens.

I agree link is not a great name & in fact fails to describe semantics. Maybe content is better ?

we also have an issue on file suggesting that small cars should be inlined instead, in which case link (and cid) would be even less meaningful.

Yep! I'm gonna leave that out from this iteration, and add it in once we tackle providers. We can add later without needing a migration.

Sounds good! Just to be clear though, it will affect primary index & we bill who hired a provider.

I can deal with calling it invocation. ucan is so short and neat tho.

I think alternatively it could be called source which is bit shorter & still meaningful.

In regards to other info I brought up. I don’t have specific use cases, it’s just all these invocations have certain state that we don’t seem to capture. It would be difficult to tell what the state of the individual invocation is, or how did invocation progressed through the system.

we also wanted to include some info in the receipt so introspection seems important