Dealing with partially valid claims

olizilla commented 11 months ago

Our existing claims allow for the possibility of partially valid claims to be created. For example:

assert/partition (rootCid, [carCid,...]) maps a content cid to a set of car cids. It is therefore possible to make claims where

(valid) the rootCid and all of the blocks it links to can be found in the set of CARs
(invalid) the rootCid and none of the blocks it links to can be found in the set of CARs
(?) the rootCid and some of the blocks it links to can be found in the set of CARs

The partial claim can be used to find some of the CARs the DAG is in, but not all of them. In the absence of a fully valid claim, its existence would be strictly better than nothing.
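To make the three cases concrete, here is a minimal sketch of how a partition claim could be classified. It assumes we already know the DAG's block CIDs and the contents of each claimed CAR; `classifyPartition` and the string CIDs are hypothetical names for illustration, not part of content-claims.

```typescript
// Hypothetical helper: classify an assert/partition claim by how much of the
// DAG the claimed CARs cover. CIDs are plain strings for illustration.
type Coverage = 'valid' | 'partial' | 'invalid'

function classifyPartition(
  dagBlocks: string[],    // the root CID plus every block it links to
  carContents: string[][] // block CIDs found in each claimed CAR
): Coverage {
  // Union of every block present in any of the claimed CARs.
  const available = new Set<string>()
  for (const car of carContents) {
    for (const cid of car) available.add(cid)
  }
  const found = dagBlocks.filter(cid => available.has(cid)).length
  if (found === dagBlocks.length) return 'valid'
  return found === 0 ? 'invalid' : 'partial'
}
```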

aside: this is how our upload/add capability works today. Users send us a (rootCid, carCid) pair and send the same rootCid with a different carCid each time to build up the CAR set, like a partition claim builder. Each upload/add call is a partial (or invalid) assert/partition claim.
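A rough sketch of that "partition claim builder" behaviour, assuming each call simply adds one CAR to the set known for a root. `PartitionBuilder` is a made-up name; this is not how upload/add is actually implemented.

```typescript
// Hypothetical sketch: accumulate (rootCid, carCid) pairs into per-root CAR sets.
class PartitionBuilder {
  private parts = new Map<string, Set<string>>()

  // Record one (rootCid, carCid) pair, as a single upload/add call would.
  add(rootCid: string, carCid: string): void {
    let cars = this.parts.get(rootCid)
    if (!cars) {
      cars = new Set<string>()
      this.parts.set(rootCid, cars)
    }
    cars.add(carCid)
  }

  // The CAR set claimed so far for a root; it may still be incomplete.
  partition(rootCid: string): string[] {
    const cars = this.parts.get(rootCid)
    return cars ? Array.from(cars) : []
  }
}
```

Every intermediate state of the builder corresponds to a partial claim; only the final state (if the user ever reaches it) is the fully valid one.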

Other examples

an assert/inclusion (carCid, indexCid) claim where the car index is incomplete. Where we only want to read a given sub-dag/file/entity from a superset, and the car index includes entries for those cids, then it's usable, even though it does not index every block in the car.
an assert/inclusion (carCid, indexCid) claim where the car is incomplete or contains a block with bytes that don't match its cid. (We assert that we will store CARs as slabs of bytes; we don't ensure the CARs are valid at the block level.) Where we only want to read a given sub-dag/file/entity from a superset, and the car includes blocks for those cids, then it's usable, even though the car does not include every block listed in the index.
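The "usable for this read" test in the first example could look something like the sketch below. The `CarIndex` shape and `indexCoversRead` are assumptions for illustration; the real index format is not shown in this thread.

```typescript
// Hypothetical index shape: block CID -> byte position within the CAR.
type CarIndex = Record<string, { offset: number; length: number }>

// True when every block needed for the read has an index entry,
// whether or not the index covers the whole CAR.
function indexCoversRead(index: CarIndex, needed: string[]): boolean {
  return needed.every(cid => Object.prototype.hasOwnProperty.call(index, cid))
}
```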

olizilla commented 11 months ago

I think this is adjacent to, but not quite the same as the issue that @vasco-santos is raising around "if we make it easy and permissionless to create claims and we read all claims in an undifferentiated manner during a read, we allow for the possibility of folks slowing down or denying access to content".

An example there would be "create a million invalid partition claims for a popular content cid"

Vasco is offering us some ideas for mitigation of this in the short term... make the system more restrictive about who can make claims about what until we have a proposal for some kind of reputation system or per claim scoring.

To my ears, "you can only make claims that are about or include CIDs that you have in your space" sounds like a quick win to slow down the (potential) spammers.
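As a sketch of what that restriction might look like, assuming the strict reading where every CID in the claim must already be in the issuer's space (`mayClaim` is a hypothetical name, and a looser "at least one CID" rule would also fit the wording):

```typescript
// Hypothetical guard: accept a claim only if every CID it mentions is already
// stored in the issuer's space. This is the strict reading of the rule.
function mayClaim(spaceCids: Set<string>, claimCids: string[]): boolean {
  return claimCids.length > 0 && claimCids.every(cid => spaceCids.has(cid))
}
```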

If we prefer claims signed by the system, we could get away with nothing more than listening out for complaints and having a CLI or CI job that we can invoke ourselves to make a valid claim for any cid that is getting spammed (until such time as claim spam becomes a thing)

Gozala commented 11 months ago

I think our current problem is a misalignment between what we need on reads and what we put in claims. Specifically, if a claim does not include all the information we need for a specific read, it is going to be incomplete yet not incorrect. If a claim contains more information than we need for a specific read, we are not able to determine if the claim is correct.

This tells me that our claims are too generic, trying to cover different reads and consequently being valid for some reads and invalid (incomplete) for others. The solution here is probably to have different claims for the different kinds of reads we support, so they can't be incomplete.

As far as I understand, we currently support the following types of reads:

CAR read by CID
Block read by CID
File / Directory DAG read by CID (from gateway)

Let's consider each one and how we could satisfy them.

CAR read by CID

This is the most straightforward, using the assert/location claim. I am not sure we need to support client-issued claims here; we could still make them available, but we do not have to use them.

Although after a user has written to R2 we may want to be notified in some way.

Block read by CID

I don't think we currently have a claim that describes this well, so I'll suggest something like an assert/digest claim with the following input:

type AssertDigest = {
  digest: Uint8Array // multihash digest
  payload: CID // perhaps should be multihash as well
  offset?: number // offset within the payload
  length?: number // number of bytes being hashed
}
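A minimal sketch of how such a claim might be checked once the payload bytes are in hand. It assumes the hash function is injected by the caller and that `digest` holds the raw hash output rather than a full multihash; `verifyDigest`, `DigestClaim` and `Hasher` are hypothetical names.

```typescript
// Caller-supplied hash function (e.g. sha2-256); injected to keep this sketch
// free of any particular crypto library.
type Hasher = (bytes: Uint8Array) => Uint8Array

// The parts of the proposed claim needed for verification. The payload CID is
// left out because the payload bytes are passed in directly.
interface DigestClaim {
  digest: Uint8Array
  offset?: number
  length?: number
}

// Hash the claimed byte range of the payload and compare with the digest.
function verifyDigest(claim: DigestClaim, payload: Uint8Array, hash: Hasher): boolean {
  const start = claim.offset ?? 0
  const end = claim.length === undefined ? payload.length : start + claim.length
  // Reject ranges that fall outside the payload.
  if (start < 0 || end < start || end > payload.length) return false
  const actual = hash(payload.subarray(start, end))
  if (actual.length !== claim.digest.length) return false
  for (let i = 0; i < actual.length; i++) {
    if (actual[i] !== claim.digest[i]) return false
  }
  return true
}
```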

We could possibly use an even more compact representation if we would like. Also, we don't care about the block CID because the codec and CID version are irrelevant.

We may want to optimize this to reduce the overhead of doing it on a per-block basis, especially signing them. But perhaps we could support bundles like:

type Multihash = ToString<Uint8Array>
type Payload = { source?: CID, offset?: number, length?: number }
type AssertDigest = Record<Multihash, Payload>

That way a bundle could still be partially valid, but we would be able to propagate only the parts that we validate.
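Propagating only the validated parts of a bundle could be as simple as the filter below. This is a sketch: the validator is supplied by the caller, CIDs and multihashes are plain strings, and `validEntries` and `DigestBundle` are hypothetical names.

```typescript
// Simplified versions of the bundle types above, with strings for CIDs and multihashes.
type Payload = { source?: string; offset?: number; length?: number }
type DigestBundle = Record<string, Payload>

// Keep only the entries that the caller-supplied validator accepts, so a
// partially valid bundle yields a smaller, fully validated bundle.
function validEntries(
  bundle: DigestBundle,
  isValid: (multihash: string, payload: Payload) => boolean
): DigestBundle {
  const out: DigestBundle = {}
  for (const multihash of Object.keys(bundle)) {
    const payload = bundle[multihash]
    if (payload !== undefined && isValid(multihash, payload)) out[multihash] = payload
  }
  return out
}
```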

DAG read by CID

I think that is more or less what assert/relation defines; however, it makes some details optional, and it seems like we could have claims that don't provide enough context to support a full read. I think something like assert/layout might be a more effective way for clients to enable (complete) DAG reads by CID:

type AssertLayout = {
  // Root node
  content: Link
  // Links across all the DAG nodes 
  edges: Record<Link, Link[]>
  // Reference to all the nodes required to build this DAG
  source: Record<ToString<Shard>, Record<ToString<Multihash>, {offset?: number, length?: number}>>
}
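One thing a layout claim like this would let us do is check completeness up front: walk the edges from the root and confirm every reachable node has a location in `source`. A sketch, assuming the same string identifies a node in both `edges` and `source` (the proposal keys one by Link and the other by multihash, so a real version would need to map between them); `missingNodes` is a hypothetical name.

```typescript
// Simplified AssertLayout with string keys for links, shards and multihashes.
interface Layout {
  content: string
  edges: Record<string, string[]>
  source: Record<string, Record<string, { offset?: number; length?: number }>>
}

// Return every node reachable from the root that no shard provides a location for.
// An empty result means the claim supports a complete DAG read.
function missingNodes(layout: Layout): string[] {
  const located = new Set<string>()
  for (const shard of Object.keys(layout.source)) {
    for (const node of Object.keys(layout.source[shard] ?? {})) located.add(node)
  }
  const seen = new Set<string>()
  const missing: string[] = []
  const stack: string[] = [layout.content]
  while (stack.length > 0) {
    const node = stack.pop() as string
    if (seen.has(node)) continue
    seen.add(node)
    if (!located.has(node)) missing.push(node)
    for (const child of layout.edges[node] ?? []) stack.push(child)
  }
  return missing
}
```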

We could probably do even better for UnixFS stuff without storing all the intermediary nodes as those could be assembled on the fly, but that probably would introduce enough complexity that it's best to leave out for now

Gozala commented 11 months ago

We could probably fold block read by CID and DAG read by CID into the same claim: if there are no edges in the DAG, it effectively becomes the block.

olizilla commented 11 months ago

I agree that this is a "we need to iterate on the claims" problem.

I don't want folks to have to send us a claim per block. For example, what if the existence of an inclusion claim (carCid, indexCid) expanded such that you could then query for any location in the index? The user has already provided block-level "this multihash is at this position in that car" assertions; we're just not exposing them... You can only reach them via already having the car cid. In this case we want to facilitate the reverse lookup.
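A sketch of that reverse lookup, assuming we can read the index behind each inclusion claim as a map from multihash to byte position. `buildReverseIndex` and the shapes here are hypothetical, not the actual index format.

```typescript
// Where a block lives: which CAR, and at what byte range within it.
type Position = { car: string; offset: number; length: number }

// Input: carCid -> (multihash -> position), i.e. the indexes behind inclusion claims.
type CarIndexes = Record<string, Record<string, { offset: number; length: number }>>

// Invert the per-CAR indexes so any multihash can be resolved without
// knowing the CAR cid first. A block may appear in more than one CAR.
function buildReverseIndex(indexes: CarIndexes): Map<string, Position[]> {
  const reverse = new Map<string, Position[]>()
  for (const car of Object.keys(indexes)) {
    const entries = indexes[car] ?? {}
    for (const multihash of Object.keys(entries)) {
      const entry = entries[multihash]
      if (entry === undefined) continue
      const hits = reverse.get(multihash) ?? []
      hits.push({ car, offset: entry.offset, length: entry.length })
      reverse.set(multihash, hits)
    }
  }
  return reverse
}
```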

olizilla commented 11 months ago

Note that we share batches of multihashes in a single IPNI advert, but we can query it by any multihash in the batch. By using the carCID as the ipni advert context id, we can use IPNI to map from any multihash to the CAR it's in, which is probably how I'm gonna keep bitswap working.

olizilla commented 10 months ago

@Gozala I like where these new claim shapes are going. For Block read by CID we can already make a location claim with a url and range that allows us to say how to fetch the bytes of a block. https://github.com/web3-storage/content-claims?tab=readme-ov-file#location-claim

e.g. if the content is a single block in a CAR, we can make a location claim that says "this block lives at that url at that byte range". Is there anything else we need there?
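For illustration, turning such a claim into an HTTP byte-range request might look like this. The `LocationClaim` shape is an assumption loosely based on the README's description of a url plus an optional range, and `rangeHeader` is a hypothetical helper.

```typescript
// Assumed shape of a location claim: content CID, one or more URLs, optional byte range.
interface LocationClaim {
  content: string
  location: string[]
  range?: { offset: number; length?: number }
}

// Build the value of an HTTP Range header for the claimed bytes.
// HTTP ranges are inclusive, so the last byte is offset + length - 1.
// Returns undefined when the claim covers the whole resource.
function rangeHeader(claim: LocationClaim): string | undefined {
  const range = claim.range
  if (!range) return undefined
  const end = range.length === undefined ? '' : String(range.offset + range.length - 1)
  return `bytes=${range.offset}-${end}`
}
```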

olizilla commented 10 months ago

"We could probably do even better for UnixFS stuff without storing all the intermediary nodes as those could be assembled on the fly, but that probably would introduce enough complexity that it's best to leave out for now"

This is an interesting direction. @alanshaw and I have been talking about "top of tree" claims a lot, where all the non-leaf blocks are captured and shared in one round trip for the reader. However, that still implies a use case where folks are doing block-level reads rather than just CARs as slabs of bytes. /musing

storacha / content-claims