Allow File Data Sources to update an entity

azf20 commented 2 years ago

Currently, different File Data Sources can create multiple entities with the same ID.

File Data Sources should instead be able to update an entity with the same ID. Note that this will not be an upsert, which is the current pattern that is used for chain-based data sources. Instead this should completely overwrite the prior entity.

Situation:

file data source A created in block 1
A is found, saves entity X
file data source B created in block 5
file data source C created in block 10
C is found, saves entity X again <-- this is the most recent
B is found, saves entity X again

Entity updates should follow a "most recent wins" approach, where the time is determined by block time, not handler execution time. This is a new pattern (entities might be created with closed block ranges), and ordering will need to be resolved within blocks as well as between blocks.
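To make the intended behaviour concrete, here is a minimal TypeScript sketch of "most recent wins" keyed on the block in which each file data source was created rather than on the order in which handlers ran. The `PendingWrite` type, the `orderInBlock` tie-breaker and `resolveMostRecent` are hypothetical names for illustration only; nothing here reflects how graph-node actually stores entities.

```typescript
// Hypothetical model: none of these names come from graph-node.
interface PendingWrite {
  entityId: string;
  creationBlock: number; // block in which the file data source was created
  orderInBlock: number;  // assumed tie-breaker for sources created in the same block
  payload: string;
}

// Process writes in whatever order the handlers happened to run, keeping for
// each entity only the write whose data source was created most recently.
function resolveMostRecent(writes: PendingWrite[]): Record<string, PendingWrite> {
  const winners: Record<string, PendingWrite> = {};
  for (const w of writes) {
    const current = winners[w.entityId];
    const isNewer =
      current === undefined ||
      w.creationBlock > current.creationBlock ||
      (w.creationBlock === current.creationBlock &&
        w.orderInBlock > current.orderInBlock);
    if (isNewer) {
      winners[w.entityId] = w;
    }
  }
  return winners;
}

// The situation above: files are found in the order A (block 1), C (block 10), B (block 5).
const found: PendingWrite[] = [
  { entityId: "X", creationBlock: 1, orderInBlock: 0, payload: "from A" },
  { entityId: "X", creationBlock: 10, orderInBlock: 0, payload: "from C" },
  { entityId: "X", creationBlock: 5, orderInBlock: 0, payload: "from B" },
];
const resolved = resolveMostRecent(found);
```

In this sketch the late arrival of B does not overwrite C's version of X, because B's data source was created in an earlier block.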

leoyvens commented 2 years ago

Considering the future requirement of time-travel queries of availability blocks, I now recall that we can't do most-recent-wins at indexing time: it requires quadratic space to represent in the DB, as has been demonstrated in past discussions.

To do most-recent-wins at query time, we need to allow conflicting entity versions to coexist in the DB. But I fear that it would not be possible to write efficient SQL to handle conflict resolution for collection queries.

So I'm becoming skeptical of generalized 'most-recent wins' conflict resolution. That brings us back to a solution I previously proposed that looks like (strawman syntax):

metadata: Metadata @derivedFrom(field: "project") @mostRecent

Where there can be multiple Metadata referring to the project, each with their own ID, but the field declares that it wants the most recent one.

azf20 commented 2 years ago

OK. I was thinking about this, and while it may be a bit confusing from a user perspective, there is a robust way to generate the distinct IDs, based on the CID, i.e.

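import { Bytes, dataSource, json } from "@graphprotocol/graph-ts"
import { Metadata } from "../generated/schema" // path assumed; adjust to your project layout

export function handleProjectMetadata(content: Bytes): void {
  // the CID of the file is read from the data source address
  const cid = String.UTF8.decode(dataSource.address().buffer)
  const metaPtrData = json.fromBytes(content)
  const _projects = metaPtrData.toArray();
  for (let i = 0; i < _projects.length; i++) {

    // construct projectId, skipping entries without an id
    const _project = _projects[i].toObject();
    const _id =  _project.get("id")
    if (!_id) continue;
    const projectId = _id.toString().toLowerCase();

    // prefix the entity ID with the CID so that each file gets its own Metadata entity
    const metadata = new Metadata(cid + "-" + projectId)
    metadata.project = projectId
    metadata.save()
  }
}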
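The ID scheme in the handler above can be checked in isolation. The following plain TypeScript sketch uses a hypothetical `metadataId` helper and made-up CID strings; it only illustrates that two files describing the same project produce two distinct entity IDs, so neither overwrites the other.

```typescript
// Hypothetical helper mirroring the `cid + "-" + projectId` expression above.
function metadataId(cid: string, projectId: string): string {
  return cid + "-" + projectId.toLowerCase();
}

// Two made-up CIDs that both describe the same project.
const idFromFileA = metadataId("QmExampleA", "0xABC");
const idFromFileB = metadataId("QmExampleB", "0xABC");
```

Both entities still point at the same project through their `project` field, which is what a derived field on the project would later select from.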

To discuss syntax:

metadata: Metadata @derivedFrom(field: "project") @mostRecent
metadata: Metadata @derivedFrom(field: "project") @lastUpdate
metadata: Metadata @derivedFrom(field: "project", selectBy: "mostRecent")

Other questions: will this decorator only be available for file data source entities?

I would be keen to unpack the trade-offs here though, as we don't currently have the availability chain: how simple will this change be for the query layer, with all its permutations (interfaces, derived fields, unions, etc.)? I think (?) the introduction of an availability chain would mean removal of either workaround (indexing time or query time), so I have a preference for whichever is simpler (for users, and to implement and later update).

leoyvens commented 2 years ago

This directive would essentially apply a sort order and take the first. In principle it could apply to any derived single-entity field, but I haven't analyzed the implications if the field type is an interface.

What seems complicated to me about the SQL queries for collection fields is the interaction with first and skip. But this is beyond my SQL-fu; we'd need @lutter's opinion to determine what is feasible.

A directive that is only supported by derived single-entity fields punts on the question of collection fields, which is convenient since those are not relevant to the use case at hand. The directive being at the field granularity avoids incurring any performance costs to unrelated queries.

This query-time solution would not change with the introduction of the availability chain; it would only change as much as any other query. An indexing-time solution would probably need to assume a total order between the availability chain and the data chain, but we're trying to avoid answering that question at this point.

azf20 commented 2 years ago

> A directive that is only supported by derived single-entity fields punts on the question of collection fields

I think this is the right approach. Given that approach, I don't think we need to worry about first and skip?

leoyvens commented 2 years ago

Exactly, that is one of the goals. For derived single-entity fields, it seems "obviously possible" to implement because we can implement it as a collection query with first: 1 and the chosen sort order.

On the sort order, I'm thinking it would be order by lower(block_range) desc, causality_region desc, id asc. The id is there to guarantee uniqueness, since there can be entities created in the same file handler or in the same on-chain block. And we don't currently have a way to know which entity versions were created first within the same block.
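As a rough illustration of that sort order, here is a TypeScript sketch that orders candidate versions by block-range start descending, causality region descending, then id ascending, and takes the first. The `EntityVersion` shape and the function names are assumptions for this example; the real implementation would be a SQL query, not in-memory sorting.

```typescript
// Hypothetical in-memory stand-in for rows of an entity table.
interface EntityVersion {
  id: string;
  blockRangeStart: number; // stands in for lower(block_range)
  causalityRegion: number;
}

// order by lower(block_range) desc, causality_region desc, id asc
function compareMostRecent(a: EntityVersion, b: EntityVersion): number {
  if (a.blockRangeStart !== b.blockRangeStart) {
    return b.blockRangeStart - a.blockRangeStart;
  }
  if (a.causalityRegion !== b.causalityRegion) {
    return b.causalityRegion - a.causalityRegion;
  }
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

// Equivalent of a collection query with first: 1 and the sort order above.
function pickMostRecent(versions: EntityVersion[]): EntityVersion | undefined {
  return versions.slice().sort(compareMostRecent)[0];
}

const candidates: EntityVersion[] = [
  { id: "cidB-project1", blockRangeStart: 5, causalityRegion: 2 },
  { id: "cidC-project1", blockRangeStart: 10, causalityRegion: 3 },
  { id: "cidA-project1", blockRangeStart: 10, causalityRegion: 3 },
];
const chosen = pickMostRecent(candidates);
```

The last two candidates tie on block and causality region, so the id decides between them. That choice is deterministic, but it says nothing about which of the two was actually written first.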

azf20 commented 2 years ago

> And we don't currently have a way to know which entity versions were created first within the same block.

Is this still the case with order by lower(block_range) desc, causality_region desc, id asc?

leoyvens commented 2 years ago

Yes, that would order entities within the same block and causality region by id, which might not match the insertion order.

amrap030 commented 1 year ago

Hello everyone, I am not sure if this is the right issue. I am trying to use file data sources to store token metadata, but I am unable to load an entity from the store. I would like to do the following in a data source handler after a token is minted and the corresponding entity has been created:

export function handleMetadata(content: Bytes): void {
  const value = content.toString();
  let context = dataSource.context();
  let address = context.getString("address");
  let token = Token.load(address);

  if (token) {
    token.metadata = value;
    token.save();
  }
}

Unfortunately this doesn't work. The address variable is set correctly and the value variable correctly contains the IPFS content, but the loaded token is null.

I really need this feature for my master's thesis; otherwise I have to look for a workaround. So is this the right issue to follow for the current status of the implementation?

graphprotocol / graph-node

Allow File Data Sources to update an entity #4087