graphprotocol / graph-node

Graph Node indexes data from blockchains such as Ethereum and serves it over GraphQL
https://thegraph.com

Large arrays take a long time to copy during grafting #3515

Open azf20 opened 2 years ago

azf20 commented 2 years ago

We really need to do something about large arrays - there are some grafts in shard_i involving subgraphs with large arrays where copying a single batch of data took over an hour, when each batch should really take only about 5 minutes. This PR tries to mitigate that, but a lot of the data we store comes from large arrays, and we need to do something better there.

azf20 commented 2 years ago

@lutter what kind of thing do you have in mind here, changing the way in which arrays are stored in the database?

lutter commented 2 years ago

Yes, ultimately we need to avoid storing large arrays because they interact very badly with how we handle versioning/time-travel by storing the entire entity for each version. There are two approaches I can think of:

(1) Refuse to store arrays that are larger than n entries, i.e., fail the subgraph when that happens.
(2) Behind the scenes, store array entries in their own table and change the query logic to reassemble the array when an entity is queried.

For an entity

type Account {
  id: ID!
  txns: [Transaction!]!
}

we would have an additional account_transaction table with columns (account_id, txn_id, block_range), where the block_range says 'in this range, that txn was part of that account's Account.txns'.
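
As a rough SQL sketch (illustrative only, not an existing graph-node schema; the column types here are assumptions, with int4range chosen because graph-node already uses it for entity block_range columns):

CREATE TABLE account_transaction (
    account_id  text      NOT NULL,
    txn_id      text      NOT NULL,
    block_range int4range NOT NULL  -- blocks in which txn_id is in this account's Account.txns
);
CREATE INDEX ON account_transaction (account_id);
CREATE INDEX ON account_transaction USING gist (block_range);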

Some questions around that:

leoyvens commented 2 years ago

> in this range, that txn was part of that account's Account.txns

That's a brilliant idea. I don't think this would cause problems for entities dealing with small arrays; 5 entries is not that small, really.

The complication with arrays is that they preserve insertion order, so we'd need to store the index of the entry and update those indexes when the array is mutated. Or represent the ordering as a linked list. At which point we'd be tempted to come up with clever diffing algorithms to optimize for different array update patterns.
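
Purely to illustrate that cost, here is a hypothetical sketch on top of the account_transaction table above, with an assumed idx position column that does not exist anywhere in graph-node today: inserting into the middle of one account's array forces a rewrite of every later entry's index.

ALTER TABLE account_transaction ADD COLUMN idx int NOT NULL DEFAULT 0;

-- Inserting a txn at position 3 of one account's array shifts every later
-- entry, an O(n) rewrite per mutation; under the versioning scheme each
-- shifted row would really mean closing its block_range and writing a new
-- row version, which is more expensive still.
UPDATE account_transaction
   SET idx = idx + 1
 WHERE account_id = 'some-account' AND idx >= 3;
INSERT INTO account_transaction (account_id, txn_id, idx, block_range)
VALUES ('some-account', 'some-txn', 3, int4range(15000000, null));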

But many use cases do not care about the ordering and could represent their arrays as sets, which can be nicely represented with the relational table. So it seems best to introduce set types, which could have different flavours such as grow-only sets, and use that optimized representation for them.

lutter commented 2 years ago

> The complication with arrays is that they preserve insertion order, so we'd need to store the index of the entry and update those indexes when the array is mutated.

If the mapping table has a vid, we can just sort by vid, since that will return entries in insertion order.
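
A minimal sketch of that, assuming the mapping table also gets the bigserial vid column graph-node already puts on entity tables; since vid is assigned monotonically, ORDER BY vid recovers insertion order for push-only arrays with no extra bookkeeping:

ALTER TABLE account_transaction ADD COLUMN vid bigserial;

SELECT txn_id
  FROM account_transaction
 WHERE account_id = 'some-account'
   AND block_range @> 15000000  -- only versions live at the queried block
 ORDER BY vid;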

leoyvens commented 2 years ago

@lutter That is efficient in the case of push-only arrays. If we try to use that for arrays that can be arbitrarily modified then it becomes complicated.

azf20 commented 2 years ago

Interesting. @lutter would this have an additional beneficial effect for queries in general? (I think large arrays have caused us problems with query nodes in the past.)

github-actions[bot] commented 1 year ago

Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.