azf20 opened 2 years ago
@lutter what kind of thing do you have in mind here, changing the way in which arrays are stored in the database?
Yes, ultimately we need to avoid storing large arrays because they interact very badly with how we handle versioning/time-travel by storing the entire entity for each version. There are two approaches I can think of:

1. Refuse to store arrays that are larger than n entries, i.e., fail the subgraph when that happens
2. Behind the scenes, store array entries in their own table and change the query logic to reassemble the array when an entity is queried
For an entity

```graphql
type Account {
  id: ID!
  txns: [Transaction!]!
}
```

we would have an additional `account_transaction` table with columns `(account_id, txn_id, block_range)`, where the `block_range` says 'in this range, that txn was part of that account's `Account.txns`'.
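A minimal sketch of that layout, using SQLite in place of Postgres and hypothetical table/column names. Since SQLite has no range type, `block_range` is modeled as a half-open `[block_from, block_to)` pair, with a NULL upper bound meaning "still current":

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE account_transaction (
    account_id TEXT NOT NULL,
    txn_id     TEXT NOT NULL,
    -- block_range modeled as [block_from, block_to); NULL upper bound = still current
    block_from INTEGER NOT NULL,
    block_to   INTEGER
);
""")

# txn t1 was part of a1's txns from block 10 to block 50; t2 is still current
con.executemany(
    "INSERT INTO account_transaction VALUES (?, ?, ?, ?)",
    [("a1", "t1", 10, 50), ("a1", "t2", 20, None)],
)

def txns_at(account_id, block):
    """Reassemble Account.txns as of `block` from the mapping table."""
    rows = con.execute(
        """SELECT txn_id FROM account_transaction
           WHERE account_id = ?
             AND block_from <= ?
             AND (block_to IS NULL OR ? < block_to)""",
        (account_id, block, block),
    )
    return [r[0] for r in rows]

print(sorted(txns_at("a1", 30)))  # ['t1', 't2'] -- both ranges cover block 30
print(sorted(txns_at("a1", 60)))  # ['t2'] -- t1's range closed at block 50
```

The point of the extra table is that mutating the array only touches the affected mapping rows, instead of rewriting a copy of the whole entity (array included) for every version.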
Some questions around that:
> in this range, that txn was part of that account's `Account.txns`
That's a brilliant idea. I don't think this would cause problems for entities dealing with small arrays; 5 entries is not that small, really.
The complication with arrays is that they preserve insertion order, so we'd need to store the index of the entry and update those indexes when the array is mutated. Or represent the ordering as a linked list. At which point we'd be tempted to come up with clever diffing algorithms to optimize for different array update patterns.
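To illustrate the cost of preserving order, here is a small in-memory sketch (hypothetical `idx` column and `insert_at` helper, not actual graph-node code): each mapping row carries the entry's position, and a single mid-array insert forces every following row's index to be rewritten, which under time-travel storage means versioning each shifted row as well:

```python
# (txn_id, idx) rows for one account's Account.txns, one version
rows = [("t1", 0), ("t2", 1), ("t3", 2)]

def insert_at(rows, txn_id, pos):
    """Insert an entry mid-array. Every row at or after `pos` gets a new idx;
    with block-range versioning, each of those rows would need its old version
    closed and a new one written."""
    shifted = [(t, i + 1 if i >= pos else i) for (t, i) in rows]
    shifted.append((txn_id, pos))
    shifted.sort(key=lambda r: r[1])
    return shifted

print(insert_at(rows, "t9", 1))
# [('t1', 0), ('t9', 1), ('t2', 2), ('t3', 3)] -- t2 and t3 were touched too
```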
But many use cases do not care about the ordering and could represent their arrays as sets, which can be nicely represented with the relational table. So it seems best to introduce set types, which could have different flavours such as grow-only sets, and use that optimized representation for them.
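A set-typed field sidesteps the ordering problem entirely: the mapping table records membership and nothing else, so there is no index column to maintain. A sketch of grow-only-set behaviour (hypothetical table name, SQLite standing in for Postgres):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE account_txn_set (
    account_id TEXT NOT NULL,
    txn_id     TEXT NOT NULL,
    PRIMARY KEY (account_id, txn_id)  -- set semantics: no duplicates, no order
)""")

# INSERT OR IGNORE gives grow-only behaviour: re-adding a member is a no-op,
# so no existing row ever has to be rewritten or re-versioned
for t in ["t1", "t2", "t1"]:
    con.execute("INSERT OR IGNORE INTO account_txn_set VALUES (?, ?)", ("a1", t))

members = sorted(r[0] for r in con.execute(
    "SELECT txn_id FROM account_txn_set WHERE account_id = 'a1'"))
print(members)  # ['t1', 't2'] -- the duplicate t1 collapsed
```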
> The complication with arrays is that they preserve insertion order, so we'd need to store the index of the entry and update those indexes when the array is mutated.
If the mapping table has a `vid`, we can just sort by `vid`, since that will return entries in insertion order.
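A sketch of that idea, with SQLite's `AUTOINCREMENT` rowid standing in for graph-node's `vid` (table and column names are illustrative): because rows are only ever appended, ordering by `vid` recovers insertion order without any explicit index column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE account_transaction (
    vid        INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for graph-node's vid
    account_id TEXT NOT NULL,
    txn_id     TEXT NOT NULL
)""")

for t in ["t3", "t1", "t2"]:  # pushed in this order
    con.execute(
        "INSERT INTO account_transaction (account_id, txn_id) VALUES (?, ?)",
        ("a1", t))

in_order = [r[0] for r in con.execute(
    "SELECT txn_id FROM account_transaction WHERE account_id = 'a1' ORDER BY vid")]
print(in_order)  # ['t3', 't1', 't2'] -- insertion order, not value order
```

Note this only works while the array is append-only; a mid-array insert or reorder cannot be expressed by `vid` alone, which is the limitation raised in the next comment.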
@lutter That is efficient in the case of push-only arrays. If we try to use that for arrays that can be arbitrarily modified, then it becomes complicated.
Interesting. @lutter, would this have an additional beneficial effect for queries in general? (I think large arrays have caused us problems with query nodes in the past.)
Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.
We really need to do something about large arrays - there are some grafts in shard_i involving subgraphs with large arrays where a batch to copy data took over an hour when each batch should really only take 5 minutes. This PR tries to mitigate that, but a lot of the data we store comes from large arrays, and we need to do something better there.