mediachain / concat

Mediachain daemons
MIT License

Datastore Garbage Collection #64

Closed vyzo closed 7 years ago

vyzo commented 7 years ago

We currently have no mechanism for deleting objects from the datastore; this is partly by design, as we want the datastore to behave as an immutable append-only store, and partly by implementation constraints as tracking objects in the datastore is expensive (storage cost, performance effects on the ingestion critical path).

Nonetheless, we do need a mechanism for pruning the local datastore, for example when an operator decides to unmerge some large dataset.

Here are two possible approaches:

parkan commented 7 years ago

I really like approach 2, though another downside is that it requires a complete write lock for the full run (or else a very clever way of handling the implications of concurrent writes) and cannot be done piecemeal.

vyzo commented 7 years ago

There is also the direct approach of iterating through the datastore itself, but this depends on how OptimizeForPointLookup affects iterators.

It's not clear whether iterators fail entirely, return keys out of order, or simply perform slowly. But if we do have working iterators (even with out-of-order traversal), we can use a third approach: go through the statements in the db to collect all referenced objects, then iterate through the datastore to find all unreferenced objects and delete them (it's unclear whether deletion is possible while iterating), followed by datastore compaction.
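The third approach above is essentially a mark-and-sweep pass. A minimal sketch, using an in-memory map to stand in for the RocksDB datastore (the `gcDatastore` function and its types are hypothetical, not concat's actual API):

```go
package main

import "fmt"

// gcDatastore sketches the mark-and-sweep idea: given the set of object
// keys referenced by statements (the mark phase, assumed already computed
// by scanning the statement db), sweep the datastore and delete every
// unreferenced key. Returns the deleted keys.
func gcDatastore(datastore map[string][]byte, referenced map[string]bool) []string {
	// Collect unreferenced keys first, then delete, sidestepping the
	// delete-while-iterating question raised above.
	var unreferenced []string
	for key := range datastore {
		if !referenced[key] {
			unreferenced = append(unreferenced, key)
		}
	}
	for _, key := range unreferenced {
		delete(datastore, key)
	}
	return unreferenced
}

func main() {
	ds := map[string][]byte{
		"QmAAA": []byte("object a"),
		"QmBBB": []byte("object b"),
		"QmCCC": []byte("object c"),
	}
	// Mark set: in the real system this would be built by iterating
	// the statements in the db and collecting their object references.
	refs := map[string]bool{"QmAAA": true, "QmCCC": true}

	deleted := gcDatastore(ds, refs)
	fmt.Println("deleted:", deleted)
	fmt.Println("remaining objects:", len(ds))
}
```

Note the sketch buffers unreferenced keys and deletes after the scan; against RocksDB the equivalent would be batching deletes and issuing them after (or outside) the iterator, followed by a manual compaction to reclaim space.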

parkan commented 7 years ago

@vyzo I think this is basically the approach that the IPFS GC process takes