ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

better gc #3679

Open whyrusleeping opened 7 years ago

whyrusleeping commented 7 years ago

I have a large enough ipfs repo (over 300 million blocks) that it takes a very large amount of memory to perform a gc; computing the marked set is expensive.

I'm thinking that using something like a bloom filter could make this process use much less memory, at the expense of not cleaning out every block. The difficulty here is that false positives while enumerating the set of pinned objects could result in drastically lowered performance (we could accidentally think a block that points to everything is pinned and end up cleaning out nothing), so selecting parameters to avoid this is important.
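A minimal sketch of what this could look like (the hand-rolled filter and the flat CID slices are illustrative only; real code would stream CIDs from the pinner and blockstore rather than hold them in memory). The important property is that a false positive can only retain garbage; it can never delete a live block:

```go
package gcsketch

import "hash/fnv"

// bloom is a simple double-hashed bloom filter.
type bloom struct {
	bits []uint64
	m, k uint64 // number of bits, number of hash probes
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two hash values from the key; probe i uses h1 + i*h2.
func (b *bloom) hashes(key []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(key)
	h1 := h.Sum64()
	h.Write(key)        // hash key||key to derive a second value
	h2 := h.Sum64() | 1 // odd stride so probes spread over all bits
	return h1, h2
}

func (b *bloom) Add(key []byte) {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % b.m
		b.bits[bit/64] |= 1 << (bit % 64)
	}
}

func (b *bloom) Has(key []byte) bool {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % b.m
		if b.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false
		}
	}
	return true
}

// markAndSweep deletes every block whose CID is not (probably) in the
// pinned set. False positives keep some garbage around; they never
// cause a pinned block to be removed.
func markAndSweep(pinned, all [][]byte, deleteBlock func([]byte)) {
	f := newBloom(1<<32, 7) // 512 MiB of bits, sized for a few hundred million CIDs
	for _, cid := range pinned {
		f.Add(cid)
	}
	for _, cid := range all {
		if !f.Has(cid) {
			deleteBlock(cid)
		}
	}
}
```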

Another (potentially more complicated but more accurate) option is to use a disk-backed prefix tree to store the enumerated pinset (with heavy caching up to some memory limit to keep performance tolerable). This just offloads the memory cost of storing the sets to disk, which is generally acceptable, but it would prevent people from running a GC when their disk is full, which is generally considered a bad thing.

I'm also interested in strategies that can do "a little gc". Something that allows us to quickly free a smaller subset of the blocks, without the overhead of performing an entire gc scan. Implementing something that lets us do this may require a rethinking of how pinsets and objects are stored.

Kubuxu commented 7 years ago

I think a stored, re-scalable bloom filter could be a good way to go. We already use a bloom filter for Has request filtering. Both GC and Has caching would benefit from it.

kevina commented 7 years ago

Another disk-based alternative is to use two sorted lists. First write out a list of the colored set, then sort it (duplicates can be removed during the sorting phase). Then create another list of all keys that are candidates for removal and sort it. Then simply compare the two lists. The second list is not strictly needed, as a binary search can be performed on the colored-set list, but it will help with performance since the keys are currently not guaranteed to be returned in any order.
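A sketch of the comparison step, sorting in memory for brevity (a repo-scale version would use an external merge sort over on-disk files, but the merge logic is the same either way):

```go
package gcsketch

import (
	"bytes"
	"sort"
)

// diffSorted returns every candidate key that does not appear in the
// colored (reachable) list, using one merge-style pass over two sorted
// lists. Duplicates in the colored list are skipped implicitly.
func diffSorted(colored, candidates [][]byte) [][]byte {
	less := func(s [][]byte) func(i, j int) bool {
		return func(i, j int) bool { return bytes.Compare(s[i], s[j]) < 0 }
	}
	sort.Slice(colored, less(colored))
	sort.Slice(candidates, less(candidates))

	var garbage [][]byte
	i := 0
	for _, c := range candidates {
		// advance the colored cursor past keys smaller than c
		for i < len(colored) && bytes.Compare(colored[i], c) < 0 {
			i++
		}
		if i == len(colored) || !bytes.Equal(colored[i], c) {
			garbage = append(garbage, c) // not colored: safe to remove
		}
	}
	return garbage
}
```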

As just a way to reduce memory: forgo any set data structures and just use a list as before, but keep it in memory. A sorted list is bound to be more memory efficient than any deterministic set data structure.

By not using a set for the colored list, there is a risk of doing duplicate work (as you can't instantly check whether a node has already been visited). This may or may not be a problem.

Just an idea.

whyrusleeping commented 7 years ago

> there is a risk of doing duplicate work (as you can't instantly check if a node has already been visited).

In practice, this is a huge cost. Very frequently we have situations where there are 100 or so nodes that point to the same large graphs (for example, the go-ipfs source directory in my repo, which I add to ipfs very frequently).

whyrusleeping commented 7 years ago

Though I do like the idea of pushing the colored set to disk as a sorted list (alternatively, a binary search tree, if you want to do heap-style indexing).

whyrusleeping commented 7 years ago

@lgierth you were wanting to discuss better garbage collection strategies. Want to weigh in here?

whyrusleeping commented 7 years ago

I think it's worth adding an on-disk bloom filter; we can make it quite large so the false positive rate is low enough, and just keep it up to date as new pins are added.

I would also like to have an LRU cache of recently used objects so we can avoid garbage collecting things we might be working with but haven't pinned yet. It could even be time-based (don't gc any objects created in the past hour?).
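A sketch of the time-based variant; addedAt is a hypothetical accessor, since the blockstore does not record creation times today and would need a small timestamp index maintained on Put:

```go
package gcsketch

import "time"

// sweepWithGrace removes only unreferenced blocks older than the grace
// period, so recently fetched but not-yet-pinned content survives.
func sweepWithGrace(unreferenced [][]byte, addedAt func(cid []byte) time.Time,
	deleteBlock func([]byte), grace time.Duration) {
	cutoff := time.Now().Add(-grace)
	for _, cid := range unreferenced {
		// anything newer than the cutoff is left for a later run
		if addedAt(cid).Before(cutoff) {
			deleteBlock(cid)
		}
	}
}
```

Calling this with `grace = time.Hour` would implement the "don't gc any objects created in the past hour" idea directly.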

ghost commented 6 years ago

Dumping this tweet thread here https://twitter.com/TimSweeneyEpic/status/1019008128933924869

@TimSweeneyEpic

Thinking about garbage collection: is there any graph theory analog to the Fourier Transform? A new space derived from the objects-and-references space which enables asymptotically more efficient determination or update of liveness? Bloom filter mojo?

@F_Vaggi

To the first question: yes. The reference is here: https://ieeexplore.ieee.org/document/6494675/ There are also tricks for computing the Fourier transform on graphs faster.

@pragmatism

Unless your graph is time variant, you can express it as a state space filter: https://ccrma.stanford.edu/~jos/fp/State_Space_Filters.html Old school control sys folks use them a lot. Once you get your graph in a state space representation you can gauge its frequency response

@stephenjudkins

I've wondered about this. I suspect non-deterministic GC could be a fruitful research program; it's really not a problem if a certain % of the space is wrongly marked as live at any given point, as long as the chance converges to 0 eventually

@Elrohir_Ancalin

The FT is the SVD of a Toeplitz matrix. Maybe if your incidence matrices had a specific structure there would be a transform for it (wavelets?). But I don't think there can be a general form for "all graphs" since the set of all arbitrary matrices includes the uncompressible ones

@NicTVG

My take away: In terms of asset management, LRU is simple but fails for periodic events like skills on a long cooldown which must be loaded ahead of time. I can see a direct use of DFT to build a prediction over time instead of custom code to keep important assets in memory.

dbaarda commented 6 years ago

A quick note on alternatives to LRU for asset caching: I came up with something for a similar application:

http://minkirri.apana.org.au/wiki/DecayingLFUCacheExpiry

It should be only a tiny bit more complicated than LRU, but should handle scans better.
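For a rough idea of the shape of such a scheme, here is a minimal decayed-count LFU in Go (a generic sketch, not necessarily the exact algorithm from the linked page). A one-off scan gives each key a count of 1, so unlike with LRU it cannot push out genuinely hot entries:

```go
package gcsketch

// decayLFU is an LFU cache whose counts decay over time, so old
// popularity fades instead of pinning entries forever.
type decayLFU struct {
	count map[string]uint32
	limit int
}

func newDecayLFU(limit int) *decayLFU {
	return &decayLFU{count: make(map[string]uint32), limit: limit}
}

// Touch records a use of key, evicting the coldest entry when full.
func (c *decayLFU) Touch(key string) {
	if _, ok := c.count[key]; !ok && len(c.count) >= c.limit {
		c.evictColdest()
	}
	c.count[key]++
}

// Decay halves every count; call it on a timer (deleting keys and
// updating values during a map range is safe in Go).
func (c *decayLFU) Decay() {
	for k, n := range c.count {
		if n >>= 1; n == 0 {
			delete(c.count, k)
		} else {
			c.count[k] = n
		}
	}
}

func (c *decayLFU) evictColdest() {
	var victim string
	min := uint32(1<<32 - 1)
	for k, n := range c.count {
		if n < min {
			victim, min = k, n
		}
	}
	delete(c.count, victim)
}
```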

ianopolous commented 5 years ago

It is probably very useful to study the current state of the art in GC algorithms, which are all on the JVM. They can handle ridiculously large working sets (many TiBs) with fully concurrent/pauseless operation in fixed overhead, among other great features. The top 3 would be

Kubuxu commented 5 years ago

@ianopolous The problem is that all/most collectors for Java are moving collectors, which is much harder to do when your dataset is on disk.

ianopolous commented 5 years ago

@Kubuxu Yep, I'm aware of that, but you could easily do that with multiple datastores if needed. Though I don't think that is critical to many of their ideas. E.g. ZGC is a single-generation GC, so the only moving there is compaction. Naively, I would guess that you could just no-op that part and leave the rest of the algorithm mostly unchanged.

However, you could also use a datastore that was trivially a single large file, with a mapping from Cid to offset, and then you could use one of their algorithms "unchanged". I'm not a GC expert, but I'm sure there are useful ideas here in the state of the art.
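A toy illustration of that layout and the resulting "moving" step (fileStore, its in-memory index, and the read callback are all assumptions for the sketch, not kubo's actual datastore API):

```go
package gcsketch

import "os"

// fileStore is a toy single-file datastore with a CID-to-offset index;
// a real index would itself live on disk.
type fileStore struct {
	f      *os.File
	offset map[string]int64 // CID -> byte offset of the block's record
}

// Compact copies every live block into a fresh file, rebuilds the
// offset index, and swaps the new file into place: the same "moving"
// step a JVM-style compacting collector performs in memory.
func (s *fileStore) Compact(live []string, read func(cid string) []byte) error {
	tmp, err := os.Create(s.f.Name() + ".compact")
	if err != nil {
		return err
	}
	newOffset := make(map[string]int64, len(live))
	var end int64
	for _, cid := range live {
		data := read(cid) // reads via the old offset table
		if _, err := tmp.Write(data); err != nil {
			tmp.Close()
			return err
		}
		newOffset[cid] = end
		end += int64(len(data))
	}
	// rename is atomic on POSIX; crash-safety details omitted here
	if err := os.Rename(tmp.Name(), s.f.Name()); err != nil {
		tmp.Close()
		return err
	}
	s.f.Close()
	s.f, s.offset = tmp, newOffset
	return nil
}
```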

Stebalien commented 5 years ago

Programming language GC systems aren't really the correct target. We should look at content-addressed/deduplicating filesystems (almost all of which use reference counting).
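The refcounting approach looks roughly like this in Go (links is a hypothetical accessor returning a block's child CIDs; merkle DAGs are acyclic, so the cascading decrement always terminates):

```go
package gcsketch

// refCounter sketches the reference counting used by many
// deduplicating filesystems: each block's count is the number of live
// parents (or pins) that point at it, and deletion cascades when a
// count reaches zero. There is no global mark phase at all.
type refCounter struct {
	refs  map[string]int
	links func(cid string) []string // child CIDs of a block
}

// AddRef is called when a pin or parent block starts referencing cid.
func (r *refCounter) AddRef(cid string) {
	r.refs[cid]++
	if r.refs[cid] == 1 {
		// block just became live, so its links now count too
		for _, child := range r.links(cid) {
			r.AddRef(child)
		}
	}
}

// DropRef is called on unpin (or when a parent dies); a block is
// reclaimed immediately once its count hits zero.
func (r *refCounter) DropRef(cid string, deleteBlock func(string)) {
	r.refs[cid]--
	if r.refs[cid] == 0 {
		delete(r.refs, cid)
		for _, child := range r.links(cid) {
			r.DropRef(child, deleteBlock)
		}
		deleteBlock(cid)
	}
}
```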

ianopolous commented 5 years ago

@Stebalien Do any of those filesystems have the concept of deep graphs of links between parts of files, like ipfs's ipld? More than just a single layer of, say, metadata linking to data fragments (I don't know)? If not, then the topology of a managed language's object graph would seem more similar to me, and chasing that graph is the main task of a GC. The main difference I can see is the absence of circular graphs in a merkle tree. Regardless, I just wanted to point to a potential source of good ideas.

Stebalien commented 5 years ago

I've heard of at least one that uses merkle-links for directories (so directory copies are free). But the one I'm thinking of is a networked filesystem. We'll probably want to look at network-based filesystem papers for inspiration.

My concern with programming languages is that:

  1. They need to handle cycles and that tends to introduce a bunch of complexity for a feature we don't need.
  2. Walking all pointers in memory occasionally is expensive but not that expensive. In our case, traversing pinned content could take days.