ipfs / go-graphsync

Initial Implementation Of GraphSync Wire Protocol

Debug graphsync responder memory retention issues #256

Closed hannahhoward closed 2 years ago

hannahhoward commented 2 years ago

We are seeing ongoing issues with Estuary retaining block memory when responding to go-graphsync requests. Attached pprof profiles:

pprof.estuary-shuttle.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz
pprof.estuary-shuttle.alloc_objects.alloc_space.inuse_objects.inuse_space.003.pb.gz

The hot path goes through queryexecutor.go and runtraversal.go, but it's not clear why the blocks that are loaded are retained.

After the blocks are loaded, there are two paths they go on:

  1. They get sent back to go-ipld-prime's selector traversal code as the data returned on link load. https://github.com/ipfs/go-graphsync/blob/main/responsemanager/runtraversal/runtraversal.go#L69
  2. They get sent to the ResponseAssembler and then the MessageQueue to go over the wire. https://github.com/ipfs/go-graphsync/blob/main/responsemanager/runtraversal/runtraversal.go#L77 https://github.com/ipfs/go-graphsync/blob/main/responsemanager/queryexecutor.go#L97 (sidebar: RunTraversal is getting to be a somewhat bizarre and unnecessary abstraction, and I wonder if we should just put it back in the query executor)

We have fairly extensive code intended to apply back pressure to memory usage during traversal on the path through the MessageQueue. The Allocator SHOULD block the second code path, which is synchronous, preventing more blocks from being loaded off disk until the messages go over the wire, at which point the block memory SHOULD be able to be freed in a GC cycle.

So far, most of my efforts have focused on the second code path: verifying that the allocator is blocking the traversal properly, and that block memory is in fact being freed once it is sent over the wire. So far I have been unable to reproduce memory issues on this code path similar to those witnessed in Estuary. Graphsync has an extensive testground testplan with lots of parameters, and you can see my experiments in https://github.com/ipfs/go-graphsync/tree/feat/estuary-memory-debugging

I have not explored the first code path because we haven't witnessed issues in it up to this point, but I think it is worth examining.

I am not sure what the best next steps are, but I think debugging this particular issue is top priority.

rvagg commented 2 years ago

(sidebar: RunTraversal is getting to be a somewhat bizarre and unnecessary abstraction, and I wonder if we should just put it back in the query executor)

That's what I've been thinking while looking through this, but merging them back together as they are will make testing harder, so it'll need to be done in a way that doesn't increase the difficulty. I'll include this as an option in what I'm working on now with the query executor.

hannahhoward commented 2 years ago

One thing to add to the testing mix: we understand that NFT.storage and other sites use https://github.com/filecoin-project/go-dagaggregator-unixfs/tree/2b317c005fd561156cac93924ffeb31c58fdb966 to assemble their DAGs. We could add an option to the testplan to assemble a DAG using this code and see what we learn.

hannahhoward commented 2 years ago

resolved in #268