Closed jacobheun closed 1 year ago
Summary
This is a patch on top of 0.13.2. While debugging network disconnect issues in Boost during retrieval, we discovered a goroutine leak in graphsync. The cause is that after a hard network disconnect, the MessageQueue may still be restarted if there are messages the server is still attempting to send.
This change does not fully fix the issue, but across multiple runs of forcefully disconnecting 500 retrievals, we saw a 10x reduction in open goroutines. More robust handling of hard network disconnects may be warranted in the 0.14.x line, but that is likely a nontrivial effort. This gets us most of the way.
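To illustrate the class of bug described above, here is a minimal sketch of a queue that refuses to restart its sender goroutine once it has been shut down. The `queue` type and its `Enqueue`, `run`, and `Shutdown` methods are hypothetical and are not go-graphsync's MessageQueue or the actual change in this PR; the sketch only shows the general guard, assuming a single sender goroutine per peer.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical stand-in for a per-peer message queue.
// It is NOT go-graphsync's MessageQueue.
type queue struct {
	mu       sync.Mutex
	pending  []string
	running  bool // true while a sender goroutine is active
	shutdown bool // true once the peer is considered gone
	wg       sync.WaitGroup
}

// Enqueue adds a message and starts the sender goroutine if none is running.
// After Shutdown it returns false and starts nothing, so pending work can
// no longer bring the sender back to life.
func (q *queue) Enqueue(msg string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.shutdown {
		return false
	}
	q.pending = append(q.pending, msg)
	if !q.running {
		q.running = true
		q.wg.Add(1)
		go q.run()
	}
	return true
}

// run drains pending messages and exits when the queue is empty or shut down.
func (q *queue) run() {
	defer q.wg.Done()
	for {
		q.mu.Lock()
		if q.shutdown || len(q.pending) == 0 {
			q.running = false
			q.mu.Unlock()
			return
		}
		q.pending = q.pending[1:]
		q.mu.Unlock()
		// A real implementation would write the message to the network here.
	}
}

// Shutdown marks the queue as closed, drops pending messages, and waits
// for the sender goroutine to exit.
func (q *queue) Shutdown() {
	q.mu.Lock()
	q.shutdown = true
	q.pending = nil
	q.mu.Unlock()
	q.wg.Wait()
}

func main() {
	q := &queue{}
	fmt.Println("before shutdown:", q.Enqueue("a"))
	q.Shutdown()
	fmt.Println("after shutdown:", q.Enqueue("b"))
}
```

Checking the shutdown flag under the same lock that guards the restart decision is what prevents a late enqueue from spawning a new goroutine for a peer that has already disconnected.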
Goroutine dumps from Boost
The goroutine diffs below were taken between startup and after executing 500 forceful disconnects.
Before
After
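For readers who want to take a similar before/after measurement in their own process, the following is a rough sketch using only the Go standard library. The helpers `goroutineDelta` and `leak` are hypothetical names introduced for illustration; this is not necessarily how the Boost dumps were collected.

```go
package main

import (
	"fmt"
	"runtime"
)

// goroutineDelta runs fn and reports how many more goroutines exist
// afterwards than before. A positive value after a workload that should
// have cleaned up suggests a leak.
func goroutineDelta(fn func()) int {
	before := runtime.NumGoroutine()
	fn()
	return runtime.NumGoroutine() - before
}

// leak is a deliberately broken workload: it starts n goroutines that
// block forever on a channel nobody closes.
func leak(n int) {
	block := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-block }()
	}
}

func main() {
	delta := goroutineDelta(func() { leak(5) })
	fmt.Println("goroutines added by workload:", delta)
}
```

A count only shows that goroutines remain. To see where they are blocked, `pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)` from `runtime/pprof` prints stack traces grouped by call site.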