ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.17k stars 3.01k forks source link

context/tracing in the blockstore/datastore pipeline #6803

Closed MichaelMure closed 2 years ago

MichaelMure commented 4 years ago

At Infura, we have to deal with the occasional performance issue. Even though we now have added tracing in the HTTP handler of go-ipfs-cmds (not upstreamed yet, it's opentracing and you are aiming for opencensus if I'm not mistaken?), the tracing essentially stop there.

The majority of the requests (and the perf issues) involve the data pipeline but it is essentially a black box. To resolve that problem, a proper tracing instrumentation there would be very helpful, but that imply adding a go context to most if not all blockstore and datastore functions.

Is that something you would be interested to pursue ?

cc @dirkmc maybe ?

Stebalien commented 4 years ago

We should be passing contexts through to the datastores but the refactor is going to be quite large and invasive and will affect multiple projects (ipfs, libp2p, filecoin, and probably quite a few more).

We should look into this next year but I don't have time right now to handle all the fallout of this refactor.

MichaelMure commented 4 years ago

master issue:

Datastore:

Blockstore:

IPFS:

libp2p:

MichaelMure commented 4 years ago

Overall it went fairly well with just go-graphsync to be worked around with `context.TODO().

Obviously this will need to be merged progressively, tagged and updated on the upper layers, but it does compile for me and seems to be working from a quick test.

BigLep commented 3 years ago

@MichaelMure : We're sorry this has been open for so long. Do you think you'd be available to work with a PL engineer later in May get this updated and pushed? We'll figure out the exact schedule if you're available, but wanted to see if that is possible. Thanks!

MichaelMure commented 3 years ago

@BigLep I might be but I'll have to see how things evolve on my side.

However I'm not sure I would be so useful: the problem here is not technical. Ignoring go-graphsync for a minute, coding wise it's actually fairly easy. It took me a few hours to code and open all those PRs. The few packages I didn't touch are very likely to be as easy.

The problem here is the coordination of all the project and their maintainers to land those changes without too much complications and someone to orchestrate that effort. It's not something I can do as an outsider.

MichaelMure commented 3 years ago

@BigLep that said, let me reiterate that this would be a big win for us, but not only. We are often facing performance issue or simply something not working right, and go-ipfs being a black box make it hard to understand and fix. Tracing has proved to be invaluable in other part of our stack to provide a quality service and react quickly on incidents. This is likely the feature that would help us the most.

In addition, this would allow us to give you a better feedback from our deployed infrastructure, like pin-pointing precisely a performance issue. It's always easier to design or fix things with hard numbers.

BigLep commented 3 years ago

@MichaelMure : thanks! The need makes sense and agreed that a key dependency is the coordination from a maintainer to land these. I'm hoping we secure a time later in the month with someone like @aschmahmann who will do the merging but can also have you on hand if there rebasing/updates that need to be done along the way. Does that make sense (and please let me know if that's not an effective strategy)?

MichaelMure commented 3 years ago

@BigLep end of May is very much here now, is that still on your radar? On my side I'm still quite busy but this is a quite important topic.

BigLep commented 3 years ago

@MichaelMure : thanks for checking in. This is still on PL's radar. I'd like to do it after we get the IPFS 0.9 release out. This is taking longer than planned. I think the week of 2021-06-07 is more likely. Any chance that would work for you?

MichaelMure commented 3 years ago

That looks like a reasonable plan, although I have to mention that I can't guarantee it.

BigLep commented 3 years ago

2021-07-01 conversation:

  1. We are thinking we should do a Go major bump for these packages.
  2. More thought/planning needs to be done around the names/versions for all the packages

Next steps: @aschmahmann create a new issue for creating the name/version plan.

BigLep commented 3 years ago

@MichaelMure : my apologies for not responding by end of last week on where we're at with this.

We discussed this internally some last week. We realized we need to write up a plan for how we plan to name/version these packages. Giving more thought here to names/versions may enable us to merge in a more progressive way (rather than the stop-the-world until everything is merged approach like we were originally considering). @aschmahmann is going to create the issue that describes what needs to be done here. At the minimum, for go-ipfs 0.10 we want to have this plan articulated. We are giving priority though to improving our ability to release quickly and likely won't have implemented the plan by then.

We'll update here when we have this issue. We're tracking creating the issue for the plan here

BigLep commented 3 years ago

Update on where this is at. This won't happen for at least another couple of go-ipfs releases. It's currently tagged for 0.13, although the scope of that release is still TBD. We know it's not in 0.11 or 0.12. The PL team's priorities have been to focus on our release velocity (but automating and simplifying the release process) and operational issues (in part related to the ipfs.io outage discussed here.

BigLep commented 3 years ago

@MichaelMure : quick update here that I'm trying to see about prioritizing this issue sooner as PL is focusing on its own efforts to improve the Gateways it's operating. @guseggert is a newer team member who may assign this to. Could you two connect and swap notes about the intended ways this will be helpful? I'll use that info for making the prioritization case.

MichaelMure commented 3 years ago

Sure, here is a few things where that could be helpful:

tracing/observability

This would be a massive help for running IPFS in production. Really, I can't overstate that. At the moment, go-ipfs is basically a black box: requests comes in, response comes out. What happen in the middle is hard to observe. There is prometheus metrics, the diag command or pprof but that's really surface level. This leads to difficulties to address emergency situations, understand and plan for the future.

The most critical subsystem in go-ipfs for performance and day to day operation is the data pipeline. Having a go context in there would allow to gradually and possibly independently add tracing instrumentation. The benefits would be:

Note also that this tracing would not necessarily be limited to go-ipfs: distributed tracing allow to propagate this tracing over the boundaries of connected systems (proxy, backend ...) which gives another dimension of observability.

handling cancellation / reliability

At the moment, there is no cancellation in the data pipeline. This means that once a request is started, nothing will stop it, even if the original request is gone. Fixing that would allow to trim that unnecessary fat, reduce the load and improve reliability. It might also prevent a form of attack where heavy handling is triggered with minimal effort.

request tagging / custom handling

Another way this could be helpful is that the go context allow to carry metadata about the request, independently of all the layers it's going through. This means that one could for example carry over a domain specific logger or tag a request with an origin or some customer information. When reaching a lower level, those information could be used to prioritize requests, do caching differently, route to a specialized backend ...

Importantly, this mechanism allows to extend the system independently without being bound by the Protocol Labs road map. This would lower your burdens and allow to explore more areas.

engineering feedback / optimisation

In my experience, each time observability improve, new issues are unearthed. Proper instrumentation would give node operator and PL the tooling to discover those long standing performance/reliability issues. 95% of the work in software engineering is to figure out where the problem comes from. Once that's done and understood, fixing stuff becomes easy.

node operator autonomy

Observability would allow node operator to more easily figure out what is happening and in turn, rely less on PL to diagnose those issues and reduce the burden on the development team.

Also discussed a bit at https://github.com/ipfs/roadmap/issues/74

BigLep commented 3 years ago

Well stated @MichaelMure . PL will respond back by EOD 2021-10-01.

BigLep commented 3 years ago

I missed not circling back on this last week since we did discuss it internally. @guseggert and team are going to pick this up. We're aiming to get this in the next go-ipfs release 0.11. The first step is to make the plan of how we can merge this in a progressive way.

BigLep commented 3 years ago

Will get more of the plan public, but internal scratchpad for thoughts on rolling this out is happening here: https://www.notion.so/protocollabs/Context-Plumbing-2b9fccf60db34ecb980b3068cabb9d50

guseggert commented 3 years ago

Update: I'm plumbing these changes through as pseudoversions on branches, to make sure it all works before publishing any new versions.

I will probably add some contexts in additional places, e.g. there are some interfaces in go-datastore that should have contexts too like CheckedDatastore, ScrubbedDatastore, GCDatastore, etc.

Here's the order to update the modules:

(script to generate the order: https://gist.githubusercontent.com/guseggert/fe079f793cbea3158538bdaa9f50878b/raw/d87c0ef9f1593dd7ce9acb0b38e003e9f455ba88/gistfile1.txt)

guseggert commented 3 years ago

Update: currently working through libp2p/go-libp2p-swarm as it depends on an older version of go-libp2p-peerstore, and the newer version has non-trivial backwards-incompatible changes.

(This module was not originally in the list, it was added after I fixed the script to generate the list.)

BigLep commented 3 years ago

2021-10-26 note:

  1. @guseggert will update this
  2. We'll link the new PRs
  3. We'll close out the old ones
guseggert commented 3 years ago

I've plumbed the changes through using pseudoversions on feat/context branches through all the repos, and fixed the issues that came up. I've added contexts to quite a bit more interfaces, so I re-did the plumbing work. Now I'm beginning to cut releases and plumb those through. Libp2p has some hole punching changes that are in-flight, and there are also sharding changes in-flight that could cause issues with the rollout here--if I run into any problems, I'm going to plumb through a pseudoversion instead of a release version, and that can be cleaned up separately after the issues are resolved.

BigLep commented 3 years ago

Thanks for the update @guseggert ! A couple of things I think would be useful for visibility when you can get to them:

  1. A list of all the repos we're going to touch (and in what order) with checkboxes to show current status?
  2. The PRs next to the list as they get created
guseggert commented 3 years ago

I am also adding the versioning workflows to all of these repos, which is taking some time to roll out (rerunning flaky tests, approving PRs, etc.).

Here are the repos (ordered):

guseggert commented 3 years ago

I've finished the majority of the plumbing, the rest are blocked on two things:

Once those are resolved, I can complete the plumbing and we will be ready to release go-ipfs v0.11.0-RC1

guseggert commented 2 years ago

I discovered yesterday that go-ipfs-blockstore is using the wrong version of go-ipfs-ds-help, it was inadvertently upgraded to v1, so I need to go back and downgrade go-ipfs-blockstore@v0 to use go-ipfs-ds-help@v0 and then re-plumb.

aschmahmann commented 2 years ago

closed by #8563