whyrusleeping opened this issue 5 years ago
This sounds like something that could be addressed with tracing, which go-log currently supports via opentracing.
We could first instrument the methods we want to collect timing information from with Start and Finish calls. This will give us the duration of each method's execution, along with any relevant metadata such as peer IDs and CIDs. These logs will be viewable via the ipfs log tail command, or by using a tracer like the go-jaeger-tracer plugin for ipfs.
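For illustration, a minimal sketch of that instrumentation, assuming go-log's EventLogger-style Start/Finish/SetTag helpers (exact signatures may differ across go-log versions); `findProviders` is a hypothetical stand-in for whichever method we want timed:

```go
import (
	"context"

	logging "github.com/ipfs/go-log"
)

var log = logging.Logger("dht")

// findProviders is a hypothetical stand-in for a method we want to time.
func findProviders(ctx context.Context, key string) error {
	ctx = log.Start(ctx, "findProviders") // open a span / timed event
	defer log.Finish(ctx)                 // record its duration when we return
	log.SetTag(ctx, "key", key)           // attach metadata (CIDs, peer IDs, ...)
	// ... existing lookup logic ...
	return nil
}
```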
If we want to collect traces spanning multiple peers (meaning we collect timing metrics from our local node and any other node participating in the request), we can use the serialize and deserialize methods to pass the context along with the request. This will likely require adding optional byte array fields to the DHT and Bitswap protobufs, and it will require that all nodes we wish to collect traces from be connected to the same tracer -- this will be tricky wrt infra.
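A rough sketch of that propagation, assuming go-log's serialize/deserialize tracing helpers (SerializeContext / StartFromParentState; names and signatures may differ by go-log version) and a hypothetical `TraceState` bytes field on the wire message:

```go
import (
	"context"

	logging "github.com/ipfs/go-log"
)

var log = logging.Logger("bitswap")

// Message is a hypothetical stand-in for the DHT/Bitswap protobuf message,
// extended with an optional bytes field carrying trace state.
type Message struct {
	TraceState []byte
}

// Sender side: serialize the live trace context into the outgoing message.
func attachTrace(ctx context.Context, msg *Message) {
	if state, err := log.SerializeContext(ctx); err == nil {
		msg.TraceState = state
	}
}

// Receiver side: resume the span tree from the serialized parent state.
func handleMessage(ctx context.Context, msg *Message) {
	ctx, err := log.StartFromParentState(ctx, "handleMessage", msg.TraceState)
	if err != nil {
		ctx = log.Start(ctx, "handleMessage") // no usable parent state; start fresh
	}
	defer log.Finish(ctx)
	// ... existing message handling ...
}
```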
@eefahy do you think we could run a jaeger tracer instance and feed it data from one of the gateways?
Yeah, absolutely. It might not be this week though unfortunately. Too swamped.
@lanzafame was just doing some of this tracing using opencensus vs. opentracing. Got any notes for us here, Adrian?
> If we want to collect traces spanning multiple peers (meaning we collect timing metrics from our local node and any other node participating in the request), we can use the serialize and deserialize methods to pass the context along with the request. This will likely require adding optional byte array fields to the DHT and Bitswap protobufs, and it will require that all nodes we wish to collect traces from be connected to the same tracer -- this will be tricky wrt infra.
We're not going to want to do this on the gateways (or any public node for that matter). However, we should be using tracing to get timing like this.
But remember, if this is going to happen, someone needs to own it and see it all the way to the end.
I would suggest OpenCensus (OC) as it provides both tracing and metrics, with the tracing output being compatible with OpenTracing collectors.
Also with OC, you can record request latencies on the gateways, as well as other HTTP server stats, via the ochttp plugin, and then when testing you can turn tracing on as well to get the spans. The metrics can be exported to prometheus, which we already have set up. OC also allows packages to provide stats views that are collected only if the user registers them; see here for more details.
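As a rough sketch (not wired into go-ipfs; the handler and metric namespace below are placeholders), wrapping the gateway mux with ochttp and exporting the stock server views to Prometheus could look like:

```go
import (
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/plugin/ochttp"
	"go.opencensus.io/stats/view"
)

func main() {
	// Export registered OC views to Prometheus (which we already scrape).
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "ipfs_gw"}) // namespace is a placeholder
	if err != nil {
		panic(err)
	}
	view.RegisterExporter(pe)

	// Register the stock ochttp server views: latency, request count, bytes in/out.
	if err := view.Register(ochttp.DefaultServerViews...); err != nil {
		panic(err)
	}

	gw := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok")) // stand-in for the real gateway handler
	})

	mux := http.NewServeMux()
	mux.Handle("/metrics", pe) // Prometheus scrape endpoint
	mux.Handle("/", &ochttp.Handler{Handler: gw})
	http.ListenAndServe(":8080", mux)
}
```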
> But remember, if this is going to happen, someone needs to own it and see it all the way to the end.
Yes please! This should also be tied to some project planning as whatever we learn from this tracing (now and in the future) will need to be a resourced and prioritized fix as well.
This might be obvious, but is worth mentioning: if you're planning on attaching request tracing to the gateways, using a sampling technique of some sort would be smart. I've stupidly blown things up by tracing all requests on even a low-volume service. Turns out those contexts get very large very quickly and clog the tubes. Tuning what to sample is the tougher bit. Starting with something in the vicinity of 1% of total request volume is a sane starting point, and usually gives lots of data for initial thinking.
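In OpenCensus terms that starting point is just a probability sampler, e.g. (sketch only):

```go
import "go.opencensus.io/trace"

func init() {
	// Trace roughly 1% of requests; individual spans can still opt in
	// explicitly via trace.WithSampler at span-creation time.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.01),
	})
}
```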
It'd be really nice to be able to attach tracing to specific peerIDs. Gateways would always enable tracing for requests that touch a given PeerID. Project maintainers could provision peerIDs known to the Ops team and configure their local machines to send tracing spans via a secure backchannel to gateway loggers. That way you get a real picture of how IPFS is behaving in production, combining tracing details across the network. If this existed we'd definitely configure our own gateways to do the same, logging to our tracing systems, and would happily share our findings.
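One way that could look with OpenCensus (the allowlist, peer ID, and span name below are all hypothetical):

```go
import "go.opencensus.io/trace"

// tracedPeers would be provisioned by the Ops team; the entry here is made up.
var tracedPeers = map[string]bool{
	"QmSomeMaintainerPeerID": true,
}

// samplerFor always samples requests touching an allowlisted peer and falls
// back to a low default rate for everyone else.
func samplerFor(peerID string) trace.Sampler {
	if tracedPeers[peerID] {
		return trace.AlwaysSample()
	}
	return trace.ProbabilitySampler(0.01)
}

// Usage at span creation time:
//   ctx, span := trace.StartSpan(ctx, "gateway.fetch", trace.WithSampler(samplerFor(pid)))
//   defer span.End()
```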
Really, we want to trace requests from the web. However, we can probably find quite a few issues by just tracing our own traffic (by IP) instead of tracing everyone's traffic.
We need to start tracking gateway response latencies a bit better. Doing so will give us some really valuable information about where our efforts will be best applied in making ipfs faster.
For every request that comes in, unless we already have it cached, we generally spend some amount of time searching for peers who have the content, and then some amount of time connecting to said peers, and then finally some amount of time actually fetching that content.
The average request consists of a single hash. In this case, the first node gets resolved by ResolvePath. All of the content routing work and swarm dialing happens during the execution of a single 'dag.Get' call; I suspect this is where the bulk of the latency comes from. Within dag.Get, this work is further abstracted down into bitswap. Bitswap will search for peers, connect, and then request data for each request internally.

My approach would be to pass some sort of request ID down along with the context, hoping it makes it all the way to bitswap, then use that to tie logged timing information (how much time is spent in which activities) to a given request.
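A sketch of the request-ID idea (all names here are hypothetical; nothing in go-ipfs currently does this):

```go
import (
	"context"
	"log"
	"time"
)

// reqIDKey is a private context key for the request ID.
type reqIDKey struct{}

// WithRequestID is attached near the top of the gateway handler so the ID can
// ride the context through ResolvePath -> dag.Get -> bitswap.
func WithRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, reqIDKey{}, id)
}

// timed runs fn and logs how long the named activity took for this request,
// so per-activity durations can be tied back to a single gateway request.
func timed(ctx context.Context, activity string, fn func() error) error {
	start := time.Now()
	err := fn()
	id, _ := ctx.Value(reqIDKey{}).(string)
	log.Printf("req=%s activity=%s took=%s err=%v", id, activity, time.Since(start), err)
	return err
}
```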
cc @Stebalien @hannahhoward @frrist