carllerche commented 5 years ago

Tokio provides an instrumentation API using Tokio Trace as well as a number of instrumentation points built into Tokio itself as well as the Tokio ecosystem. The goal of the project is to implement a subscriber to these instrumentation points that exports the data out of the process via a gRPC endpoint. Then, implement a console based UI that connects to the process and allows the user to visualize the data.

Expected outcomes

A Tokio Trace subscriber that listens to all events
A gRPC server that exports instrumentation data to connected clients.
A console based UI that allows users to query and visualize the data.

Skills

Rust
gRPC

Difficulty level

Medium

matprec commented 5 years ago

Libraries for TUI:

tui-rs

Backends:

termion: windows support is WIP, see respective PR
rustbox: currently wraps c-lib termbox (static linking)
crossterm: missing mouse support
pancurses: c-libs: ncurses (dynamic linking, linux), pdcurses (static linking, windows)
Cursive

Backends:
ncurses-rs: c-lib: ncurses (dynamic linking)
pancurses: c-libs: ncurses (dynamic linking, linux), pdcurses (static linking, windows)
termion: windows support is WIP, see respective PR
BearLibTerminal: c-lib: bearlibterminal (dynamic linking)

Thoughts on backends

There is no silver bullet: Either platform/feature support is missing, or requires installing external libraries. Though that choice could just be exposed and left to the user via feature flags. In that case, care has to be taken to make the UI usable with keyboard support, with mouse support being optional.

`tui-rs` vs `Cursive`

Both seem actively developed, though Cursive seems more used (crates io downloads, commits, contributer). Cursive seems to have a more traditional styling, while tui-rs has a more modern default look.

carllerche commented 5 years ago

@MSleepyPanda Thanks for the incredible survey 👍

Just looking at tui-rs vs. Cursive, the tui-rs demo looks really nice.

matprec commented 5 years ago

I'm working through the issues and PR's labeled with tokio-trace to get up to speed with the current implementation. I'd also like to collect a list of useful debug scenarios. @carllerche you already mentioned hanging tasks on gitter, are there user reports/issues which would be useful to analyze? Best case with repro/debugging timeline, for analyzing user needs in the respective scenario.

I think it makes sense to categorize scenarios into the following groups:

User error: Debugging system interaction (e.g. misconfigured timers, request routing, unscheduled tasks)
Implementation error: erroneous future implementation, bugs in tokio itself (issues labeled bug)
Miscellaneous/Tooling: latency / throughput analysis, self profiling

Also, documentation is really important here! Tools are nothing if they cannot be found or understood.

Debug tutorials to get people familiar with the tool
Performance / feature comparison with traditional debuggers and log
Implementation guides for 3rd party library authors

carllerche commented 5 years ago

@hawkw Maybe you have thoughts on what the quickest way to get up to speed on tokio-trace would be?

you already mentioned hanging tasks on gitter, are there user reports/issues which would be useful to analyze

Unfortunately, I have not personally kept track of these bugs. I can try to describe some scenarios.

Returning `NotReady` without "registering" the current task

The most obvious cause for a hang is a Future implementation that returns NotReady without actually calling task::current() and doing something w/ the current task. In this case, the system has no way of ever being notified.

The way to track this would be for Tokio to be able to hook to hook into the instrumented future as well as hook into task::current. Tokio is currently able to do both.

So, if the instrumented future returns NotReady but task::current() or task::will_notify_current() are not called, this is a bug.

This detection should only happen in debug mode as it would be expensive.

Tracking what the future is waiting for

Continuing from above, it is possible to get an idea about what each future is waiting on by watching calls to task::current on each call to the root future's poll function. Then, track its dependencies.

For example, if Task a calls mpsc::Receiver::poll, that implies it depends on the Send half of the mpsc. This can establish a dependency on the task (or tasks) that hold Sender handles.

So, in the console, you could see "oh, Task A is blocked, let me list all its dependencies"... and see the tasks that hold the sender handles. Then, from there, you can figure out why those tasks aren't completing.

If tokio is fully instrumented, you could get info like "the task is waiting for data to be received on this TcpStream".

Part of this ties into the structured logging capabilities of tokio-trace. Once you identify which task is causing problems, you can use tokio console to display the logs for just that task.

Initial Release

There are a huge number of things we can do with Tokio Console and all the data that Tokio Trace will unlock. It will be important to stay focused on an initial release and go from there.

I think the first thing that would be a big value add is to be able to isolate logs by task. Right now, if there are more than one concurrent tasks, it is basically impossible to figure out what is going on in the logs. In scenarios w/ two connections proxying HTTP/2.0 data, the current logging situation is unusable.

The very first useful feature should be Tokio Console writes logs to the terminal w/ a task ID assigned to each log entry. Then, you can filter by task ID.

The next step would be to allow spawning tasks in Tokio w/ a name assigned to it. This would provide additional data in the console.

We can go from there.

hawkw commented 5 years ago

Re: tui-rs vs. Cursive, my gut feeling is that dynamic linking with ncurses (the approach for most of Cursive's backends) is probably going to be more painful to support (different user environments might have different ncurses versions and so on). On the other hand, ncurses might have better mouse support than the other backends...

I also noticed that Cursive has a wiki page comparing it with tui-rs, where they suggest that tui-rs "is well suited for flat dashboards, and other often-refreshing applications with little interaction.", which seems to describe our use-case.

jonhoo commented 5 years ago

It'd be really cool if Tokio's tracing infrastructure also provided USDT ("Userland Statically Defined Tracing") trace points, which allow you to hook into powerful existing tracing and profiling tools like eBPF. When a program contains USDT points, it can be easily instrumented using a tool like bpftrace (more info), and this can provide much richer profiles than can be obtained with perf and the like.

There has been some effort (https://github.com/rust-lang/rfcs/issues/875) to get USDTs into the Rust compiler directly, but for the time being it looks like libprobe by @cuviper is the way to go. It requires nightly for asm, but that's about it. There's also some work over at rust-usdt, but as far as I can tell that particular effort was discarded in favor of pushing towards compiler integration.

hawkw commented 5 years ago

@jonhoo I definitely agree that it would be great to get USDT integration in tokio-trace, but I'm not sure if it falls within the scope of this ticket?

jonhoo commented 5 years ago

@hawkw probably not, but @carllerche asked me to, so I did :p

carllerche commented 5 years ago

Getting ideas out is good. There may be more than one student interested too.

hawkw commented 5 years ago

@carllerche oh, for sure! I just thought tokio-trace/USDT integration just seems something that deserves its own separate project?

carllerche commented 5 years ago

Here is a hang report: https://github.com/tokio-rs/tokio/issues/965

TimonPost commented 5 years ago

I have two comments:

GRPC Although I am questioning if we should tunnel vision onto GRPC, like there are scenario's where a person might want to read data with C#, use InfluxDB for graph support, have the metrics in an AMQP like RabbitMQ, Redis etc.

Maybe we could think of some abstraction that allows the user to specify how to handle the data. A default implementation from Tokio could be GRPC.

"crossterm: missing mouse support"

I am the owner of crossterm, currently, there is a PR of mouse and more key input support. This will be finished any time soon.

carllerche commented 5 years ago

@TimonPost Tokio-console is a layer on top of Tokio-Trace. Tokio-trace is decoupled from any subscription / export format. So, anyone can implement a Tokio-Trace subscriber that exports to influxdb.

The Tokio-console project is to implement a subscriber that exports via grpc. I don’t think that it would make much sense to abstract it more.

hawkw commented 5 years ago

@TimonPost

I have two comments:

GRPC Although I am questioning if we should tunnel vision onto GRPC, like there are scenario's where a person might want to read data with C#, use InfluxDB for graph support, have the metrics in an AMQP like RabbitMQ, Redis etc. Maybe we could think of some abstraction that allows the user to specify how to handle the data. A default implementation from Tokio could be GRPC.

The data used by the tokio console will be generated by tokio-trace instrumentation. tokio-trace is intended to be flexible and extensible; it has a pluggable Subscriber interface allowing users to specify how trace data should be collected and recorded. The expectation is that eventually there will be libraries providing Subscriber implementations for reporting tokio-trace data to InfluxDB, and AMQP, as well as Prometheus, OpenTracing, Splunk, Zipkin, etc.

You're certainly correct that gRPC should definitely not be the only exposition format available for tokio-trace data. However, this goal of this project is specifically to implement a command-line console application for monitoring data exported by tokio-trace, and gRPC seems like a good fit for this use-case in particular. The use of gRPC here doesn't preclude the implementation of subscribers for InfluxDB or AMQP.

Edit: looks like Carl beat me to it... :)

hawkw commented 5 years ago

@TimonPost Oh, I realised I should also add that if you're interested in writing tokio-trace Subscriber implementations for InfluxDB or AMQP, please don't hesitate to reach out!

I'd really like for there to be more subscriber crates available, and I'm happy to provide guidance and advice on how to write one (especially since there aren't yet a lot of existing implementations to base a new one on).

TimonPost commented 5 years ago

@carllerche @hawkw alright, thanks for clarifying, must have understood the goal of this task wrong. If GRPC is only used for a console logging then I think it is great.

I was going to investigate into the mio-windows issue for GSOC, however, I could take a look into Tokio trace AMQP too, unfortunately, the time I have is a bit low now, lot's of cool projects I have to do some work for already :), I'll let you know when I have a little more time to investigate into that, seems fun to do, thanks again.

hawkw commented 5 years ago

@TimonPost The AMQP subscriber is not particularly urgent; I just thought I'd mention it in case it was something you wanted to see right away. So it definitely doesn't have to be your top priority. Let me know if or when you are interested in hacking on something like that, though, and I'd be happy to help you get started!

matprec commented 5 years ago

Quick update: Exams are currently eating up my time, but next week i'll dig into implementing a toy subscriber logging to console. Next step will be moving logging to a simple tui (probably tui-rs), but within the same process. This will give a good overview of the requirements, interactions and scope of the project.

matprec commented 5 years ago

Update Crossterm is developing mouse input handling support at the input_handling branch. Once that merges into master, tui-rs will have to update to that version. Also i intend to use that specific combination, but for now i'm using pancurses.

I've implemented a demo-ish subscriber and verified that i'm getting the equivalent output by running it side by side with tokio-trace-fmt.

Organization I suppose that the project will live under the tokio umbrella. Is the plan to use a dedicated repository or will it live within tokio or tokio-trace-nursery?

Tokios minimum supported rust version is 1.26.0. Is it a requirement for the project as well? Policy for tests is that they can use never versions. I think that for end users it would make most sense to be able to compile the project with the same compiler used for tokio, even though we could distribute binaries.

Naming I'd stick with tokio-console for now, so we can decide if we want to rename it while we go. I like Inspector but i haven't settled on anything yet :)

hawkw commented 5 years ago

Update I've implemented a demo-ish subscriber and verified that i'm getting the equivalent output by running it side by side with tokio-trace-fmt.

Nice! Let me know if you need any feedback on your code --- I'm happy to take a look.

Organization I suppose that the project will live under the tokio umbrella. Is the plan to use a dedicated repository or will it live within tokio or tokio-trace-nursery?

I think we'll probably want to make a new repo, maybe tokio-rs/console.

Tokios minimum supported rust version is 1.26.0. Is it a requirement for the project as well? Policy for tests is that they can use never versions. I think that for end users it would make most sense to be able to compile the project with the same compiler used for tokio, even though we could distribute binaries.

We will definitely want the subscriber implementation to be compliant with the minimum version policy if possible, since it would be included in user code as a dependency. However, I think this is less important for the console UI application, since it can just be distributed as a binary. Perhaps @carllerche has a different opinion, though.

matprec commented 5 years ago

@hawkw Feedback would be nice! I've opened a pull request at my repository for a review. Organization wise, would this count as a patch for the application?

TimonPost commented 5 years ago

@MSleepyPanda, crossterm input handling is almost finished. And is probably released this or next week. Here is the tracking issue

EDIT

Crossterm does support input handling in and is available in 0.8.0.

matprec commented 5 years ago

~Potentially, i've just found an issue with severe consequences:~

Thought experiment: What if we wanted to instrument a subscriber implementation with tokio-trace, maybe the event record method? Wait that would result in an endless loop since each call to event would cause another event, that would be silly and eventually blow up the stack. That wouldn't pass review.

The proposed design would result in a very thin subscriber anyway, since that would probably just pipe through calls, e.g. to event, via gRPC to the console process, so why bother?

~Because the conclusion from the thought experiment is not, just don't instrument your own code, but *you must not use any dependency that is ultimately instrumented via tokio-trace or even log()**! The crux is the 1:n relationship that builds up for each event. So even if you'd just send messages to another thread via channels, which then sends everything, you'd cause events there which will be instrumented as well.~

(*): Log compability is optional via feature flag, but since cargo features simply stack up, just one dependency declaring the compatibility feature suffices to enable it for everyone.

~This rules out especially tower-gRPC~, since it is instrumented via log. But even if one would carefully choose dependencies not using the log compatibility feature, eventually tower-gRPC uses tower, which uses the tokio-* crates, which will be instrumented via tokio-trace, closing our circle and ending up in recursion.

Are mitigation strategies possible?

Not declaring interest in the respective callsites would work, but that means that those dependencies could never be instrumented in applications using them on their own
Just don't (carefully) use (other, not instrumented) libraries then and stick to a bare bones implementation with std primitives might work out
Extract the information via shared-memory/abuse stdio and let the sender live in a child process, or other kinds of IPC

I don't see an easy solution here and i sincerely hope that i'm just missing something here.

hawkw commented 5 years ago

@MSleepyPanda that's definitely a reasonable concern, but I think there's another way around it that you're missing:

We would probably want the thread that actually runs the grpc server to be outside of the subscriber context entirely (so we wouldn't wrap it in a dispatcher::with_default(...)). This means that any events occurring on the gRPC server would never make it to the Subscriber. The Subscriber implementation itself could be pretty lightweight and just forward events to the gRPC server thread using a channel that isn't instrumented.

Alternatively, if we need to run the gRPC server task on the same executor as the rest of the application instead of spawning a dedicated thread, we could have some thread-local flag which gets set every time we enter the subscriber (i.e. by a call to event, etc) indicating that we are now inside the subscriber context (we could even do this using a tokio-trace span). Any events etc that occur while inside this context can be ignored. This way, the callsites in libraries can still be enabled when the application hits them but they're ignored when inside the subscriber.

matprec commented 5 years ago

@hawkw Oh i have missed that one could spawn a thread outside of the subscriber context. I was still thinking in terms of the log implementation, where the Logger would have been set globally. That might be the solution.

hawkw commented 5 years ago

@MSleepyPanda yeah, unlike log, tokio-trace's Subscribers are scoped. Part of the motivation behind that is for cases like this!

hawkw commented 5 years ago

@MSleepyPanda actually, now that I give it a little more thought, we don't even need to run the gRPC server on a dedicated thread --- we can run it on the same executor as everything else, if we use the with_subscriber combinator from the tokio-trace-futures crate on the gRPC server future to set the subscriber to a no-op subscriber. That way, when the gRPC server's future is being polled, it switches the subscriber context automatically to a no-op subscriber, and when it yields, we switch back to the console subscriber. We'd just need to ensure any other futures the gRPC server spawns are also instrumented by a no-op subscriber.

We could even use a different subscriber (such as tokio-trace-fmt) here if getting traces out of our gRPC server is necessary for debugging purposes.

carllerche commented 5 years ago

In general, we almost certainly do not want subscribers instrumented w/ tokio-trace to dispatch events to themselves.

@hawkw This brings up a good point. When calling a subscriber, the dispatcher probably wants to remove itself from the thread-local variable.

matprec commented 5 years ago

As Kanishkar has expressed interest in this topic as well, we should make sure to split the topics sensibly.

I'd propose splitting the project into these subtopic:

A library which connects and listens to spans and events happening on the server side, emitting them locally via a subscriber like interface. This includes a server side subscriber + the gRPC implementation
A storage layer which builds upon 1. and provides an useful interface for aggregation, metrics, queries, etc.
The (T)UI itself, which builds upon 2. and tries to display useful information, makes the information browsable etc. This includes designing useful widgets, different views (waterfall, timeline srubbing), filters, etc..

Kanishkar has expressed that he would like to work on the subscriber part (topic 1.), since i already did research into the ui part (3.). While i don't mind working on either, i'd be fine with this and already have some ideas!

This would enable us to work isolated for our core topics, while cooperating and interacting on topic 2..

I'd love to hear some feedback on this @hawkw, @kanishkarj!

kanishkarj commented 5 years ago

This seems great! Part 2 didn't click me, well it's great that you pointed it out.

Yeah, I guess I could work on the subscriber part. I shall try to understand how a subscriber works so as to define the protobuf schema as soon as possible so that we could work independently.

carllerche commented 5 years ago

The proposed breakdown is good. Please make sure to submit your proposals via the google summer of code UI by April 9th 👍

kanishkarj commented 5 years ago

@carllerche yeah definitely :)

tokio-rs / gsoc

Tokio console #1

Expected outcomes

Skills

Difficulty level

Libraries for TUI:

tui-rs

Cursive

Thoughts on backends

`tui-rs` vs `Cursive`

Returning `NotReady` without "registering" the current task

Tracking what the future is waiting for

Initial Release

EDIT

Are mitigation strategies possible?

tokio-rs / gsoc

Tokio console #1

Expected outcomes

Skills

Difficulty level

Libraries for TUI:

tui-rs

Cursive

Thoughts on backends

tui-rs vs Cursive

Returning NotReady without "registering" the current task

Tracking what the future is waiting for

Initial Release

EDIT

Are mitigation strategies possible?

`tui-rs` vs `Cursive`

Returning `NotReady` without "registering" the current task