jodh-intel commented 6 years ago

See https://github.com/kata-containers/kata-containers/issues/27.

jodh-intel commented 6 years ago

The plan for adding opentracing support to the agent is currently not clear: nominally, a "host system" will run a Jaeger agent (which the application will send spans to via UDP). However, we really do not want to have to add another component to the guest images.

Further, adding an extra component to the image when the agent is running as PID 1 will require custom code to launch the agent at VM startup... by the agent. Clearly, that isn't going to work unless the Jaeger client could "buffer" the spans in readiness for sending to the Jaeger agent when it eventually started.

The current thinking is to mandate a Jaeger agent in the host environment only (the host the runtime runs on). We could use a virtio-serial port or vsock to allow the agent to send spans out of the VM back to the agent running outside the VM. However, vsock can't really be the only solution given some users run old host kernels that do not have vsock support (and they cannot change the host kernel).

The current thinking is that we either offer multiple approaches to extracting the spans from the agent inside the VM (yuck), or we:

Use a virtio-serial port.
Configure Jaeger to log to shared memory and share in a block of shared memory from the host into the VM for the agent to write spans into. However, we need to know what happens if the block is too small / becomes "full" - does Jaeger support circular logging buffers, etc?

/cc @sameo, @mcastelino, @sboeuf, @bergwolf, @grahamwhaley.

grahamwhaley commented 6 years ago

Just a brief (slightly devils advocate) thought.... given:

we have enabled vsock in the runtime now
pretty much everything that happens in the agent is driven from the runtime, so 'one end' of the conversation is already being traced....

then I'd not be unhappy if we said, at least initially, that we would implement agent tracing via VSOCK, and if you don't have VSOCK available on your system, then, well, sorry, but you still have 90+% of the useful information just by looking at the runtime traces.

We could always put non-VSOCK as a secondary item, and if a member really really needed it, they are free to work with us to PR it ;-)

jodh-intel commented 6 years ago

I guess that could work, but it would be good to get input from some power users about whether this approach would work for them.

P.S. You missed an opportunity to use :imp: :smile:

jodh-intel commented 5 years ago

We're long overdue for an update on this issue, so brace yourself as I attempt to summarise the current situation...

Current serial behaviour

Startup

The default configuration causes the agent to connect to a virtio serial channel. That serial channel is wrapped in a yamux connection. The server uses this channel to create a gRPC server. The server serves "forever" - that is to say, the agent never exits the grpc packages Server.serve() function.

Shutdown

The agent "never exits"; once the workload has finished, the runtime sends a QMP QUIT command to shutdown the VM as it is no longer required.

However, iff a QMP QUIT where not sent, the gRPC server would exit, allowing the agent to exit.

Current vsock behaviour

Basically the same as serial but with one important exception - the gRPC server cannot exit since the vsock code path does not use yamux. This means the server has no way of detecting the client has disconnected and just blocks forever waiting for a new client to connect.

Ideal behaviour

The agent would exit the Server.serve() call to allow for a controlled shutdown. That would allow all trace spans created since agent startup to be sent to the Jaeger agent.

Given the design of the grpc package, golang constrains and our requirements, the only clean way I can find to shut down the agent in a controlled fashion is to require yamux since that layer provides the agent with the ability to detect when a client (kata-runtime) disconnects. The current tracing design also requires vsock to be enabled to allow both the VM's journal message and the trace spans to be collected outside the VM. Adding yamux to the vsock solution adds overhead but does provide the required semantics whereby a StopServer gRPC call can actually stop the server servicing that call since as yamux detects the client disconnect, the Server.serve() call ends allowing the agent to shutdown.

Notes

Note that we could create two codepaths and only enable vsock with yamux if:

The runtime is configured for agent tracing.
The configuration.toml sgodoc.org/google.golang.org/grpc#Server.Serve) call ends allowing the agent to shutdown.

Note that we could create two codepaths and only enable vsock with yamux if:

The runtime is configured for agent tracing.
The configuration.toml specifies use_vsock=true.

That would be annoying in that we'd have two paths to test, but would not impact the default no-proxy configuration.

Challenges

With vsock+yamux, the agent shuts down cleanly meaning that full trace is available.... theoretically :)

However, the overall system does not work. The problem is that whenever a client disconnect occurs, the agent's gRPC server will exit (and restart).

Forcing the existing keep-alive option in the runtime helps to stop the agent gRPC server from exiting, but is insufficient since disconnects still occur in the runtime and that leads to the runtime getting gRPC errors it does not expect ("request error: rpc error: code = Unavailable desc = transport is closing").

These errors caused by the fact that the api calls in api.go are standalone/discrete - the client (runtime) calls API like CreateSandbox() and StartSandbox() but APIs like these connect to the agent, perform one or more gRPC calls, then disconnect. We could tweak the runtime to tolerate such errors, but it's not ideal as the runtime has no way of knowing if such an error is caused by the agent's gRPC server restarting or if the agent just died.

One solution to this would be to have a new API that stores such options in a "handle" that is passed between the API calls for options like the existing keepConn. This would allow the code to be reworked fully so that only a single agent connection could be maintained (avoiding disconnects). But in fact we already have such a handle - all the API calls now specify a context.Context as their first argument. The complication is that the default context is actually an int so we would need to create a custom context type based on a struct pointer which implements the Context interface (looks quite easy to do).

Compromise behaviour

Adding yamux support on top of vsock should not really be necessary.

What we could do is to only trace the gRPC calls themselves. This could be accomplished by:

Changing the agents gRPC protocol so that all functions return a context.Context (currently, most return google.protobuf.Empty).
Make the agent create a trace span in each gRPC call (in the unary and stream gRPC interceptor functions). This could be a bit fiddly but looks possible.
Add the trace to the context.Context passed to each call (from the runtime).
Returns the "enriched" context as the gRPC functions return code.
Update the runtime to extract the agent span details from the returned context and submit that to the Jaeger agent.

Pros

This approach would no longer necessite requiring vsock for tracing meaning you'd be able to see agent trace even if you use a kata-proxy.

However, if vsock is enabled, we could still leverage that to pass the VM's journal messages out using the kata-journald-host-redirect.service systemd service and the accompanying vsock-to-text-server.sh script.
The agent would no longer need to shutdown cleanly as only the gRPC server calls would be traced.
Since the runtime would submit the agent traces to the Jaeger agent, none of the extra "scaffolding" currently required for tracing the agent would be required (we no longer need the jaeger-client-socat-redirector.service systemd service nor the vsock-to-udp-server.sh script).

Cons

We wouldn't be able to trace the agent setup and teardown (the code either side of the gRPC server it runs).

jodh-intel commented 5 years ago

/cc @mcastelino.

sboeuf commented 5 years ago

@jodh-intel I worked on this today, and I have been able to reproduce the weird behavior of the grpcServer.Serve() not returning... But the good news is that I found the fix for this. The culprit is the vsock Go package itself that had a bug, and by moving to the latest commit 676f733b747cd6406f297a51bd086ee8ec8abdbe our agent vendoring, things get fixed :)

So to summarize, I have been testing with vsock enabled + my PR https://github.com/kata-containers/agent/pull/448 + updated mdlayher/vsock vendoring. Make sure you build the agent with Go 1.11, as 1.10 might run into issues from the quick look I had at the patches. And then you have fun by running a simple docker container:

docker run -it --rm --runtime=kata-runtime busybox
#

If you want to make sure about the grpcServer.Serve() exiting without using any extra traces, you can use the following command:

sudo socat "stdin,raw,echo=0,escape=0x11" "unix-connect:/run/vc/vm/<containerid>/console.sock"

right after you started your container. The <containerid> can be simply found by looking at the qemu command line, or listing what's under /run/vc/vm/ for instance.

Anyway, the point being that you should see the following traces from the socat terminal:

level=info msg="stopServer signal received"
level=info msg="gRPC server stopped"
level=info msg="agent grpc server has been explicitly stopped"

at the moment you exit your container.

Oh yeah, one more thing, I have added a time.Sleep(5*time.Second) in the runtime code to make sure I was not triggering ExecuteQuit() too fast. Of course, this is simply to check the server had the time to tear down.

Let me know if vsock revendoring help, this way we can rule out the usage of yamux.

jodh-intel commented 5 years ago

@sboeuf - Mon Dieu! After spending the morning trying to recreate your findings, I think I now have. I was already using the latest vsock, but I was still using a go 1.10 compiler. Switching to 1.11 I am seeing what looks like the correct behaviour :pray:! It's like a miracle :joy: Compiler bugs suck though :frowning_face:

More testing, but this is looking promising at last!

sboeuf commented 5 years ago

Glad to hear that :+1: Please let me know about your progress now that we have solved the bulk of the problem!

mcastelino commented 5 years ago

@jodh-intel given that opencensus which supports creating custom exporters I attempted to create a simple poc to allow exporting of traces over any network transport. And at least conceptually this should work.

https://github.com/mcastelino/custom_exporter

Here I create a custom exporter which serializes the traces into the thrift format which is used by jaeger. On the far end the traces can be dumped to jaeger itself using the standard exporter

https://github.com/census-instrumentation/opencensus-go/blob/master/exporter/jaeger/agent.go#L41:6

Basically you can chop the exporter into two parts

On the host we would need something lile

https://github.com/census-instrumentation/opencensus-go/blob/master/exporter/jaeger/jaeger.go#L319

jodh-intel commented 5 years ago

Hi @mcastelino - opencensus sounds interesting but atm Kata is solely using opentracing.

As such, I'd like to focus on landing basic agent tracing support using opentracing as soon as possible.

Once that is in place, we can consider alternative tracers by raising the topic at the Architecture Committee meeting (and since it sounds from what you're saying that most of the alternatives are mostly compatible with each other, it shouldn't be too disruptive to switch at a later date).

sboeuf commented 5 years ago

@jodh-intel I'd argue that we should use the custom exporter as part of the initial PR for Opentracing. The reason is simple, by using this exporter, we can use some Go code to handle the vsock connection, while for now the PR introduces more systemd services to start scripts handling the vsock to UDP translation.

The huge benefit being than on the agent side, there will be no need for any vsock translation, and the agent code will be only manipulating generic Opentracing code.

Last point, by removing the systemd services, we will be able to support Opentracing when agent is running as init as well.

mcastelino commented 5 years ago

@sboeuf @jodh-intel we can do this after James's initial PR lands. I just wanted to highlight that moving to open census allows to address the issue once and for all. The bulk of the code should remain mostly unchanged. So the primary goal of moving to open census was not to support alternate backend. But more about getting back some control in our code so that we can control how we cross across the VM boundary.

With opencensus we can do the following

use vsock directly in the exporter
this allows us to run the agent as init (and not mandate systemd)
it also allows us in the future to support alternate backends (however this will need more work as the marshalling across vsock may end up being backend specific).

Note: I was was forced to marshall to thrift because the opencensus span had one unexported field. If all fields of opencensus were exported it would make our lives much easier.

The offending field is https://sourcegraph.com/github.com/census-instrumentation/opencensus-go@master/-/blob/trace/tracestate/tracestate.go#L42

// Tracestate represents tracing-system specific context in a list of key-value pairs. Tracestate allows different
// vendors propagate additional information and inter-operate with their legacy Id formats.
   type Tracestate struct {
       entries []Entry
   }

sboeuf commented 5 years ago

@mcastelino but if we land @jodh-intel PR first, we would need to also merge the rootfs part of it including new systemd service files, and then later, when we include a custom exporter, we'll need to update the rootfs again by removing those files. This sounds like a lot of back and forth while we could make things nice the first time.

jodh-intel commented 5 years ago

@sboeuf, @mcastelino - I'm definitely in favour of dropping vsock/udp scaffolding code if we can replace it with something simpler. However, opencensus is going to take (me atleast) time to grok and internalise. That's not to say we shouldn't do it of course, but fwics we'd need to add support for another protocol definition language in the agent build (thrift auto-generated code to accompany the existing gRPC code) and we'll still need host-side scaffold support if I'm understanding this correctly.

On the topic of thrift though @mcastelino - could you add the *.thrift files to your branch so we can see the source for all that auto-generated code?

jodh-intel commented 5 years ago

As I say, switching yet again to a different approach is going to take time. I'm happy to look at it but will need help from @mcastelino still I suspect.

@egernst may have input based on the hope of getting rudimentary agent tracing support landed in the next release (which looks somewhat unlikely from where I'm sitting now). If we are indeed time-pressured, we can work collectively on this. In fact, there's no reason we couldn't land an opencensus custom exporter PR into the agent now and I could then rework this branch yet again to leverage it.

jodh-intel commented 5 years ago

OK. Sitting comfortably? @sboeuf, @mcastelino and myself have just had a chat about juggling eels^W^Wagent tracing. What we're planning to take a staged approach to introducing agent tracing. The plan:

Land #415 with modifications

The current implementation works and provides full tracing (from agent start to agent shutdown). However, the current implementation will theoretically require runtime support (to disable QMP QUIT when tracing is enabled).

What we plan to do is add two new gRPC API's to #415: StartTracing() and StopTracing().

The behaviour will be as follows:

If agent.trace= is set on the guest kernel's command line, agent tracing will start as early as possible and the agent will exit after the sandbox is destroyed (DestroySandbox() called by the runtime.

This is the behaviour on the PR currrently. It is a special case and the only scenario where the agent shuts down and causes the VM to exit (rather than having the runtime control this behaviour).
If agent.trace= is not set on the guest kernel's command line, do not start tracing and do not exit the agent on DestroySandbox().

This is the current behaviour of the agent (no tracing enabled or possible).
If agent.trace= is not set on the guest kernel's command line but the runtime calls StartTracing(), trace all subsequent calls.

This will allow tracing to be controlled dynamically. Clearly, the trace will not include events that occurred before tracing was started.
If StartTracing() has been called and then DestroySandbox() is subsequently called, do not shutdown the agent.
If StopTracing() is called and StartTracing() has already been called, end tracing.
If StartTracing() is called and StartTracing() has already been called, return an error.
If StopTracing() is called and StartTracing() has not been called, return an error.

With the exception of the agent.trace= scenario, these two new calls will allow the runtime to retain control of the VM and also allow the runtime to continue to call the QMP QUIT to destroy the VM.

Create a follow-up PR to rework the agent tracing code to switch to OpenCensus

The advantages of this being:

We don't hard code the tracing implementation into the agent (implementation agnostic tracing / future-proof)

This means we can remove the Jaeger-vendored code.
By using an "exporter", we can remove the need for the vsock->udp service inside the VM (simplification).
Enabled agent tracing when the agent is running as init (a limitation of #415, which requires systemd).
Allows the existing rootfs / initrd images to support tracing without any changes.

Update the runtime to support tracing

To enable full agent tracing (startup to shutdown), the runtime will need to be modified to not call QMP QUIT. This should only happen when agent tracing is enabled but further only when "full" agent tracing is required.
The runtime will also need to be updated to provide a kata-runtime kata-trace [start|end] $id to support dynamic agent tracing.

Replace the host-side vsock->udp proxy script

415 introduces a `vsock-to-udp-server.sh` script that forwards Jaeger spans from the agent in the VM and forwards them on to the Jaeger agent running on the host. When the opencensus PR lands, we'll still need this script (but it will probably need adjusting slightly (port number).

What is required is a service that runs for the duration of the VM to forward traces onto the eventual trace collector. But we already have such a service - the kata-shim. As such, we can probably remove the script and make the shim also handle the trace comms to simplify the architecture further.

Further optimisations

There may be future enhancements we can make to improve the tracing even further.

sboeuf commented 5 years ago

@jodh-intel Thanks for summarizing this, it is a very accurate description! I have two comments:

The runtime will also need to be updated to provide a kata-runtime kata-trace [start|end] $id to support dynamic agent tracing.

I would expect the virtcontainers API (kata_agent.go) to be calling into the new gRPC functions StartTracing() and StopTracing() as part of the sandbox lifecycle. I would not expect that the end user would directly call into the Kata CLI to trigger those. By having this handled by the API internally, and relying on the config options from configuration.toml, this would ensure consistent tracing behavior for both kata-v1 and kata-v2.

What is required is a service that runs for the duration of the VM to forward traces onto the eventual trace collector. But we already have such a service - the kata-shim. As such, we can probably remove the script and make the shim also handle the trace comms to simplify the architecture further.

Having the shim handling this could work, but we need to enable it only on the shim representing the container that represents the sandbox. From the top of my head, I don't recall if this is something easy to identify. If this is not possible (again, needs to be investigated), the alternative solution would be to package a small binary that would be spawn from the runtime. The kata-v2 case is the easiest one since we could handle this directly from the runtime, as a separate go routine.

jodh-intel commented 5 years ago

By having this handled by the API internally, and relying on the config options from configuration.toml, this would ensure consistent tracing behavior for both kata-v1 and kata-v2.

That makes sense, certainly initially. I was imagining it could be useful for an admin to enable tracing on a long-running container at any point, hence the CLI options. But, yes, they could be added at a future date as required.

Having the shim handling this could work

Yep - like you, I haven't yet investigated how practical this actually is.

If this is not possible (again, needs to be investigated), the alternative solution would be to package a small binary that would be spawn from the runtime.

Right. I think we'll all agree we want to minimise such binaries, but this may indeed be a requirement. We could conceivably add a new runtime CLI to handle this such that if the admin runs something like kata-runtime trace listen or similar, the runtime could dynamically load a small module to deal with this. That would be in line with the VMCache approach (https://github.com/kata-containers/documentation/pull/393).

kata-containers / agent

Add opentracing support #322

Current serial behaviour

Startup

Shutdown

Current vsock behaviour

Ideal behaviour

Notes

Challenges

Compromise behaviour

Pros

Cons

Land #415 with modifications

Create a follow-up PR to rework the agent tracing code to switch to OpenCensus

Update the runtime to support tracing

Replace the host-side vsock->udp proxy script

415 introduces a `vsock-to-udp-server.sh` script that forwards Jaeger spans from the agent in the VM and forwards them on to the Jaeger agent running on the host. When the opencensus PR lands, we'll still need this script (but it will probably need adjusting slightly (port number).

Further optimisations

kata-containers / agent

Add opentracing support #322

Current serial behaviour

Startup

Shutdown

Current vsock behaviour

Ideal behaviour

Notes

Challenges

Compromise behaviour

Pros

Cons

Land #415 with modifications

Create a follow-up PR to rework the agent tracing code to switch to OpenCensus

Update the runtime to support tracing

Replace the host-side vsock->udp proxy script

415 introduces a vsock-to-udp-server.sh script that forwards Jaeger spans from the agent in the VM and forwards them on to the Jaeger agent running on the host. When the opencensus PR lands, we'll still need this script (but it will probably need adjusting slightly (port number).

Further optimisations

415 introduces a `vsock-to-udp-server.sh` script that forwards Jaeger spans from the agent in the VM and forwards them on to the Jaeger agent running on the host. When the opencensus PR lands, we'll still need this script (but it will probably need adjusting slightly (port number).