RFC: supporting third-party network stack such as TLDK

amysaq2023 commented 1 year ago

Description

As an application kernel, gVisor provides developers with the opportunity to build a lightweight pod-level kernel and allows for more agile development and deployment than the host kernel. To maximize the advantage of gVisor's flexibility, we propose an enhancement to its network module: a solution to support TLDK for better performance, and also want to further discuss about whether there is a more general way to support more third-party network stack such as Smoltcp, F-Stack etc.

Our Implementation to support TLDK

Since cloud-native applications are highly sensitive to network performance, we have expanded gVisor to support a high-performance user-level network stack called TLDK. This has resulted in significantly better network I/O performance in certain scenarios. To support TLDK network stack, we need to enable CGO in gVisor , as TLDK is currently implemented in C. We then initialized the TLDK stack through a cgo wrapper, based on the network type specified in the container boot config, and set up the TLDK socket opts interface in gVisor. Later network syscalls used gVisor's TLDK socket ops and invoked TLDK socket operation implementation through the cgo wrapper. One of the key factors for gVisor's significant performance improvement with TLDK is that we support device (SR-IOV) passthrough with TLDK. This not only enhances network I/O performance but also reduces the attack surface on the host kernel. The original gVisor netstack cannot support drivers for device passthrough, but TLDK can work with DPDK as the frontend driver for device passthrough. Moreover, we have provided a proper thread model and enabled an interrupt mode to avoid busy polling in typical DPDK scenarios. In this mode, the I/O thread wakes up when an event is raised by the host kernel upon receiving a packet from the NIC, and starts to read all available packets in DMA. It then wakes up the corresponding goroutine to receive the packets. This approach ensures efficient use of CPU resources, while avoiding unnecessary busy polling that can negatively impact application performance. 未命名绘图 (7)

Performance with TLDK

We compared runc and gVisor with TLDK, and the results show significant performance improvements in network I/O sensitive scenarios:


            Redis SET     Redis GET
runc:       335709 RPS    890301 RPS
runsc/TLDK: 617306 RPS    1391876 RPS

Further Discussion

While supporting TLDK, we had to modify the gVisor code to support another network stack socket ops, which incurred significant development costs. Therefore, in addition to proposing the support for TLDK in gVisor, we would like to open a discussion about whether there is a more general way for users to choose a third-party network stack without modifying gVisor. One possible solution we are considering is exposing the network interface from the API to the ABI and building third-party network stacks as plugins that fit with these ABIs.

We would appreciate any insights or feedback from the community on this proposal and the further discussion matter and are open to exploring other potential solutions. Thanks.

Is this feature related to a specific bug?

No.

Do you have a specific solution in mind?

As decribed in 'Description' section.

kevinGC commented 1 year ago

Very interested in making this happen. Thinking of this as separate sub-issues:

General third-party stack support - I think this is great. The largest issue I see is API stability -- we develop and build gVisor+netstack as one big binary, so there's no defined API for network stacks. The primary concern for me is getting stuck on an API with problems. From experience I can tell you that we've changed that API within gVisor many times -- that flexibility is useful and finding a way to keep it is ideal. I wonder whether we can do API versioning (think Go modules-esque) so that stable APIs exist, but don't hamper development.
CGO in gVisor - gVisor/runsc can't introduce CGO as a dependency for security reasons. This will have to be explicitly turned on by plugin users.
TLDK performance - With those performance numbers I have a ton of questions. Too many for this post, but generally I'm curious whether your stack is portable or specifically tailored to your environment, e.g:
1. Can you have multiple pods on a node? Normally DPDK steals the entire NIC, but maybe you use SR-IOV to create multiple NICs.
2. Does SR-IOV tie you to particular hardware NICs? If I understand correctly it's not fully portable, which could create problems if different nodes have different NICs.
3. If this is running in Kubernetes, what network plugin (CNI) is used to set everything up?
4. Do non-gVisor pods run in the same environment?

Please let me know what you think. Also happy to discuss your specific setup in email/chat/wherever if that's easier.

tanjianfeng commented 1 year ago

@kevinGC Among those sub-issues, the core one is CGO.

gVisor/runsc can't introduce CGO as a dependency for security reasons.

Does CGO interface introduce security issue? In other words, if we introduce a rust-based component (also memory-safe) in sentry, does that break the security?
gVisor itself is a defense-in-depth solution, with the host kernel jailers (seccomp/cgroup/namespace/capabilites/...) as the last line of defense. Can we tradeoff sentry security for perforance? An example in hand (may be not readlly proper), directfs sacrifice the security by allowing open() in sandbox process.

This will have to be explicitly turned on by plugin users.

If we understand it correctly, pure go needs the decision made at compile time. Do we have a conditional compile mechanism in gvisor bazel?

amysaq2023 commented 1 year ago

@kevinGC Thanks for your quick response and we are happy to discuss more on these sub issues you have .

To answer sub-issue 3, in short, our stack can be portable to other environment and detailed reasons are below:

Can you have multiple pods on a node? Normally DPDK steals the entire NIC, but maybe you use SR-IOV to create multiple NICs.

Yes, we can support multiple pods on a node and yes, it is supported by using SR-IOV which can create multiple NICs.

Does SR-IOV tie you to particular hardware NICs? If I understand correctly it's not fully portable, which could create problems if different nodes have different NICs.

In our current implement for gVisor with TLDK+DPDK, it does not have requirements on NIC. As long as NIC can be used as virtio backend device, our solution to support TLDK can work on it.

If this is running in Kubernetes, what network plugin (CNI) is used to set everything up?

We do not use any CNI to set TLDK stack up. Instead, we invoke CGO wrapper to initialize TLDK stack during gVisor doing StartRoot().

Do non-gVisor pods run in the same environment?

Yes, non-gVisor pods can run with gVisor with TLDK pods in the same environment.

kevinGC commented 1 year ago

Does CGO interface introduce security issue? In other words, if we introduce a rust-based component (also memory-safe) in sentry, does that break the security?

We've never discussed the CGO interface on its own, i.e. with something other than C being called into. But my first take is that the runsc binary should always be flagged as no CGO. I think a good solution would be to leave runsc as pure Go, and have this plugin system usable by defining a different go_binary target. That way we keep the high level of security, and users who want to make the tradeoff just need to write their own BUILD target. So ideally you'd have your own target looking something like:

go_binary(
    name = "runsc-tldk",
    srcs = ["main.go"],
    pure = False,
    visibility = [
        "//visibility:public",
    ],
    deps = [
        "@dev_gvisor//runsc/cli",
        "@dev_gvisor//runsc/version",
        "//my/codebase/tldk:runsc_plugin",
    ],
)

This yields a few benefits:

gVisor remains CGO-free
Plugin network stacks can be developed independently of upstream gVisor
By consuming gVisor as a bazel dependency, you would pin to a specific version of gVisor. This may be useful when gVisor changes to avoid breaking API changes

@tanjianfeng what do you think? Since you already have a third-party network stack, we want to hear what setup would work for you. If you have specific ideas in mind, we'd love to hear them. Once we have some agreement here, we can get others onboard and actually make the changes.

gVisor itself is a defense-in-depth solution, with the host kernel jailers (seccomp/cgroup/namespace/capabilites/...) as the last line of defense. Can we tradeoff sentry security for perforance?

Yes. Generally such tradeoffs are implemented but off by default. For example, raw sockets are implemented because people need tools like tcpdump, but must be enabled via a flag. Since CGO introduces a security issue just by being present in the binary, we shouldn't compile it in by default.

@amysaq2023 that's super impressive that you're getting the benefits of kernel bypass without many of the traditional issues (e.g. machines being single-app only). A few more questions (if you can answer):

Are the nodes in that Redis benchmark VMs or actual machines? My understanding is that the performance boost mostly comes from cutting out the host network stack, but if these are VMs then I'd expect the host machine's stack to slow things down.
Did you consider using XDP instead of DPDK? I wonder how performant it would be relative to DPDK, and given that it's probably easier to use.
Generally, do you think it's DPDK or TLDK that provide the bulk of the performance improvement? I'd like to do some experimenting of my own, and am wondering whether I'm more likely to see performance differences by hooking kernel bypass up to netstack or TLDK up to an AF_PACKET socket.

amysaq2023 commented 1 year ago

@kevinGC

what do you think? Since you already have a third-party network stack, we want to hear what setup would work for you.

Thank you for your insightful suggestion on how to support TLDK while maintaining the high level of security in gVisor. We have an additional proposal to consider: First, we propose abstracting a set of APIs for gVisor's network stack. This way, third-party network stacks will only need to implement these APIs in order to be compatible with gVisor. Next, we will compile the third-party network stack with gVisor APIs implemented as an object file. This approach ensures seamless integration between gVisor and the third-party network stack. Most importantly, gVisor needs to support a method to invoke these APIs within the network stack binary. Currently, we are considering options such as using go plugins or implementing something similar. We feel like that this solution will more thoroughly decouple the development of third-party network stacks from gVisor. Additionally, supporting binary plugins may have potential benefits for other modules, like the filesystem, enabling support for third-party implementations in the future.

Are the nodes in that Redis benchmark VMs or actual machines? My understanding is that the performance boost mostly comes from cutting out the host network stack, but if these are VMs then I'd expect the host machine's stack to slow things down.

The nodes in the Redis benchmark are actual physical machines.

Did you consider using XDP instead of DPDK? I wonder how performant it would be relative to DPDK, and given that it's probably easier to use. Generally, do you think it's DPDK or TLDK that provide the bulk of the performance improvement? I'd like to do some experimenting of my own, and am wondering whether I'm more likely to see performance differences by hooking kernel bypass up to netstack or TLDK up to an AF_PACKET socket.

DPDK not only functions as a driver, but also offers various performance enhancements. For instance, it utilizes rte_ring for efficient communication with hardware and introduces its own memory management mechanisms with mbuf and mempool. Moreover, DPDK operates entirely at the user-level, completely detached from the host kernel, unlike XDP which still relies on hooking into the host kernel. Therefore, the performance enhancement achieved with TLDK+DPDK goes beyond just kernel bypass, benefiting from the improvements introduced by both TLDK and DPDK.

kevinGC commented 1 year ago

First, we propose abstracting a set of APIs for gVisor's network stack. This way, third-party network stacks will only need to implement these APIs in order to be compatible with gVisor.

Agreed! Maybe you could send a PR with the interface you use now to work with TLDK -- that would be a really good starting point. Much better than trying to come up with an arbitrary API, given that you've got this running already.

Next, we will compile the third-party network stack with gVisor APIs implemented as an object file. This approach ensures seamless integration between gVisor and the third-party network stack.

Right, if I understand correctly the build process for cgo requires building the object file first, then writing a Go layer around it that can call into it using the tools provided by import "C".

Most importantly, gVisor needs to support a method to invoke these APIs within the network stack binary. Currently, we are considering options such as using go plugins or implementing something similar.

Can you help me understand why we couldn't just build a static binary containing gVisor and the third party network stack? As part of the API we talked about above, gVisor can support registering third party netstacks. So the third party stack would contain an implementation of the API (socket ops like in your diagram), the cgo wrapper, the third party stack itself, and an init function that registers the stack to be used instead of netstack:

import "pkg/sentry/socket"

func init() {
  socket.RegisterThirdPartyProvider(linux.AF_INET, &tldkProvider)
  // etc..
}

This keeps everything building statically and avoids issues introduced by go plugins as far as I can tell, but maybe I'm missing something.

kevinGC commented 1 year ago

Something I should've been more clear about regarding the static binary idea: I'm suggesting that the existing, cgo-free runsc target remain as-is, and that we support third party network stacks by having multiple BUILD targets. So the existing target will look mostly (or entirely) the same as it is today:

go_binary(
    name = "runsc",
    srcs = ["main.go"],
    pure = True,
    tags = ["staging"],
    visibility = [
        "//visibility:public",
    ],
    x_defs = {"gvisor.dev/gvisor/runsc/version.version": "{STABLE_VERSION}"},
    deps = [
        "//runsc/cli",
        "//runsc/version",
    ],
)

And building runsc with a third party network stack requires adding another target (which could be in the same BUILD file, a different one, or even a separate bazel project):

go_binary(
    name = "runsc_tldk",
    srcs = ["main_tldk.go"],
    pure = False,
    tags = ["staging"],
    visibility = [
        "//visibility:public",
    ],
    x_defs = {"gvisor.dev/gvisor/runsc/version.version": "{STABLE_VERSION}"},
    deps = [
        "//runsc/cli",
        "//runsc/version",
        "//othernetstacks/tldk:tldk_provider",
    ],
)

Both go_binary targets are static, avoid go plugins and its headaches, and the default runsc binary remains cgo-free.

amysaq2023 commented 1 year ago

@kevinGC Great! We are fully onboard with the idea of introducing an additional target to support third-party networking stack. To kick things off, we will begin by preparing a PR that encompasses gVisor APIs for networking modules, along with our implementation of these APIs in TLDK for seamless integration with gVisor. We sincerely appreciate all the valuable insights shared throughout this discussion thread.

kevinGC commented 1 year ago

Just want to check on this and see if there's anything I can do help it along.

amysaq2023 commented 1 year ago

Hi Kevin, thanks for checking out. We have finished porting our modification of supporting TLDKv2 to current gVisor master branch and are currently working on refactoring some implementation to make it more general. I think we are on the right track, just needs a little more time due to the amount of code. If everything goes well, we will send out the patch next week.

kevinGC commented 1 year ago

Hey, back to see whether there's anything I can do to help here. We're really excited to try this out, benchmark, and see the effects on gVisor networking.

amysaq2023 commented 1 year ago

@kevinGC Thanks for reaching out! Sorry for the delay caused by the National Day holiday. We have just create a PR (https://github.com/google/gvisor/pull/9551) that introduces interface templates to support external network stack. Specific implemenation of these stack and socket operations will be provided in subsequent commits. We would greatly appreciate any suggestions regarding the current interface setup. For now, we are actively working on decoupling specific TLDK stack supports from sentry and making it more adaptable to general third-party stacks.

kevinGC commented 1 year ago

Thanks a TON. Just responded over there, but want to ask about testing here.

We'll want to test third party netstacks. I'm thinking that what you're contributing will only be testable if we have a similar environment (DPDK and such). Is that correct?

amysaq2023 commented 1 year ago

Thanks a TON. Just responded over there, but want to ask about testing here.

We'll want to test third party netstacks. I'm thinking that what you're contributing will only be testable if we have a similar environment (DPDK and such). Is that correct?

Hi Kevin, happy to hear that you are exploring third-party netstack testing too. In the current version we're working on, once we complete the implementation of all the necessary glue layers, we will compile the TLDK repository within it. (It will become clearer when we share the socket ops glue layer for the plugin netstack in the next commit.) With this binary, you can easily test it by 'docker run' to start a container, just as original runsc with native netstack does.

amysaq2023 commented 11 months ago

Hi @kevinGC , we have recently pushed our implementation of supporting plugin network stack into gVisor. You can now compile the runsc binary with support for the plugin stack by executing the following command: bazel build runsc:runsc-plugin-stack. This build process will seamlessly incorporate our sample third-party network stack, TLDK. To activate the plugin stack, simply adjust the runtimeArgs to include --network="plugin". This enables users to switch to the plugin stack for their networking needs.

We has conducted performance testing of gVisor when utilizing the plugin stack. We chose Redis as our benchmark and test the network performance under various conditions: 1. within runc; 2. within runsc with netstack on KVM; 3. within runsc with the plugin on KVM containers. The results are quite promising—the performance of runsc with the plugin stack closely rivals that of runc, delivering double RPS compared to runsc with netstack. We have documented the detailed performance metrics in our commit log for your review. The current performance test is being conducted with the software-implemented virtio-net backend, which is less optimized. Performance can be further improved if using VF (SR-IOV) passthrough.

Thanks for your continued support and patience throughout this development process. Your feedback on our design and implementation is greatly welcomed and appreciated.

amysaq2023 commented 11 months ago

Besides, we have encountered a specific issue after integrating cgo to support the plugin stack, which we'd like to bring to the table for discussion. The problem arises when the mmap trap mechanism, utilized on the KVM platform, leads to a container panic following the introduction of cgo. The root of the issue is traced back to the _cgo_sys_thread_start function in Go runtime. Within this function, all signals are set to be blocked. The process then advances to _cgo_try_pthread_create, where an mmap call is made. This call is trapped by the KVM platform's seccomp rules for mmap.

When the host kernel processes the trapped syscall, it checks whether the SIGSYS signal is blocked. If it finds SIGSYS blocked, it resets the signal handler to its default address, thereby overwriting the handler we established during KVM initialization. Consequently, when the host kernel attempts to handle SIGSYS, it encounters a 0x0 signal handler, leading to the default action for SIGSYS—coredump—which results in container panic.

bpftrace result at force_sig_info_to_task:

As a temporary workaround, we have reverted the KVM mmap trap mechanism. However, this solution is not intended for merging. We are actively seeking a more appropriate fix for this issue and would highly appreciate any suggestions, ideas, or discussions on how to resolve this problem.

amysaq2023 commented 10 months ago

@kevinGC Happy New Year :) Just reach out and check whether we have any comments on https://github.com/google/gvisor/pull/9551 ?

EtiennePerot commented 10 months ago

(@kevinGC has been out sick for a while, sorry for the delay.)

Regarding the cgo vs KVM platform problem: One solution would be to add conditional compilation tags to remove the KVM platform from the runsc cgo build. This would of course prevent using DPDK with the KVM platform, but it would at least unblock this pull request, and create a good reason to support a new high-performance open-source platform that is compatible with both cgo and DPDK.

amysaq2023 commented 9 months ago

(@kevinGC has been out sick for a while, sorry for the delay.)

Sorry to hear that. There is no rush on reviewing this PR and please ignore the PING sent yesterday. (I did not realize that Kevin was out sick at that moment. That PING was a regular check-out only. I am really sorry about that.) Wish Kevin all the best and get well soon.

Regarding the cgo vs KVM platform problem: One solution would be to add conditional compilation tags to remove the KVM platform from the runsc cgo build. This would of course prevent using DPDK with the KVM platform, but it would at least unblock this pull request, and create a good reason to support a new high-performance open-source platform that is compatible with both cgo and DPDK.

Yes, we agree that we can currently remove kvm platform from runsc cgo build and also look forward a new high-performance platform. We will work on addressing this comment and make sure that this pull request will not be blocked by the platform support.

avagin commented 9 months ago

@amysaq2023 I think I figured out how we can resolve the kvm problem. The mmap hook is needed to map sentry memory regions into the guest vm. However, we can streamline the process by mapping the entire sentry address space to the VM during its initialization. Here is a draft patch: https://github.com/google/gvisor/commit/c2ab4cb4d9daf501c09ae3ac3a624a78825d8c8d.

EtiennePerot commented 9 months ago

@avagin's solution may address the KVM compatibility issue, but it should not block work on this. It is OK to disable KVM in the cgo build for the time being. This decouples the work of integrating plugin network stacks inside gVisor from the work of making KVM work with cgo.

amysaq2023 commented 9 months ago

@amysaq2023 I think I figured out how we can resolve the kvm problem. The mmap hook is needed to map sentry memory regions into the guest vm. However, we can streamline the process by mapping the entire sentry address space to the VM during its initialization. Here is a draft patch: c2ab4cb.

Hi Andrei, thanks for your proposal. We have tried this PoC locally and found that it solves runsc binary with cgo working with KVM; however, it does not solve our issue here: plugin stack will use its own memory address space which is different from sentry. When we tested this PoC with plugin stack, it stuck at initializing plugin stack. From recording flamegraph, we found the sandbox process was keeping mmap/munmap. Any thought on this issue? Thanks.

avagin commented 9 months ago

@amysaq2023 could you run runsc under strace (strace -fo strace.log -s 1024 ./runsc ... ) and share strace.log?

amysaq2023 commented 9 months ago

@amysaq2023 could you run runsc under strace (strace -fo strace.log -s 1024 ./runsc ... ) and share strace.log?

Hi @avagin, after further debugging, we find that with a minor adjustment, we've successfully made PoC work! This adjustment is a temporary work-around. We plan to proceed with refining the plugin stack's memory layout to ensure full compatibility with the PoC. Thank you immensely for your support.

avagin commented 7 months ago

@amysaq2023 FYI: Here is a small linux kernel change (will be in 6.9) that reduces a memory overhead when an entire sentry address space is mapped into VM: https://github.com/torvalds/linux/commit/a364c014a2c1ad6e011bc5fdb8afb9d4ba316956

avagin commented 1 month ago

https://github.com/google/gvisor/pull/10954 starts running a minimal set of tests on buildkite. We need to add more tests. In ideal case, we need to run image tests and network specific tests.

https://github.com/alipay/tldk/pull/4 is needed to be merged, otherwise tldk fails to build in the gvisor docker build container.

google / gvisor