hashicorp / go-plugin

Golang plugin system over RPC.
Mozilla Public License 2.0
5.25k stars 450 forks source link

Plugin stuck in broken state waiting for RPC address #306

Open agaudreault opened 6 months ago

agaudreault commented 6 months ago

When trying to start a plugin, it sometimes ends up in a broken state and the only way to unblock is to restart the host.

Main function trying to start the plugin

        t.client[pluginName] = goPlugin.NewClient(&goPlugin.ClientConfig{
            HandshakeConfig: handshakeConfig,
            Plugins:         pluginMap,
            Cmd:             exec.Command(pluginPath, args...),
            Managed:         true,
        })
        rpcClient, err := t.client[pluginName].Client()
        if err != nil {
            // getting log: waiting for RPC address
            // then error: timeout while waiting for plugin to start
            return nil, fmt.Errorf("unable to get plugin client (%s): %w", pluginName, err)
        }

Function trying to get the plugin if it exist, and somewhat recover on error

    client, err := t.client[pluginName].Client()
    if err != nil {
        // Getting error: exec: Stdout already set
        // cannot run kill() because it requires a valid client
        return nil, fmt.Errorf("unable to get plugin client (%s) for ping: %w", pluginName, err)
    }
    if err := client.Ping(); err != nil {
        t.client[pluginName].Kill()
        t.client[pluginName] = nil
        return nil, fmt.Errorf("could not ping plugin will cleanup process so we can restart it next reconcile (%w)", err)
    }

Everytime we create a new client (restart the program), there is a new process that is created in the following state

USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
agaudreault      99885   0.0  0.0 410089056     48   ??  UE    1:58pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      98102   0.0  0.0 410089056     48   ??  UE    1:56pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      96380   0.0  0.0 410089056     48   ??  UE    1:55pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      94682   0.0  0.0 410089056     48   ??  UE    1:54pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      92807   0.0  0.0 410089056     48   ??  UE    1:52pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      91001   0.0  0.0 410089056     48   ??  UE    1:51pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      89169   0.0  0.0 410089056     48   ??  UE    1:50pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin
agaudreault      87411   0.0  0.0 410089056     48   ??  UE    1:48pm   0:00.00 /Users/agaudreault/plugin-bin/my-plugin

The process is running locally on Mac

    System Software Overview:
      System Version: macOS 14.4.1 (23E224)
      Kernel Version: Darwin 23.4.0
      Chip: Apple M1 Max

Running on version

github.com/hashicorp/go-plugin v1.6.0
joshwizzy commented 6 months ago

The default value of StartTimeout is set 1 minute when not explicitly set it ClientConfig What does your plugin do before calling goplugin.Serve?

agaudreault commented 6 months ago

@joshwizzy not much to be honest. The plugin works, when it starts correctly. You can find the code here: https://github.com/agaudreault/argo-rollouts/blob/13e8829a6691e27743f00a4ac1ec5163e51251c1/test/cmd/step-plugin-e2e/main.go.

I found out that deleting the executed binary solved the issue as well. Easier than restarting when running locally!

joshwizzy commented 5 months ago

@agaudreault Would you mind adding an issue to the argo-rollouts repo with instructions on how to reproduce the broken state. It might be easier for folks who are interested/familiar with the argo-rollouts plugin integration to investigate. I am unfamiliar with argo-rollouts but this seems to be part of an initiative to add Step plugins to the project. Most people here may not have the required context to help out.

agaudreault commented 5 months ago

@joshwizzy I am the one implementing the step plugin with other maintainers hence why I'm reaching out about the behavior. This is not an argo rollout issue, but you can see the code and perhaps find out if something is wrong with how the plugin is started. It's quite simple so building a repro app should be quite easy with access to existing code.

The steps are

  1. Start an application
  2. Let the application start the plugin
  3. Kill the application process
  4. Start the application
  5. Plugin cannot start

It seems like the go-plugin does not support to be re-executed if it is not shutdown gracefully

joshwizzy commented 5 months ago

@agaudreault I am unable to reproduce this using the basic greeter example. I followed these steps:

  1. comment out this line so the plugin is not killed when the client exits defer client.Kill()

  2. build and trun the example app

 ./basic   
2024-04-23T15:02:57.225+0300 [DEBUG] plugin: starting plugin: path=./plugin/greeter args=[./plugin/greeter]
2024-04-23T15:02:57.227+0300 [DEBUG] plugin: plugin started: path=./plugin/greeter pid=86088
2024-04-23T15:02:57.227+0300 [DEBUG] plugin: waiting for RPC address: plugin=./plugin/greeter
2024-04-23T15:02:57.237+0300 [DEBUG] plugin.greeter: message from plugin: foo=bar timestamp=2024-04-23T15:02:57.237+0300
2024-04-23T15:02:57.237+0300 [DEBUG] plugin.greeter: plugin address: address=/var/folders/sl/h6pntx3j05l0fk99605hzxy40000gn/T/plugin918908959 network=unix timestamp=2024-04-23T15:02:57.237+0300
2024-04-23T15:02:57.237+0300 [DEBUG] plugin: using plugin: version=1
2024-04-23T15:02:57.239+0300 [DEBUG] plugin.greeter: message from GreeterHello.Greet: timestamp=2024-04-23T15:02:57.239+0300
Hello!

The plugin process stays running after the client exits

ps ax |grep greeter
86088 s000  S      0:00.01 ./plugin/greeter
  1. Run the client again successfully
    ./basic            
    2024-04-23T15:04:38.176+0300 [DEBUG] plugin: starting plugin: path=./plugin/greeter args=[./plugin/greeter]
    2024-04-23T15:04:38.178+0300 [DEBUG] plugin: plugin started: path=./plugin/greeter pid=86897
    2024-04-23T15:04:38.178+0300 [DEBUG] plugin: waiting for RPC address: plugin=./plugin/greeter
    2024-04-23T15:04:38.191+0300 [DEBUG] plugin.greeter: message from plugin: foo=bar timestamp=2024-04-23T15:04:38.191+0300
    2024-04-23T15:04:38.191+0300 [DEBUG] plugin.greeter: plugin address: address=/var/folders/sl/h6pntx3j05l0fk99605hzxy40000gn/T/plugin523748303 network=unix timestamp=2024-04-23T15:04:38.191+0300
    2024-04-23T15:04:38.191+0300 [DEBUG] plugin: using plugin: version=1
    2024-04-23T15:04:38.193+0300 [DEBUG] plugin.greeter: message from GreeterHello.Greet: timestamp=2024-04-23T15:04:38.193+0300
    Hello!

Two instances of the plugin are now running

ps ax |grep greeter
86088 s000  S      0:00.01 ./plugin/greeter
86897 s000  S      0:00.02 ./plugin/greeter
joshwizzy commented 5 months ago

@agaudreault Seeing as you are running on an M1 mac Could you try running the app with this setting and see if it makes a difference export GODEBUG=asyncpreemptoff=1

https://github.com/hashicorp/terraform-provider-aws/issues/20274 https://github.com/hashicorp/terraform-provider-aws/issues/16073 https://yaleman.org/post/2021/2021-01-01-apple-m1-terraform-and-golang/