spdfg / elektron

Elektron is a lightweight, power-aware, pluggable Mesos framework that serves as a playground for experimenting with different scheduling policies for ad-hoc jobs run in Docker containers.
GNU General Public License v3.0

calls to rapl/Cap.go#Cap() result in connection leaks. #20

Open pradykaushik opened 4 years ago

pradykaushik commented 4 years ago

The RAPL power-cap code does not close the SSH connection. Closing the SSH session alone is not enough: the underlying connection is deliberately kept alive so that new sessions can be established without redoing the handshake. However, the existing code opens a new connection on every call to the function, so connections accumulate and eventually trigger the `tcp: accept4: too many open connections` error.

There are two ways to go about this.

  1. Establish the connection once and keep the connection object around. Subsequent SSH sessions should be opened on the same connection, and the connection should be closed only when it is no longer needed.
  2. Close the SSH connection (`defer connection.Close()`) before returning from the function.
ridv commented 4 years ago

This was more or less a hacky way to get this done quickly. I'd always intended to get back and write a daemon which runs on the worker nodes and receives messages from Elektron and sends back an acknowledgement.

Maybe I'll assign this to myself if I have some time. I'll create a new repo since it'll be a new binary.

So instead of having an SSH process, the daemon will receive a payload and perform the change to RAPL, replying with a confirmation the message was received and everything went OK.

I think we should use gRPC for this.

pradykaushik commented 4 years ago

Makes sense and sounds good. A couple of things come to mind:

  1. It is possible that the power-capping algorithm sends a large number of capping requests, but updating the power cap in quick succession is impractical. We would need a way to queue up requests. The capper daemon could then run as a cron job on a configured schedule and pick the most recent power-cap value. Whether we queue up all requests or just keep the latest value can be left to the implementation.
  2. We can have the capper as a submodule.
ridv commented 4 years ago

I've started working on this so I'll assign this to myself. I started out with a gRPC implementation, which I then realized was total overkill. I've switched to just making the daemon a server that listens on a port (9090 by default) for incoming JSON payloads which contain the percentage.

May add some weak auth if time allows.

Re: the rate limiting: this is definitely something we should consider. Though I agree it is impractical to change caps quickly, from the point of view of running experiments I would be cautious about introducing anything that is outside the control of the user running the experiment.

I'd err on the side of letting the user control how often they send a capping request in academic exercises.

pradykaushik commented 4 years ago

Makes sense; I agree with you on this. For experimentation purposes we should give the user more freedom. It is also possible that the user is testing the efficiency of RAPL and how quickly it responds to changes in the power cap.

pradykaushik commented 4 years ago

Pull request #21 merged. Will close this once the codebase has been refactored to send payloads to rapl-daemon instead of opening SSH connections.