scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

Run long background ssh commands on host #7781

Open fruch opened 1 week ago

fruch commented 1 week ago

We have multiple reports of this issue https://github.com/scylladb/scylladb/issues/14004

which basically boils down to the node being stressed out during a long operation that we don't automatically retry (because logically we can't retry those on disconnect, like decommission/cleanup and such)

we observed this issue happening all the time with other commands as well, but those commands get retried, so we don't notice it.

the suggestion is to add an implementation in the remoter code that can send the command off to run on the host, poll it from time to time, and then fetch back the log of the operation

each command in this flow (besides the initial call) can be retried, which lowers the chances of being hit by this issue to practically zero (i.e. only the one quick command that starts the operation would run without retries, and everything else would have them)
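A minimal sketch of what that could look like, to make the retry structure concrete (with_retries is a hypothetical helper standing in for whatever retry wrapper we'd use; remoter.run and its ignore_status parameter are assumed per the current fabric-based remoter):

import time

def run_long_command(remoter, cmd, timeout, poll_interval=10):
    # the one step that cannot be retried: starting the command in the
    # background (retrying it on disconnect would start the operation twice)
    pid = remoter.run(f'nohup {cmd} > cmd_output.log 2>&1 & echo "$!"').stdout.strip()

    deadline = time.time() + timeout
    while time.time() < deadline:
        # each poll is a short command, so it is safe to retry on ssh failure
        result = with_retries(lambda: remoter.run(f'kill -0 {pid}', ignore_status=True))
        if not result.ok:  # the process has exited
            break
        time.sleep(poll_interval)

    # fetching the output back is short and retriable as well
    return with_retries(lambda: remoter.run('cat cmd_output.log')).stdout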

another option would be to try to protect sshd from those spikes and reserve resources for it, but that's not completely guaranteed either, and might affect Scylla (since those resources would not be available for it)

Ref: https://github.com/scylladb/scylladb/issues/14004

fruch commented 1 week ago

@roydahan @soyacz @vponomaryov, what do you think about this direction?

roydahan commented 1 week ago

So IIUC, you suggest that the remoter will have an "async" mode and polling is your direction of implementing it?

fruch commented 1 week ago

> So IIUC, you suggest that the remoter will have an "async" mode and polling is your direction of implementing it?

yes, something like the following:

pid = remoter.run(f'{cmd} > cmd_output.log 2>&1 ; echo "$!"').stdout
deadline = time.time() + timeout
while time.time() < deadline:
    # check pid still alive
    sleep(10)
# read cmd_output.log back and return it

roydahan commented 1 week ago

In general I'm fine with it, I'm just trying to think if we can change something in the tools / long-running commands themselves so that they can somehow send us back the result / write it into a file / socket / whatever would make it more integrated between the tools.

fruch commented 1 week ago

> In general I'm fine with it, I'm just trying to think if we can change something in the tools / long-running commands themselves so that they can somehow send us back the result / write it into a file / socket / whatever would make it more integrated between the tools.

well yes, it would be best if long-running commands could be listed inside Scylla and tracked via APIs, and not just via a command-line tool.

I think there was some work related to it done by @Deexie with the task manager; you can see https://github.com/scylladb/scylla-dtest/pull/2992 for how it's being used
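For reference, polling such an operation through the task manager REST API could look roughly like this (the endpoint path, the 'state' values, and the response shape here are assumptions for illustration; the dtest PR above shows the real usage):

import time

import requests

def wait_for_task(node_ip, task_id, poll_interval=10):
    # 10000 is Scylla's REST API port; the path and the 'state' values are
    # assumptions based on the task manager work referenced above
    url = f'http://{node_ip}:10000/task_manager/task_status/{task_id}'
    while True:
        status = requests.get(url, timeout=30).json()
        if status.get('state') not in ('created', 'running'):
            return status
        time.sleep(poll_interval)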

soyacz commented 6 days ago

> > So IIUC, you suggest that the remoter will have an "async" mode and polling is your direction of implementing it?

> yes, something like the following:

> pid = remoter.run(f'{cmd} > cmd_output.log 2>&1 ; echo "$!"').stdout
> deadline = time.time() + timeout
> while time.time() < deadline:
>     # check pid still alive
>     sleep(10)
> # read cmd_output.log back and return it

need to run it with nohup I believe (and "&" at the end; possibly the echo "$!" is not needed).
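For illustration, the corrected start line could look like this (a sketch; whether echo "$!" stays depends on how the PID is tracked afterwards, e.g. pgrep on the command name would avoid it):

# "$!" expands to the PID of the most recent background job, so "&" is
# required for it to refer to the command we just started
pid = remoter.run(f'nohup {cmd} > cmd_output.log 2>&1 & echo "$!"').stdout.strip()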

An alternative approach is to create a simple HTTP server on the SCT side and make the remote node send results (output log file and status) using curl. This server could also be used for different cases (like monitoring coredumps, where the db node would send info to SCT that a coredump happened). One thing I'm not yet sure about is how it would work when running SCT locally with a cloud test env (but somehow syslog-ng works).
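A bare-bones sketch of that idea on the SCT side (the port, URL path, and payload shape are arbitrary choices for illustration; in practice the server would run in a background thread):

import http.server
import json

results = []

class ResultHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # the node would report back with something like:
        #   long_cmd > cmd_output.log 2>&1; curl -X POST --data "{\"status\": $?}" http://<sct-runner>:8765/
        length = int(self.headers.get('Content-Length', 0))
        results.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

http.server.HTTPServer(('', 8765), ResultHandler).serve_forever()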

vponomaryov commented 5 days ago

> @roydahan @soyacz @vponomaryov, what do you think about this direction?

> > So IIUC, you suggest that the remoter will have an "async" mode and polling is your direction of implementing it?

> yes, something like the following:

> pid = remoter.run(f'{cmd} > cmd_output.log 2>&1 ; echo "$!"').stdout
> deadline = time.time() + timeout
> while time.time() < deadline:
>     # check pid still alive
>     sleep(10)
> # read cmd_output.log back and return it

Sounds very useful. A similar idea was implemented for K8S with its dynamic pods, re-reading pod logs in case of connection breakage, and it proved really helpful.

Doing it the callback way in the current case would be really great - so we don't lose time to the polling interval size...

fruch commented 5 days ago

and now there's a suggested implementation, which I haven't yet tested inside a longevity run: https://github.com/scylladb/scylla-cluster-tests/pull/7834

fruch commented 5 days ago

> > > So IIUC, you suggest that the remoter will have an "async" mode and polling is your direction of implementing it?

> > yes, something like the following:

> > pid = remoter.run(f'{cmd} > cmd_output.log 2>&1 ; echo "$!"').stdout
> > deadline = time.time() + timeout
> > while time.time() < deadline:
> >     # check pid still alive
> >     sleep(10)
> > # read cmd_output.log back and return it

> need to run it with nohup I believe (and "&" at the end; possibly the echo "$!" is not needed).

> An alternative approach is to create a simple HTTP server on the SCT side and make the remote node send results (output log file and status) using curl. This server could also be used for different cases (like monitoring coredumps, where the db node would send info to SCT that a coredump happened). One thing I'm not yet sure about is how it would work when running SCT locally with a cloud test env (but somehow syslog-ng works).

syslog-ng works with reverse tunneling

I think an HTTP server is over-complicating the whole idea

soyacz commented 4 days ago

> syslog-ng works with reverse tunneling

> I think an HTTP server is over-complicating the whole idea

yes, it complicates the idea, but it gives SCT more abilities, not only for this case - e.g. a script monitoring for coredump presence and sending a notification to SCT when one appears. I understand it's a larger effort, so I'm ok with your proposal.