rwth-i6 / sisyphus

A Workflow Manager in Python
Mozilla Public License 2.0

Add timeout to SSH queue scan command #191

Closed Icemole closed 5 months ago

Icemole commented 6 months ago

If the SSH key isn't properly set, sisyphus will schedule stacked queue scan commands which will ask for a password. This will lead to subprocess buildup, which eventually crashes the manager with a Too many open files error.

I've also changed the hardcoded 30-second interval to one depending on gs.WAIT_PERIOD_BETWEEN_CHECKS, which I find makes more sense. <- Edit: not anymore, see below.

Fix #190.

albertz commented 6 months ago

If the SSH key isn't properly set, sisyphus will schedule stacked queue scan commands which will ask for a password

So you add "-o", "ConnectTimeout" to the SSH command? That seems like a bad workaround to me for this problem.

What about -o BatchMode=yes?

Or -o PasswordAuthentication=no?
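
For illustration, such an SSH invocation could look roughly like the sketch below. The host and queue-scan command are placeholders, not the actual values sisyphus uses:

import subprocess

# Placeholder values; the real host and queue-scan command come from the engine config.
gateway = "cluster-gateway"
queue_scan_cmd = ["squeue", "--me"]

ssh_cmd = [
    "ssh",
    # Fail immediately instead of interactively asking for a password:
    "-o", "BatchMode=yes",
    # Alternatively (or additionally), disable password authentication explicitly:
    "-o", "PasswordAuthentication=no",
    gateway,
] + queue_scan_cmd

# A timeout still guards against a connection that hangs for other reasons.
result = subprocess.run(ssh_cmd, capture_output=True, timeout=30)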

albertz commented 6 months ago

... sisyphus will schedule stacked queue scan commands which will ask for a password. This will lead to subprocess buildup, which eventually crashes the manager with a Too many open files error.

Also, I wonder about this. How can this happen? Are there uncleaned zombie procs? Did you check this? I think this is another separate thing you can and should fix.

Edit: I just checked. I think p.communicate(..., timeout=...) will actually not kill the proc when the timeout happens. And when p goes out of scope, the Python GC will also not kill the proc. Actually, in Popen.__del__, if the proc is still alive, there is this logic:

        if self.returncode is None and _active is not None:
            # Child is still running, keep us alive until we can wait on it.
            _active.append(self)

So this means that if the proc never dies by itself, it stays alive forever. To get rid of it, you must explicitly kill it somewhere, but it seems we never do that?

Btw, if we would just use subprocess.run, it would properly cleanup (kill) the subproc in case of a timeout. So instead of p = subprocess.Popen(...) and then p.communicate, just directly use subprocess.run.
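
A minimal standalone sketch of the difference (not sisyphus code):

import subprocess

cmd = ["sleep", "60"]  # stand-in for a hanging ssh / queue-scan command

# Problematic pattern: on timeout, communicate() raises, but the child keeps running
# unless it is killed explicitly.
p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
try:
    out, _ = p.communicate(timeout=1)
except subprocess.TimeoutExpired:
    p.kill()  # must be done explicitly, otherwise the process lingers
    p.wait()  # reap it so it does not stay around as a zombie

# Simpler: subprocess.run() kills and reaps the child itself when the timeout expires.
try:
    subprocess.run(cmd, stdout=subprocess.PIPE, timeout=1)
except subprocess.TimeoutExpired:
    pass  # the child has already been cleaned up by run()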

Icemole commented 6 months ago

Btw, if we would just use subprocess.run, it would properly cleanup (kill) the subproc in case of a timeout. So instead of p = subprocess.Popen(...) and then p.communicate, just directly use subprocess.run.

I think that's a much more intuitive solution than what I proposed.

albertz commented 6 months ago

Btw, if we would just use subprocess.run, it would properly cleanup (kill) the subproc in case of a timeout. So instead of p = subprocess.Popen(...) and then p.communicate, just directly use subprocess.run.

I think that's a much more intuitive solution than what I proposed.

Additionally, I still would add the -o BatchMode=yes or -o PasswordAuthentication=no though. There is no need to wait for the timeout.

albertz commented 6 months ago

Btw, I wonder about similar wrong usages of subprocess.Popen in Sisyphus. There are probably a few. I just checked, it seems like in most of our backend engines, we have it wrong in the same way. Maybe you should apply the same fix for all of them?

Icemole commented 6 months ago

Btw, I wonder about similar wrong usages of subprocess.Popen in Sisyphus. There are probably a few. I just checked, it seems like in most of our backend engines, we have it wrong in the same way. Maybe you should apply the same fix for all of them?

I can do that as well. I actually envisioned joining that code into another common file because some of it is the exact same code, and we would be removing a bit of code duplication.

albertz commented 6 months ago

Btw, I wonder about similar wrong usages of subprocess.Popen in Sisyphus. There are probably a few. I just checked, it seems like in most of our backend engines, we have it wrong in the same way. Maybe you should apply the same fix for all of them?

I can do that as well.

I see the same wrong usage in these files:

I actually envisioned joining that code into another common file because some of it is the exact same code, and we would be removing a bit of code duplication.

I would not do that for now. It's basically one line of code (subprocess.run). There is not really much code duplication. Moving this one line of code elsewhere would just make it more complicated. Or do you also mean the gateway logic? But in any case, I would not do this here in this PR.

critias commented 6 months ago

I would not do that for now. It's basically one line of code (subprocess.run). There is not really much code duplication. Moving this one line of code elsewhere would just make it more complicated. Or do you also mean the gateway logic? But in any case, I would not do this here in this PR.

I agree with Albert here. And thanks for figuring this out, I didn't realize that the subprocess.Popen process would stay open.

Icemole commented 6 months ago

Roger that :) will implement right away.

albertz commented 6 months ago

Looks all fine now. Except maybe this aspect:

I've also changed the hardcoded 30-second interval to one depending on gs.WAIT_PERIOD_BETWEEN_CHECKS, which I find makes more sense.

I wonder whether it makes sense to reuse this setting for the timeout. I personally would have kept those decoupled. I.e. for now, leave the 30 seconds, and if you want to be able to configure that, make a separate setting for it.
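
For illustration, a hypothetical separate setting could look like this (SSH_TIMEOUT is not an existing sisyphus setting, just a sketch):

# Hypothetical new entry in global_settings.py (name chosen for illustration only):
SSH_TIMEOUT = 30  # seconds to wait for the SSH queue-scan command

# In the engine code, falling back to the previous hardcoded value:
import sisyphus.global_settings as gs

timeout = getattr(gs, "SSH_TIMEOUT", 30)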

Icemole commented 6 months ago

Looks all fine now. Except maybe this aspect:

I've also changed the hardcoded 30-second interval to one depending on gs.WAIT_PERIOD_BETWEEN_CHECKS, which I find makes more sense.

I wonder whether it makes sense to reuse this setting for the timeout. I personally would have kept those decoupled. I.e. for now, leave the 30 seconds, and if you want to be able to configure that, make a separate setting for it.

Okay, I'll change it back. I thought it made sense given the name.

For now I'll change it back to 30 seconds, but upon reviewing the global_settings.py file, I found the following potential candidates ("separate settings") already in there:

Icemole commented 5 months ago

I've been testing this, and while my manager didn't crash anymore because of too many open files, there was abnormally high memory usage that kept growing over time whenever the subprocesses didn't finish successfully:

[2024-06-13 07:24:26,842] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]
[2024-06-13 07:24:26,953] ERROR: Error to submit job, return value: 255
[2024-06-13 07:24:26,953] ERROR: SBATCH command: sbatch -J i6_core.returnn.extract_prior.ReturnnComputePriorJobV2.yYMFtGVTZgnn.run -o work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn/engine/%x.%A.%a [...] --wrap=/usr/bin/python3 /home/nbeneitez/work/sisyphus/too-many-open-files-fix/recipe/sisyphus/sis worker --engine long work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run
[2024-06-13 07:24:26,953] ERROR: Error: [...] Permission denied (publickey).

When messages like this one were spammed, I noticed a very high memory usage from the manager's side. I'll try to review the code a bit more.

albertz commented 5 months ago

The code looks fine to me though (sorry, forgot to review again, did that now).

You should debug this better. I.e. while you see the memory increase, is this memory increase really in the main proc of the manager? Or maybe just in some sub procs? Are the timed-out sub procs properly cleaned up (i.e. no zombie procs hanging around)?
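
One way to check this from the outside, assuming psutil is available (a sketch; manager_pid is a placeholder for the manager's actual PID):

import psutil

manager_pid = 12345  # placeholder: the PID of the running sisyphus manager
manager = psutil.Process(manager_pid)

# Any leftover child processes, e.g. timed-out ssh/sbatch calls?
for child in manager.children(recursive=True):
    print(child.pid, child.status(), " ".join(child.cmdline() or []))

# Zombies show up with a dedicated status:
zombies = [c for c in manager.children(recursive=True)
           if c.status() == psutil.STATUS_ZOMBIE]
print("zombie children:", len(zombies))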

albertz commented 5 months ago

I'm testing with a simple script whether there are any problems w.r.t. memory consumption or so. Script here, or just directly here:

import subprocess

while True:
    try:
        subprocess.run(["cat", "/dev/zero"], timeout=0.01)
    except subprocess.TimeoutExpired:
        print("TimeoutExpired")

Just run this, and watch the memory consumption meanwhile. I don't see that there is any increase in memory.

So, I don't think that this is the problem.

albertz commented 5 months ago

When messages like this one were spammed

What do you mean by spammed? How often do you get them? It should wait each time for the timeout, or not?

Icemole commented 5 months ago

You should debug this better. I.e. while you see the memory increase, is this memory increase really in the main proc of the manager? Or maybe just in some sub procs? Are the timed-out sub procs properly cleaned up (i.e. no zombie procs hanging around)?

Agreed, I panicked because I was making our head node go super slow and canceled the program :sweat_smile: I'll make a mental note to scan /proc thoroughly next time this happens.

What do you mean by spammed? How often do you get them? It should wait each time for the timeout, or not?

It doesn't wait for the full timeout (30 seconds), but it's not spammy (as in < 1 second per print) either:

[2024-06-13 06:30:15,908] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]                                          
[2024-06-13 06:30:16,021] ERROR: Error to submit job, return value: 255     
...                                                                                               
[2024-06-13 06:30:28,284] INFO: Submit to queue: work/i6_core/returnn/extract_prior/ReturnnComputePriorJobV2.yYMFtGVTZgnn run [1]                                          

The job submission immediately fails with Permission denied (publickey), and it seems to wait another 10-20 seconds before submitting another job.

Icemole commented 5 months ago

I ran into the issue again. I scanned the process, and it seems to open many submit_log.run file descriptors at some point, immediately close them, and then do nothing else for 15-30 seconds. The ls commands below were run roughly 0.5 seconds apart:

$ ls -lha /proc/3442960/fd
total 0
dr-x------ 2 nbeneitez domain_users  0 Jun 13 04:12 .
dr-xr-xr-x 9 nbeneitez domain_users  0 Jun 13 04:12 ..
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 0 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 1 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 10 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 11 -> 'pipe:[474983584]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 12 -> 'pipe:[474983584]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 13 -> 'pipe:[474967409]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 14 -> 'pipe:[474967409]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 15 -> 'pipe:[474988135]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:13 16 -> 'pipe:[474988135]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 2 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 4 -> 'socket:[474942638]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 5 -> 'socket:[474942639]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 7 -> 'socket:[474942644]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 8 -> 'socket:[474942645]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 9 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'

$ ls -lha /proc/3442960/fd
total 0
dr-x------ 2 nbeneitez domain_users  0 Jun 13 04:12 .
dr-xr-xr-x 9 nbeneitez domain_users  0 Jun 13 04:12 ..
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 0 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 1 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 10 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 11 -> 'pipe:[474983584]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 12 -> 'pipe:[474983584]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 13 -> 'pipe:[474967409]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 14 -> 'pipe:[474967409]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 15 -> 'pipe:[474988135]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:13 16 -> 'pipe:[474988135]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 06:06 17 -> 'pipe:[518288582]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:58 18 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 06:06 19 -> .../submit_log.run
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 2 -> /dev/pts/8
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:43 20 -> .../submit_log.run
l-wx------ 1 nbeneitez domain_users 64 Jun 14 05:15 21 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:21 22 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:59 23 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 05:21 24 -> .../submit_log.run
lr-x------ 1 nbeneitez domain_users 64 Jun 14 04:54 25 -> .../submit_log.run
l-wx------ 1 nbeneitez domain_users 64 Jun 14 05:52 26 -> 'pipe:[518288582]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 04:53 27 -> 'pipe:[518288583]'
l-wx------ 1 nbeneitez domain_users 64 Jun 14 06:16 28 -> 'pipe:[518288583]'
lr-x------ 1 nbeneitez domain_users 64 Jun 14 06:02 29 -> .../submit_log.run
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 4 -> 'socket:[474942638]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 5 -> 'socket:[474942639]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 7 -> 'socket:[474942644]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 8 -> 'socket:[474942645]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 9 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'

$ ls -lha /proc/3442960/fd
total 0
dr-x------ 2 nbeneitez domain_users  0 Jun 13 04:12 .
dr-xr-xr-x 9 nbeneitez domain_users  0 Jun 13 04:12 ..
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 0 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 1 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 10 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 11 -> 'pipe:[474983584]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 12 -> 'pipe:[474983584]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 13 -> 'pipe:[474967409]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:12 14 -> 'pipe:[474967409]'
lr-x------ 1 nbeneitez domain_users 64 Jun 13 04:12 15 -> 'pipe:[474988135]'
l-wx------ 1 nbeneitez domain_users 64 Jun 13 04:13 16 -> 'pipe:[474988135]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 2 -> /dev/pts/8
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 4 -> 'socket:[474942638]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 5 -> 'socket:[474942639]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 6 -> 'anon_inode:[eventpoll]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 7 -> 'socket:[474942644]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 8 -> 'socket:[474942645]'
lrwx------ 1 nbeneitez domain_users 64 Jun 13 04:12 9 -> '/dev/shm/pym-3442960-9n996ekj (deleted)'

All the submit_log.run descriptors were pointing to different jobs. I don't see anything suspicious about this: it's just writing the queuing log to these files and then closing them. In general there seem to be around 16 open files, many of which are sockets and pipes.

The memory usage is currently at 6.4% of 250 GB and keeps growing slowly but surely (update: 6.6% at the time of finishing this comment, so roughly 10 minutes later). ps aux | grep <manager-pid> only shows one process:

USER         PID %CPU %MEM      VSZ   RSS     TTY      STAT  START    TIME   COMMAND
nbeneit+ 3442960  164  6.5 22736408 17258504 pts/8 Sl+ Jun13 2585:55 python3 sis m config/training.py

but htop does show many processes with the same name but different PIDs, some of which have been running for some hours already. (Two htop screenshots were attached here.)

In /proc/3442960/task I see 106 subdirectories, so the program seems to have spawned many threads, some of which apparently aren't being cleaned up, since according to man 5 proc:

Underneath each of the /proc/pid directories, a task subdirectory contains subdirectories of the form task/tid, which contain corresponding information about each of the threads in the process, where tid is the kernel thread ID of the thread.

subprocess.run waits for the program to complete and then returns. Is it possible that some of the processes never finish? But we should have the timeout for this :thinking: why are some subprocesses not being cleaned up?

Update: I watched ls /proc/3442960/task | wc for a minute or so but the number of active threads didn't increase.
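
A small sketch of how the fd and thread counts could be watched over time via /proc (assuming Linux; the PID is the one from the listings above):

import os
import time

pid = 3442960  # manager PID from the listings above

while True:
    n_fds = len(os.listdir(f"/proc/{pid}/fd"))
    n_threads = len(os.listdir(f"/proc/{pid}/task"))
    print(f"open fds: {n_fds}, threads: {n_threads}")
    time.sleep(5)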

albertz commented 5 months ago

Why do you actually think the memory leak is related to this PR? Does it only increase when you get those "Error to submit job" messages, and otherwise stay constant?

So are there alive subprocs or not? Are there any zombie procs? (What does ps ax say?)

What are the threads? (E.g. what does py-spy or pystack say?)

Where is the memory allocated? (What does a mem profiler say?)
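
For that last point, Python's built-in tracemalloc would be one option; a rough sketch (where exactly to hook it into the manager loop is left open):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 stack frames per allocation

# ... later, e.g. once per manager loop iteration:
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # top allocation sites by total size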

Icemole commented 5 months ago

Yes, it could very well be the case that there's some other error going on that's not related to the PR, but I was able to observe it after doing this change, so I've reported it here. If you want we can move this discussion to another PR.

There are no zombie processes. ps ax | grep Z reports nothing.

Sadly I don't have sudo powers and ptrace is not enabled, so I can't dump stats with pystack. However, there are exactly 106 warnings with the message WARNING(process_remote): Failed to attach to thread <pid>: Operation not permitted, with the PIDs being exactly the same as the ones in /proc/3442960/task. Besides, I've just noticed that all thread IDs in /proc/3442960/task are sequential, which might indicate that they were created at the same time.

albertz commented 5 months ago

sudo powers and ptrace is not enabled

You don't need sudo for that. Just some admin has to enable ptrace. Then you can use it without sudo. Please do that. It doesn't make sense that we waste time here by just guessing around instead of properly debugging the issue.

Icemole commented 5 months ago

Thanks for the suggestion, I've already asked to enable ptrace for everyone. In the meantime, one of our admins got me the trace we were looking for: pystack_threads_tracker.log.

From the log it would seem as if half of the threads (~50) were just starting up (see task = get()), and the other half were waiting for some job to finish. Moreover, the latter seem to be training-related since they go through i6_core/returnn/training.py. As you said, I indeed don't see any abnormal behavior here...

albertz commented 5 months ago

got me the trace we were looking for ...

Is this the trace from the time when the problem occurs, i.e. while messages like that one were being spammed, or during the memory increase you observed in htop or so? It's important to get the trace at exactly that time. Also, get not just one trace but several traces during that period, to be sure you don't miss some interesting bits.

From that trace, I can also only say that there don't seem to be any weird threads running. But since, as you said, no more threads are coming up (specifically during the mem increase period), the threads are likely not the issue. Also, as there are no zombie procs, no excessive number of subprocs, and not too many open files, I don't see any indication that the mem leak is related to this PR. I still wonder why you got the idea that it might be related. Or does this abnormally high memory usage only occur now with this PR and has never occurred before?

As the next step, I would do some actual memory profiling, to see where the memory increase occurs in the code.

In any case, I think independently of this, we can go forward with this PR here. There are some outstanding issues which I commented on.

Icemole commented 5 months ago

Definitely agreed, this was off-topic.

@albertz I've fixed your feedback.