perfsonar / pscheduler

The perfSONAR Scheduler
Apache License 2.0

Terminate Runs of Background Tasks on Cancel #1425

Open arlake228 opened 5 months ago

arlake228 commented 5 months ago

I ran into a case in 5.1.0 where I built a mesh in which some of the far-end hosts had not come up yet. Since latencybg tests are single-participant, the tasks were still created and powstream was started. Since there was nothing on the far end, powstream never yielded any results.

pSConfig would run again, notice that these background tasks weren't producing any results (based on the run count in the pScheduler API), send a cancel, and then create a new task. An example of such a task is below:

{
  "test": {
    "spec": {
      "dest": "10.128.15.213",
      "flip": false,
      "schema": 1,
      "source": "10.128.15.209",
      "data-ports": {
        "lower": 8760,
        "upper": 9960
      }
    },
    "type": "latencybg"
  },
  "tool": "powstream",
  "detail": {
    "cli": [
      "--source",
      "10.128.15.209",
      "--dest",
      "10.128.15.213",
      "--data-ports",
      "8760-9960",
      "--flip"
    ],
    "post": "P0D",
    "runs": 1,
    "slip": "P0D",
    "added": "2024-04-04T19:45:39+00:00",
    "diags": "Hints:\n  requester: 10.128.15.209\n  server: 10.128.15.209\nIdentified as everybody, local-interfaces\nClassified as default, friendlies\nApplication: Hosts we trust to do (almost) everything\n  Group 1: Limit 'throughput-sane-parallel' passed\n  Group 1: Want all, 1/1 passed, 0/1 failed: PASS\n  Application PASSES\nPassed one application.  Stopping.\nProposal meets limits\nPriority set to default of 0",
    "hints": {
      "server": "10.128.15.209",
      "requester": "10.128.15.209"
    },
    "start": "2024-04-04T19:45:08+00:00",
    "anytime": true,
    "enabled": false,
    "duration": "P1DT2S",
    "exclusive": false,
    "participant": 0,
    "multi-result": true,
    "participants": [
      "10.128.15.209"
    ],
    "runs-started": 1,
    "href": "https://34.170.22.82/pscheduler/tasks/4c7790bb-3f7c-472e-bb6d-ac9cdd1e22d7",
    "runs-href": "https://34.170.22.82/pscheduler/tasks/4c7790bb-3f7c-472e-bb6d-ac9cdd1e22d7/runs",
    "first-run-href": "https://34.170.22.82/pscheduler/tasks/4c7790bb-3f7c-472e-bb6d-ac9cdd1e22d7/runs/first",
    "next-run-href": "https://34.170.22.82/pscheduler/tasks/4c7790bb-3f7c-472e-bb6d-ac9cdd1e22d7/runs/next"
  },
  "schema": 3,
  "archives": [
    {
      "data": {
        "op": "put",
        "_url": null,
        "schema": 3,
        "_headers": null,
        "verify-ssl": false
      },
      "archiver": "http"
    },
    {
      "data": {
        "op": "put",
        "_url": null,
        "schema": 2,
        "_headers": null
      },
      "archiver": "http"
    }
  ],
  "schedule": {
    "start": "2024-04-04T19:45:08Z",
    "until": "2024-04-05T19:45:08Z"
  },
  "reference": {
    "psconfig": {
      "created-by": {
        "uuid": "F6BEB1D2-BE8B-4709-9CE9-8ED57013C3F1",
        "user-agent": "psconfig-pscheduler-agent",
        "agent-hostname": "ps-dev-staging-el9-tk1.c.esnet-perfsonar.internal"
      }
    },
    "display-task-name": "GCP Packet Loss Tests",
    "display-task-group": [
      "GCP Tests"
    ]
  },
  "href": "https://34.170.22.82/pscheduler/tasks/4c7790bb-3f7c-472e-bb6d-ac9cdd1e22d7"
}

After canceling the task, pSConfig would create a new task. The problem is that the cancel operation doesn't kill the powstream. This led to a build-up of powstream processes (and quite a bit of memory usage), as new ones got created every hour for tasks in this category. I know we have discussed this before, but I was having trouble finding the issue for it. I think pSConfig is doing the right thing with the information it has, but it would be nice to kill the underlying background process on cancel.
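In the meantime, a rough stopgap (not something pscheduler does itself) is to list powstream processes that have outlived the task duration so they can be inspected or killed by hand. The 86400-second threshold below is just the P1DT2S duration from the task above rounded down; it assumes standard procps ps and awk:

# Hedged stopgap: print PIDs of powstream processes older than one day
# (etimes = elapsed time in seconds; adjust the threshold to your task duration)
ps -eo pid,etimes,comm | awk '$3 == "powstream" && $2 > 86400 {print $1}'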

mfeit-internet2 commented 5 months ago

pSConfig would run again and notice we had these background tasks that weren't producing any results (based on the run count in pscheduler api), send a cancel and then create a new task.

The run count is the number of runs completed, so it makes sense that a long-running latencybg task would show zero until it completes. The count of runs started is in the details as runs-started. Using that might get better results.
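As a sketch of that check (not pSConfig's actual logic): given the task document pasted above saved locally as task.json, and assuming jq is installed, the two counters can be compared directly. A latencybg task that has started a run but completed none is still healthy:

# Compare completed runs vs. started runs from a saved task document.
jq '{runs: .detail.runs, "runs-started": .detail."runs-started"}' task.json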

Stopping runs underway is covered by #129. It will require some work; I'm not sure how much yet. I'll bring that into 5.2.0 and could be talked into making a 5.1.x bugfix.

mty22 commented 2 months ago

Hi @arlake228

I've also run into a similar issue, particularly with latencybg tests and powstream memory usage, which leads to the OOM killer being invoked on other services such as OpenSearch and causes system instability on perfSONAR (v5.1.2) hosts with 16 GB of RAM.
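For anyone trying to quantify this, here's a quick way (assuming standard procps ps and awk) to see how much resident memory the accumulated powstream processes are holding:

# Sum the RSS of all powstream processes and report it in MiB.
ps -eo rss,comm | awk '$2 == "powstream" {sum += $1; n++} END {printf "%.1f MiB across %d powstream processes\n", sum/1024, n}'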

At least from my observations, the pscheduler-runner systemd unit invokes /usr/libexec/pscheduler/daemons/runner, which then invokes /usr/libexec/pscheduler/classes/tool/powstream/run (which in turn invokes the powstream binary to run tests). At some point, powstream binaries are left running while new tests are invoked.
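One way to see that hierarchy, and which powstream processes have fallen out of it, is sketched below (assumes pstree from psmisc and pgrep are available; a powstream whose run wrapper has exited will typically show up re-parented, often under PID 1, rather than under the runner):

# Show the process tree under the runner daemon.
pstree -ap "$(pgrep -f /usr/libexec/pscheduler/daemons/runner | head -1)"
# List powstream processes with parent PID and elapsed time, oldest first.
ps -eo pid,ppid,etimes,cmd --sort=-etimes | grep '[p]owstream'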

In my case, here's what it looks like over time on a single host within a mesh of 16 servers over 2 days:

[image: memory usage on a single host over 2 days]

This is the last 24-hour period:

[image: memory usage over the last 24 hours]

All 16 servers show similar issues; some are worse than others.

On the specific server in question:

[root@somefancyhost ~]# netstat -nalp | grep powstream | grep tcp | wc -l
210
[root@somefancyhost ~]# netstat -nalp | grep powstream | grep udp | wc -l
137
[root@somefancyhost ~]# ps aux | grep powstream | wc -l
277

On the oldest running powstream pid:

[root@somefancyhost  ~]# lsof -np 440
COMMAND   PID       USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
powstream 440 pscheduler  cwd    DIR     253,0      279       128 /
powstream 440 pscheduler  rtd    DIR     253,0      279       128 /
powstream 440 pscheduler  txt    REG     253,0   237304  17034374 /usr/bin/powstream
powstream 440 pscheduler  mem    REG     253,0   636848  33978986 /usr/lib64/libpcre2-8.so.0.11.0
powstream 440 pscheduler  mem    REG     253,0   175760  33978989 /usr/lib64/libselinux.so.1
powstream 440 pscheduler  mem    REG     253,0  2592552  34645670 /usr/lib64/libc.so.6
powstream 440 pscheduler  mem    REG     253,0   108136  33685639 /usr/lib64/libgcc_s-11-20231218.so.1
powstream 440 pscheduler  mem    REG     253,0   914376  34645673 /usr/lib64/libm.so.6
powstream 440 pscheduler  mem    REG     253,0   149216  33839734 /usr/lib64/libnss_resolve.so.2
powstream 440 pscheduler  mem    REG     253,0    36304  33793754 /usr/lib64/libcap.so.2.48
powstream 440 pscheduler  mem    REG     253,0   858936  34645666 /usr/lib64/ld-linux-x86-64.so.2
powstream 440 pscheduler    0r  FIFO      0,13      0t0 179512011 pipe
powstream 440 pscheduler    1w  FIFO      0,13      0t0 179512012 pipe
powstream 440 pscheduler    2w  FIFO      0,13      0t0 179512013 pipe
powstream 440 pscheduler    3uW  REG     253,5        4  50331821 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/.powlock
powstream 440 pscheduler    4r   CHR       1,9      0t0         9 /dev/urandom
powstream 440 pscheduler    5u  IPv4 193866091      0t0       TCP <removed src ip>:59345-><removed dst ip>:owamp-control (ESTABLISHED)
powstream 440 pscheduler    6u   REG     253,5        0  50331818 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.ssTeYq (deleted)
powstream 440 pscheduler    7u   REG     253,5        0  50331818 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.ssTeYq (deleted)
powstream 440 pscheduler    8u  IPv4 193875111      0t0       TCP <removed src ip>:41175-><removed dst ip>:owamp-control (ESTABLISHED)
powstream 440 pscheduler    9u   REG     253,5        0  50331924 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.VqBN5u (deleted)
powstream 440 pscheduler   10u   REG     253,5        0  50331924 /var/pscheduler-server/runner/tmp/tmpqzndihb8/9d09b5f3-fad5-4bc9-a34a-a2c7b18b5f86-2024Aug01T084447303411/pow.VqBN5u (deleted)
powstream 440 pscheduler   12u   REG     253,6        0       170 /tmp/owpskips.zNYTA8 (deleted)
powstream 440 pscheduler   13u   REG     253,6        0       601 /tmp/owpskips.NE7yDv (deleted)

strace on the same PID (440) indicates that it's alive and well (I can provide a 120-second sample if required). Random sampling of other running powstream PIDs looks similar.
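For reference, the kind of sample I mean is just a time-boxed attach along these lines (hedged; exact flags may vary with your strace version, and timeout comes from coreutils):

# Attach to PID 440 for ~120 seconds and write timestamped syscalls to a file.
timeout 120 strace -ttt -f -p 440 -o /tmp/powstream-440.strace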

I also see the same issues within our perfSONAR (v5.1.2) dev env (which is a smaller mesh of 3 servers within the same site).

In addition, I was chatting with another perfSONAR server operator who mentioned that they don't use latencybg tests because they have been problematic.

FWIW, I had initially thought the issue was being caused by owstats (as mentioned in this issue, which I had created), but I no longer think that's the case (a red herring, perhaps). That said, swapping from chronyd back to ntpd did slow the OOM events down.