perfsonar / pscheduler

The perfSONAR Scheduler
Apache License 2.0
55 stars 34 forks source link

pSc-scheduler locked for locally initiated tasks: “never scheduled a run for the task” #521

Closed igarny closed 1 year ago

igarny commented 7 years ago

Problem: there are seldom cases, where locally initiated tasks are not being scheduled. Error message from CLI is: “never scheduled a run for the task.” (see below) The weirdness about it is that the task is a single participant with local node as source and is a non-blocking one.

  1. As an observation it is also appealing, that the task ID (generated by pSc at CLI) is not reported in the pSc.logs, unless the debug flag is used.
  2. In a normal operation: the moment pSc generates a task ID and a run, there is an event in the pSc.log that signifies the creation of the task even without the debug flag.
  3. Another observation is that remote participants are able to negotiate tests with the local host, through the API even during the "blocking" state. Their requests (task IDs ) are being reported in the pSc.log the moment they are run.
  4. Also pSc monitor operates fine and shows tasks as they stay in the scheduler and dynamically updates the status of those on the screen. Nevertheless there are no tasks from the local mesh configuration visible.

pscheduler task --debug rtt --dest es.net

2017-07-21T14:20:11 Debug signal ignored; already not debugging
2017-07-21T14:20:11 Debug discontinued
2017-07-21T14:20:11 Assistance is from localhost
2017-07-21T14:20:11 Forcing default slip of PT5M
2017-07-21T14:20:11 Converting to spec via https://localhost/pscheduler/tests/rtt/spec
Submitting task...
2017-07-21T14:20:11 Fetching participant list
2017-07-21T14:20:11 Spec is: {"dest": "es.net", "schema": 1}
2017-07-21T14:20:14 Got participants: {u'participants': [None], u'null-reason': u'No source specified'}
2017-07-21T14:20:14 Lead is None
2017-07-21T14:20:14 Null reason is No source specified
2017-07-21T14:20:14 Pinging https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/
2017-07-21T14:20:14 psmp-gn-bw-01-lon-uk.geant.net is up
2017-07-21T14:20:14 Posting task to https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks
2017-07-21T14:20:14 Data is {"test": {"type": "rtt", "spec": {"dest": "es.net", "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1}
Task URL:
https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71
2017-07-21T14:20:20 Posted https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71
Running with tool 'ping'
Fetching first run...
2017-07-21T14:20:20 Fetching https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71/runs/first
psmp-gn-bw-01-lon-uk.geant.net never scheduled a run for the task.
igarny commented 7 years ago

Here is the communication on the topic from Slack: Hi Andy

[3:50] Last Friday I had this strange behavior of meshconfig not defining any latency task with psc

[3:50] Today I see this on 2 servers

[3:51] Anything I can do to understand why would meshconfig ignore the config in the -tasks file

andy [3:51 PM] you mean rtt tests?

igarny [3:52 PM] powstream

[3:52] I am missing powstream for some hours already

andy [3:54 PM] can you send me the endpoints of the tests that are missing?

igarny [3:55 PM] everything coming from this server: https://prod-psmdsh-gn.geant.net/perfsonar-graphs/?url=http://prod-psma-gn-01-buc-ro.geant.net/esmond/perfsonar/archive/&source=psmp-gn-owd-01.gen.ch.geant.net&dest=psmp-gn-owd-01-ath-gr-v6.geant.net&ipversion=6#start=1500040553&end=1500645353&summaryWindow=3600&timeframe=1w

[3:56] I will initiate a manual latency test now

andy [3:57 PM] which of these is the server that is running the mesh-config that is supposed to create the test?

igarny [3:57 PM] the sender

andy [4:00 PM] looks like task is created but it has no runs

[4:00] task is here: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef

[4:01] lack of runs is here: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs

igarny [4:01 PM] added this Plain Text snippet pscheduler task latencybg --source psmp-gn-owd-01.gen.ch.geant.net --packet-padding 0 --dest psmp-gn-owd-01-ath-gr-v4.geant.net --ip-version 4 --packet-interval 0.1 --duration PT400S --packet-count 1200 Submitting task... Task URL: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/54aa6b48-e577-4940-993e-6a2ca5f24abf Running with tool 'powstream' Fetching first run... psmp-gn-owd-01.gen.ch.geant.net never scheduled a run for the task. Add Comment Click to expand inline 8 lines

andy [4:02 PM] ok, that's consistent with the one from the mesh-config

[4:02] maybe try restarting pscheduler-scheduler?

[4:02] it should at least be creating the first run

igarny [4:02 PM] yes ...thanks

[4:03] at least I have a lead now...

andy [4:03 PM] fwiw, this is the query i used to check if pscheduler knew about the task: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks?json={%22tool%22:%22powstream%22,%20%22test%22:{%22spec%22:{%22dest%22:%22psmp-gn-owd-01-ath-gr-v6.geant.net%22}}}

mfeit [4:04 PM] I have to bolt, but turning on debug in the scheduler would be useful.

igarny [4:04 PM] https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/751e28fe-ff2a-40c1-9c75-6a7f32390df4/runs/first

[4:04] but it is taking too long to respond

andy [4:04 PM] if you do /first it blocks until there is a run or you timeout

igarny [4:04 PM] nevertheless pscheduler monitor works

andy [4:05 PM] you can also do a refresh on https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs. it should be non-empty if the pscheduler-scheduler is working like expected

igarny [4:06 PM] the problem with this is that I am running this task from CLI

[4:06] and for 400s only

andy [4:07 PM] the task i linked to above doesn't end until tomorrow, so if the pscheduler-server comes back and is working properly, it should fill-in a run

igarny [4:08 PM] added this Plain Text snippet Jul 21 14:01:18 psmp-gn-bw-01-gen-ch runner INFO 2233259: With iperf3: throughput --omit PT10S --duration PT20S --source psmall-poz1.man.poznan.pl --ip-version 6 --dest psmp-gn-bw-01-gen-ch.geant.net --parallel 1 Jul 21 14:01:52 psmp-gn-bw-01-gen-ch runner INFO 2233259: Run succeeded. Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Debug signal ignored; already not debugging Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Debug discontinued Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Assistance is from localhost Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Forcing default slip of PT5M Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Converting to spec via https://localhost/pscheduler/tests/latencybg/spec Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Fetching participant list Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Spec is: {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1} Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Got participants: {u'participants': [u'psmp-gn-owd-01.gen.ch.geant.net']} Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Lead is psmp-gn-owd-01.gen.ch.geant.net Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Pinging https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/ Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG psmp-gn-owd-01.gen.ch.geant.net is up Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Posting task to https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Data is {"test": {"type": "latencybg", "spec": {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1} Jul 21 14:02:00 psmp-gn-bw-01-gen-ch task DEBUG Posted https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/751e28fe-ff2a-40c1-9c75-6a7f32390df4 Jul 21 14:02:00 psmp-gn-bw-01-gen-ch task DEBUG Fetching https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/751e28fe-ff2a-40c1-9c75-6a7f32390df4/runs/first Jul 21 14:05:43 psmp-gn-bw-01-gen-ch scheduler INFO Started Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Debug signal ignored; already not debugging Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Debug discontinued Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Assistance is from localhost Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Forcing default slip of PT5M Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Converting to spec via https://localhost/pscheduler/tests/latencybg/spec Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Fetching participant list Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Spec is: {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1} Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Got participants: {u'participants': [u'psmp-gn-owd-01.gen.ch.geant.net']} Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Lead is psmp-gn-owd-01.gen.ch.geant.net Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Pinging https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/ Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG psmp-gn-owd-01.gen.ch.geant.net is up Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Posting task to https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Data is {"test": {"type": "latencybg", "spec": {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1} Jul 21 14:05:55 psmp-gn-bw-01-gen-ch task DEBUG Posted https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/a63d1af1-fd85-4d64-b8b8-75095eb6db2f Jul 21 14:05:55 psmp-gn-bw-01-gen-ch task DEBUG Fetching https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/a63d1af1-fd85-4d64-b8b8-75095eb6db2f/runs/first Jul 21 14:06:16 psmp-gn-bw-01-gen-ch runner INFO 2233421: Running https://psmp-gn-bw-01-gen-ch.geant.net/pscheduler/tasks/4f4e1903-550b-4b2f-92cc-8fdba8fd4c82/runs/bc5a076a-6277-4f93-b133-3a805d5df0fe Add Comment Click to expand inline 35 lines

igarny [4:08 PM] the psc log show now signs of service being blocked

[4:08] I have restarted it already

[4:09] nevertheless the 3 CLI attempt ended with "never scheduled a run for this task"

andy [4:09 PM] k, did you do a pscheduler debug on scheduler as well per Mark's suggestion? (edited)

[4:10] that may or may not show us something useful

igarny [4:10 PM] I am runnign the tasks with debug, but now I will start a full debug

andy [4:11 PM] yeah, probably debug on the server side would be helpful too. just turing it on for the scheduler is probably enough to start with since its the one responsible for creating runs

[4:11] hey, i see a run now here: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs

[4:12] so looks like something you just did fixed it

igarny [4:12 PM] my 4th CLI attempt though failed with the same error

[4:13] I have started the debug mode only

[4:13] apart from the CLI, the powstream processes are back

andy [4:13 PM] if you do a ps auxw | grep e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef do you get back any powstream commands?

igarny [4:14 PM] yes

[4:14] 3 of them

[4:14] ps auxw | grep e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef 601 707 0.0 0.0 7096 416 ? Ss 14:13 0:00 /usr/bin/powstream -p -d /var/lib/pscheduler/tool/powstream/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef-2017Jul21T141100367455 -s 0 -c 1200 -i 0.1 -6 -t -S psmp-gn-owd-01.gen.ch.geant.net [psmp-gn-owd-01-ath-gr-v6.geant.net]:861 1412959200 2238 0.0 0.0 0 0 pts/0 S+ 14:14 0:00 grep e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef 601 60614 0.0 0.0 7096 1156 ? S 14:11 0:00 /usr/bin/powstream -p -d /var/lib/pscheduler/tool/powstream/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef-2017Jul21T141100367455 -s 0 -c 1200 -i 0.1 -6 -t -S psmp-gn-owd-01.gen.ch.geant.net [psmp-gn-owd-01-ath-gr-v6.geant.net]:861 601 60648 0.0 0.0 7052 284 ? Ss 14:11 0:00 /usr/bin/powstream -p -d /var/lib/pscheduler/tool/powstream/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef-2017Jul21T141100367455 -s 0 -c 1200 -i 0.1 -6 -t -S psmp-gn-owd-01.gen.ch.geant.net [psmp-gn-owd-01-ath-gr-v6.geant.net]:861

andy [4:14 PM] ok, perfect that means powstream is running

igarny [4:14 PM] yes...I will go to the next server to apply the fix

andy [4:14 PM] i also see a result now: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs/3c8a2fa5-861a-46f0-b0bc-66cc3ef862d2

[4:15] if you do a cli command now does it scheduler a run? even if you do a simple pscheduler task rtt --dest es.net we should get an idea (edited)

[4:16] basically just curious if the scheduler was stuck before or if it is just waaay behind on adding runs (edited)

igarny [4:16 PM] so the CLI with rtt test succeeded

[4:17] I will try that on the second server

andy [4:17 PM] ok, so must have been stuck before. all you did was restart pscheduler-scheduler and turn debug on?

igarny [4:17 PM] yes

[4:17] well several CLI attempts as well

andy [4:18 PM] hmmm ok, when @mfeit gets back we should maybe see if there is any other diagnostics he wants you to grab from the logs.

igarny [4:18 PM] added this Plain Text snippet pscheduler task rtt --dest es.net Submitting task... Task URL: https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/1a234791-a74f-428d-b944- 1ec1264e581b Running with tool 'ping' Fetching first run... psmp-gn-bw-01-lon-uk.geant.net never scheduled a run for the task. Add Comment Click to expand inline 8 lines

igarny [4:18 PM] London server

andy [4:19 PM] ok, so same issue like you suspected

igarny [4:19 PM] added and commented on this Plain Text snippet no Jul 21 14:15:19 psmp-gn-bw-01-lon-uk runner INFO 2428566: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/39b3c942-7515-4193-bfc2-ebcbab496b0c/runs/21db5b41-2d86-42cb-b0ef-c4f5d84b6216 Jul 21 14:15:19 psmp-gn-bw-01-lon-uk runner INFO 2428566: With iperf3: throughput --duration PT30S --source psmp-gn-bw-01-lis-pt-v6.geant.net --ip-version 6 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:15:52 psmp-gn-bw-01-lon-uk runner INFO 2428566: Run succeeded. Jul 21 14:17:38 psmp-gn-bw-01-lon-uk runner INFO 2428573: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/eddad4dd-761a-41d0-b62d-76c2c7396d69/runs/7c12c527-ca43-40d8-b129-5a205b9e2c18 Jul 21 14:17:38 psmp-gn-bw-01-lon-uk runner INFO 2428573: With iperf3: throughput --duration PT30S --source psmp-gn-bw-01-vie-at-v4.geant.net --ip-version 4 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:18:11 psmp-gn-bw-01-lon-uk runner INFO 2428573: Run succeeded. 1 Comment Click to expand inline 6 lines the logs of pscheduler

igarny [4:19 PM] psc logs is in this state for a min already

andy [4:20 PM] my guess is on the first server the pscheduler-scheduler restart is what actually fixed it, probably just took a minute to see that fact since it was likely busy catching up on all the missed runs

[4:20] not sure why it got stuck to begin with though

igarny [4:21 PM] so on London server I still have not restarted psc-sched

[4:21] but the second run of rtt from CLI with debug mode results in events in the logs

[4:21] still the outcome is the same

[4:21] the task is blocked

[4:22] my question to you and Mark would be, how do I detect it....

[4:23] should I be running regular rtt tasks, so that I know when to restart the daemon

[4:23] in the event log of psc I just got an update

igarny [4:23 PM] added this Plain Text snippet Jul 21 14:17:38 psmp-gn-bw-01-lon-uk runner INFO 2428573: With iperf3: throughput --duration PT30S --source psmp-gn-bw-01-vie-at-v4.geant.net --ip-version 4 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:18:11 psmp-gn-bw-01-lon-uk runner INFO 2428573: Run succeeded. Jul 21 14:19:44 psmp-gn-bw-01-lon-uk runner INFO 2428370: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/893c9759-9d36-43ea-b1c5-fd2e23a0b204/runs/0040d207-e606-4363-8392-5ce122cb54cd Jul 21 14:19:44 psmp-gn-bw-01-lon-uk runner INFO 2428370: With iperf3: throughput --duration PT20S --source ps.test.manlan.internet2.edu --dest 62.40.106.131 --parallel 1 Jul 21 14:20:08 psmp-gn-bw-01-lon-uk runner INFO 2428370: Run succeeded. Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Debug signal ignored; already not debugging Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Debug discontinued Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Assistance is from localhost Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Forcing default slip of PT5M Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Converting to spec via https://localhost/pscheduler/tests/rtt/spec Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Fetching participant list Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Spec is: {"dest": "es.net", "schema": 1} Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Got participants: {u'participants': [None], u'null-reason': u'No source specified'} Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Lead is None Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Null reason is No source specified Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Pinging https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/ Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG psmp-gn-bw-01-lon-uk.geant.net is up Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Posting task to https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Data is {"test": {"type": "rtt", "spec": {"dest": "es.net", "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1} Jul 21 14:20:20 psmp-gn-bw-01-lon-uk task DEBUG Posted https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71 Jul 21 14:20:20 psmp-gn-bw-01-lon-uk task DEBUG Fetching https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71/runs/first Jul 21 14:21:00 psmp-gn-bw-01-lon-uk runner INFO 2428475: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/62c9cc08-da30-4a46-9f56-11775fd82d18/runs/5acc890f-c04f-4116-b98b-4f548f9adb3a Jul 21 14:21:00 psmp-gn-bw-01-lon-uk runner INFO 2428475: With iperf3: throughput --omit PT5S --duration PT10S --source lond-pt1.es.net --dest psmp-gn-bw-01-lon-uk.geant.net --window-size 67108864 --parallel 1 Jul 21 14:21:18 psmp-gn-bw-01-lon-uk runner INFO 2428475: Run succeeded. Jul 21 14:21:28 psmp-gn-bw-01-lon-uk runner INFO 2428393: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/5ee0fb22-49ad-4b0a-8a67-d9f683f8a291/runs/53ca04fc-426f-48e2-ad4a-b2f8c0cdb0cf Jul 21 14:21:28 psmp-gn-bw-01-lon-uk runner INFO 2428393: With iperf3: throughput --omit PT10S --duration PT20S --source perfsonar-nas.asnet.am --ip-version 4 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:22:02 psmp-gn-bw-01-lon-uk runner INFO 2428393: Run succeeded. Add Comment Collapse

andy [4:24 PM] no, the better question is why it happened at all. that should not happen

igarny [4:24 PM] meaning tasks that had been scheduled are being executed

[4:24] but new tasks are not being added

andy [4:25 PM] the task you just showed has a different source i think. also anything already on the schedule would continue to run fine

igarny [4:25 PM] also I am not able to make any conclusion based on the psc log

[4:25] yes you are right about the task

andy [4:26 PM] can you look up the pid of the pscheduler-server process and try running gstack PID?

[4:26] that should give a stack trace of what the pscheduler-server is doing

igarny [4:27 PM] I do not have that command on the server

[4:27] gstack 3517 -bash: gstack: command not found

[4:27] I do not have privileged access as well

[4:28] I am not supposed to manipulate the config of the servers outside the Puppet automation

andy [4:28 PM] ah ok, yeah its not standard. you could grab it with yum

mfeit [4:28 PM] I have a meeting in two minutes and a band thing to go to right afterward, so I won't be able to look this over until I get back.

igarny [4:28 PM] thanks Mark. NP

[4:29] it is one of these tricky things

mfeit [4:29 PM] Hopefully it's something simple.

igarny [4:29 PM] I am just looking for a way to understand how to detect it, so tat I can automate the restart of service

andy [4:33 PM] yeah, it seems like you are creating unnecessary work for yourself. that shouldn't be happening. i have 50 of these things deployed on esnet and can't remember the last time i had to restart any pscheduler stuff by hand. you could do a nightly restart if it'd make you feel better, but probably doing more than that is wasted cycles IMO. it'd be better to know what's causing it

igarny [4:34 PM] I can only tell these servers are added in several meshes

[4:35] pscheduler ping psmp-gn-bw-01-lon-uk pScheduler on psmp-gn-bw-01-lon-uk is alive.

[4:36] if I run from a remote host it states it is alive....but of course ...we can read the task descr

[4:36] Thanks Andy

[4:36] will try finding a clue

mfeit [4:59 PM] @igarny If there's a bug, I want to fix it so you don't have to detect anything and restart the service.

igarny [5:00 PM] yes of course... I have no doubts about it...

[5:00] I will open a ticket with an observation of mine

[5:00] or should I just type it here

mfeit [5:00 PM] Paste this conversation in there, too, please.

igarny [5:00 PM

mfeit-internet2 commented 1 year ago

I'm going to assume this has been resolved since it's been over five years.