Closed igarny closed 1 year ago
Here is the communication on the topic from Slack: Hi Andy
[3:50] Last Friday I had this strange behavior of meshconfig not defining any latency task with psc
[3:50] Today I see this on 2 servers
[3:51] Anything I can do to understand why would meshconfig ignore the config in the -tasks file
andy [3:51 PM] you mean rtt tests?
igarny [3:52 PM] powstream
[3:52] I am missing powstream for some hours already
andy [3:54 PM] can you send me the endpoints of the tests that are missing?
igarny [3:55 PM] everything coming from this server: https://prod-psmdsh-gn.geant.net/perfsonar-graphs/?url=http://prod-psma-gn-01-buc-ro.geant.net/esmond/perfsonar/archive/&source=psmp-gn-owd-01.gen.ch.geant.net&dest=psmp-gn-owd-01-ath-gr-v6.geant.net&ipversion=6#start=1500040553&end=1500645353&summaryWindow=3600&timeframe=1w
[3:56] I will initiate a manual latency test now
andy [3:57 PM] which of these is the server that is running the mesh-config that is supposed to create the test?
igarny [3:57 PM] the sender
andy [4:00 PM] looks like task is created but it has no runs
[4:00] task is here: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef
[4:01] lack of runs is here: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs
igarny [4:01 PM] added this Plain Text snippet pscheduler task latencybg --source psmp-gn-owd-01.gen.ch.geant.net --packet-padding 0 --dest psmp-gn-owd-01-ath-gr-v4.geant.net --ip-version 4 --packet-interval 0.1 --duration PT400S --packet-count 1200 Submitting task... Task URL: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/54aa6b48-e577-4940-993e-6a2ca5f24abf Running with tool 'powstream' Fetching first run... psmp-gn-owd-01.gen.ch.geant.net never scheduled a run for the task. Add Comment Click to expand inline 8 lines
andy [4:02 PM] ok, that's consistent with the one from the mesh-config
[4:02] maybe try restarting pscheduler-scheduler?
[4:02] it should at least be creating the first run
igarny [4:02 PM] yes ...thanks
[4:03] at least I have a lead now...
andy [4:03 PM] fwiw, this is the query i used to check if pscheduler knew about the task: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks?json={%22tool%22:%22powstream%22,%20%22test%22:{%22spec%22:{%22dest%22:%22psmp-gn-owd-01-ath-gr-v6.geant.net%22}}}
mfeit [4:04 PM] I have to bolt, but turning on debug in the scheduler would be useful.
igarny [4:04 PM] https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/751e28fe-ff2a-40c1-9c75-6a7f32390df4/runs/first
[4:04] but it is taking too long to respond
andy [4:04 PM] if you do /first it blocks until there is a run or you timeout
igarny [4:04 PM] nevertheless pscheduler monitor works
andy [4:05 PM] you can also do a refresh on https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs. it should be non-empty if the pscheduler-scheduler is working like expected
igarny [4:06 PM] the problem with this is that I am running this task from CLI
[4:06] and for 400s only
andy [4:07 PM] the task i linked to above doesn't end until tomorrow, so if the pscheduler-server comes back and is working properly, it should fill-in a run
igarny [4:08 PM] added this Plain Text snippet Jul 21 14:01:18 psmp-gn-bw-01-gen-ch runner INFO 2233259: With iperf3: throughput --omit PT10S --duration PT20S --source psmall-poz1.man.poznan.pl --ip-version 6 --dest psmp-gn-bw-01-gen-ch.geant.net --parallel 1 Jul 21 14:01:52 psmp-gn-bw-01-gen-ch runner INFO 2233259: Run succeeded. Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Debug signal ignored; already not debugging Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Debug discontinued Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Assistance is from localhost Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Forcing default slip of PT5M Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Converting to spec via https://localhost/pscheduler/tests/latencybg/spec Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Fetching participant list Jul 21 14:01:57 psmp-gn-bw-01-gen-ch task DEBUG Spec is: {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1} Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Got participants: {u'participants': [u'psmp-gn-owd-01.gen.ch.geant.net']} Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Lead is psmp-gn-owd-01.gen.ch.geant.net Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Pinging https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/ Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG psmp-gn-owd-01.gen.ch.geant.net is up Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Posting task to https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks Jul 21 14:01:58 psmp-gn-bw-01-gen-ch task DEBUG Data is {"test": {"type": "latencybg", "spec": {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1} Jul 21 14:02:00 psmp-gn-bw-01-gen-ch task DEBUG Posted https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/751e28fe-ff2a-40c1-9c75-6a7f32390df4 Jul 21 14:02:00 psmp-gn-bw-01-gen-ch task DEBUG Fetching https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/751e28fe-ff2a-40c1-9c75-6a7f32390df4/runs/first Jul 21 14:05:43 psmp-gn-bw-01-gen-ch scheduler INFO Started Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Debug signal ignored; already not debugging Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Debug discontinued Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Assistance is from localhost Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Forcing default slip of PT5M Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Converting to spec via https://localhost/pscheduler/tests/latencybg/spec Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Fetching participant list Jul 21 14:05:51 psmp-gn-bw-01-gen-ch task DEBUG Spec is: {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1} Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Got participants: {u'participants': [u'psmp-gn-owd-01.gen.ch.geant.net']} Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Lead is psmp-gn-owd-01.gen.ch.geant.net Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Pinging https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/ Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG psmp-gn-owd-01.gen.ch.geant.net is up Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Posting task to https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks Jul 21 14:05:52 psmp-gn-bw-01-gen-ch task DEBUG Data is {"test": {"type": "latencybg", "spec": {"dest": "psmp-gn-owd-01-ath-gr-v4.geant.net", "packet-padding": 0, "source": "psmp-gn-owd-01.gen.ch.geant.net", "ip-version": 4, "packet-interval": 0.10000000000000001, "duration": "PT400S", "packet-count": 1200, "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1} Jul 21 14:05:55 psmp-gn-bw-01-gen-ch task DEBUG Posted https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/a63d1af1-fd85-4d64-b8b8-75095eb6db2f Jul 21 14:05:55 psmp-gn-bw-01-gen-ch task DEBUG Fetching https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/a63d1af1-fd85-4d64-b8b8-75095eb6db2f/runs/first Jul 21 14:06:16 psmp-gn-bw-01-gen-ch runner INFO 2233421: Running https://psmp-gn-bw-01-gen-ch.geant.net/pscheduler/tasks/4f4e1903-550b-4b2f-92cc-8fdba8fd4c82/runs/bc5a076a-6277-4f93-b133-3a805d5df0fe Add Comment Click to expand inline 35 lines
igarny [4:08 PM] the psc log show now signs of service being blocked
[4:08] I have restarted it already
[4:09] nevertheless the 3 CLI attempt ended with "never scheduled a run for this task"
andy [4:09 PM]
k, did you do a pscheduler debug on scheduler
as well per Mark's suggestion? (edited)
[4:10] that may or may not show us something useful
igarny [4:10 PM] I am runnign the tasks with debug, but now I will start a full debug
andy [4:11 PM] yeah, probably debug on the server side would be helpful too. just turing it on for the scheduler is probably enough to start with since its the one responsible for creating runs
[4:11] hey, i see a run now here: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs
[4:12] so looks like something you just did fixed it
igarny [4:12 PM] my 4th CLI attempt though failed with the same error
[4:13] I have started the debug mode only
[4:13] apart from the CLI, the powstream processes are back
andy [4:13 PM]
if you do a ps auxw | grep e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef
do you get back any powstream commands?
igarny [4:14 PM] yes
[4:14] 3 of them
[4:14] ps auxw | grep e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef 601 707 0.0 0.0 7096 416 ? Ss 14:13 0:00 /usr/bin/powstream -p -d /var/lib/pscheduler/tool/powstream/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef-2017Jul21T141100367455 -s 0 -c 1200 -i 0.1 -6 -t -S psmp-gn-owd-01.gen.ch.geant.net [psmp-gn-owd-01-ath-gr-v6.geant.net]:861 1412959200 2238 0.0 0.0 0 0 pts/0 S+ 14:14 0:00 grep e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef 601 60614 0.0 0.0 7096 1156 ? S 14:11 0:00 /usr/bin/powstream -p -d /var/lib/pscheduler/tool/powstream/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef-2017Jul21T141100367455 -s 0 -c 1200 -i 0.1 -6 -t -S psmp-gn-owd-01.gen.ch.geant.net [psmp-gn-owd-01-ath-gr-v6.geant.net]:861 601 60648 0.0 0.0 7052 284 ? Ss 14:11 0:00 /usr/bin/powstream -p -d /var/lib/pscheduler/tool/powstream/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef-2017Jul21T141100367455 -s 0 -c 1200 -i 0.1 -6 -t -S psmp-gn-owd-01.gen.ch.geant.net [psmp-gn-owd-01-ath-gr-v6.geant.net]:861
andy [4:14 PM] ok, perfect that means powstream is running
igarny [4:14 PM] yes...I will go to the next server to apply the fix
andy [4:14 PM] i also see a result now: https://psmp-gn-owd-01.gen.ch.geant.net/pscheduler/tasks/e6c0e7ed-1b8a-4de6-8f07-ecf112cbedef/runs/3c8a2fa5-861a-46f0-b0bc-66cc3ef862d2
[4:15]
if you do a cli command now does it scheduler a run? even if you do a simple pscheduler task rtt --dest es.net
we should get an idea (edited)
[4:16] basically just curious if the scheduler was stuck before or if it is just waaay behind on adding runs (edited)
igarny [4:16 PM] so the CLI with rtt test succeeded
[4:17] I will try that on the second server
andy [4:17 PM] ok, so must have been stuck before. all you did was restart pscheduler-scheduler and turn debug on?
igarny [4:17 PM] yes
[4:17] well several CLI attempts as well
andy [4:18 PM] hmmm ok, when @mfeit gets back we should maybe see if there is any other diagnostics he wants you to grab from the logs.
igarny [4:18 PM] added this Plain Text snippet pscheduler task rtt --dest es.net Submitting task... Task URL: https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/1a234791-a74f-428d-b944- 1ec1264e581b Running with tool 'ping' Fetching first run... psmp-gn-bw-01-lon-uk.geant.net never scheduled a run for the task. Add Comment Click to expand inline 8 lines
igarny [4:18 PM] London server
andy [4:19 PM] ok, so same issue like you suspected
igarny [4:19 PM] added and commented on this Plain Text snippet no Jul 21 14:15:19 psmp-gn-bw-01-lon-uk runner INFO 2428566: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/39b3c942-7515-4193-bfc2-ebcbab496b0c/runs/21db5b41-2d86-42cb-b0ef-c4f5d84b6216 Jul 21 14:15:19 psmp-gn-bw-01-lon-uk runner INFO 2428566: With iperf3: throughput --duration PT30S --source psmp-gn-bw-01-lis-pt-v6.geant.net --ip-version 6 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:15:52 psmp-gn-bw-01-lon-uk runner INFO 2428566: Run succeeded. Jul 21 14:17:38 psmp-gn-bw-01-lon-uk runner INFO 2428573: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/eddad4dd-761a-41d0-b62d-76c2c7396d69/runs/7c12c527-ca43-40d8-b129-5a205b9e2c18 Jul 21 14:17:38 psmp-gn-bw-01-lon-uk runner INFO 2428573: With iperf3: throughput --duration PT30S --source psmp-gn-bw-01-vie-at-v4.geant.net --ip-version 4 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:18:11 psmp-gn-bw-01-lon-uk runner INFO 2428573: Run succeeded. 1 Comment Click to expand inline 6 lines the logs of pscheduler
igarny [4:19 PM] psc logs is in this state for a min already
andy [4:20 PM] my guess is on the first server the pscheduler-scheduler restart is what actually fixed it, probably just took a minute to see that fact since it was likely busy catching up on all the missed runs
[4:20] not sure why it got stuck to begin with though
igarny [4:21 PM] so on London server I still have not restarted psc-sched
[4:21] but the second run of rtt from CLI with debug mode results in events in the logs
[4:21] still the outcome is the same
[4:21] the task is blocked
[4:22] my question to you and Mark would be, how do I detect it....
[4:23] should I be running regular rtt tasks, so that I know when to restart the daemon
[4:23] in the event log of psc I just got an update
igarny [4:23 PM] added this Plain Text snippet Jul 21 14:17:38 psmp-gn-bw-01-lon-uk runner INFO 2428573: With iperf3: throughput --duration PT30S --source psmp-gn-bw-01-vie-at-v4.geant.net --ip-version 4 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:18:11 psmp-gn-bw-01-lon-uk runner INFO 2428573: Run succeeded. Jul 21 14:19:44 psmp-gn-bw-01-lon-uk runner INFO 2428370: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/893c9759-9d36-43ea-b1c5-fd2e23a0b204/runs/0040d207-e606-4363-8392-5ce122cb54cd Jul 21 14:19:44 psmp-gn-bw-01-lon-uk runner INFO 2428370: With iperf3: throughput --duration PT20S --source ps.test.manlan.internet2.edu --dest 62.40.106.131 --parallel 1 Jul 21 14:20:08 psmp-gn-bw-01-lon-uk runner INFO 2428370: Run succeeded. Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Debug signal ignored; already not debugging Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Debug discontinued Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Assistance is from localhost Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Forcing default slip of PT5M Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Converting to spec via https://localhost/pscheduler/tests/rtt/spec Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Fetching participant list Jul 21 14:20:11 psmp-gn-bw-01-lon-uk task DEBUG Spec is: {"dest": "es.net", "schema": 1} Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Got participants: {u'participants': [None], u'null-reason': u'No source specified'} Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Lead is None Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Null reason is No source specified Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Pinging https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/ Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG psmp-gn-bw-01-lon-uk.geant.net is up Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Posting task to https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks Jul 21 14:20:14 psmp-gn-bw-01-lon-uk task DEBUG Data is {"test": {"type": "rtt", "spec": {"dest": "es.net", "schema": 1}}, "schedule": {"slip": "PT5M"}, "schema": 1} Jul 21 14:20:20 psmp-gn-bw-01-lon-uk task DEBUG Posted https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71 Jul 21 14:20:20 psmp-gn-bw-01-lon-uk task DEBUG Fetching https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/c77a6d33-0fd1-406b-95fd-0e7020664a71/runs/first Jul 21 14:21:00 psmp-gn-bw-01-lon-uk runner INFO 2428475: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/62c9cc08-da30-4a46-9f56-11775fd82d18/runs/5acc890f-c04f-4116-b98b-4f548f9adb3a Jul 21 14:21:00 psmp-gn-bw-01-lon-uk runner INFO 2428475: With iperf3: throughput --omit PT5S --duration PT10S --source lond-pt1.es.net --dest psmp-gn-bw-01-lon-uk.geant.net --window-size 67108864 --parallel 1 Jul 21 14:21:18 psmp-gn-bw-01-lon-uk runner INFO 2428475: Run succeeded. Jul 21 14:21:28 psmp-gn-bw-01-lon-uk runner INFO 2428393: Running https://psmp-gn-bw-01-lon-uk.geant.net/pscheduler/tasks/5ee0fb22-49ad-4b0a-8a67-d9f683f8a291/runs/53ca04fc-426f-48e2-ad4a-b2f8c0cdb0cf Jul 21 14:21:28 psmp-gn-bw-01-lon-uk runner INFO 2428393: With iperf3: throughput --omit PT10S --duration PT20S --source perfsonar-nas.asnet.am --ip-version 4 --dest psmp-gn-bw-01-lon-uk.geant.net --parallel 1 Jul 21 14:22:02 psmp-gn-bw-01-lon-uk runner INFO 2428393: Run succeeded. Add Comment Collapse
andy [4:24 PM] no, the better question is why it happened at all. that should not happen
igarny [4:24 PM] meaning tasks that had been scheduled are being executed
[4:24] but new tasks are not being added
andy [4:25 PM] the task you just showed has a different source i think. also anything already on the schedule would continue to run fine
igarny [4:25 PM] also I am not able to make any conclusion based on the psc log
[4:25] yes you are right about the task
andy [4:26 PM]
can you look up the pid of the pscheduler-server process and try running gstack PID
?
[4:26] that should give a stack trace of what the pscheduler-server is doing
igarny [4:27 PM] I do not have that command on the server
[4:27] gstack 3517 -bash: gstack: command not found
[4:27] I do not have privileged access as well
[4:28] I am not supposed to manipulate the config of the servers outside the Puppet automation
andy [4:28 PM] ah ok, yeah its not standard. you could grab it with yum
mfeit [4:28 PM] I have a meeting in two minutes and a band thing to go to right afterward, so I won't be able to look this over until I get back.
igarny [4:28 PM] thanks Mark. NP
[4:29] it is one of these tricky things
mfeit [4:29 PM] Hopefully it's something simple.
igarny [4:29 PM] I am just looking for a way to understand how to detect it, so tat I can automate the restart of service
andy [4:33 PM] yeah, it seems like you are creating unnecessary work for yourself. that shouldn't be happening. i have 50 of these things deployed on esnet and can't remember the last time i had to restart any pscheduler stuff by hand. you could do a nightly restart if it'd make you feel better, but probably doing more than that is wasted cycles IMO. it'd be better to know what's causing it
igarny [4:34 PM] I can only tell these servers are added in several meshes
[4:35] pscheduler ping psmp-gn-bw-01-lon-uk pScheduler on psmp-gn-bw-01-lon-uk is alive.
[4:36] if I run from a remote host it states it is alive....but of course ...we can read the task descr
[4:36] Thanks Andy
[4:36] will try finding a clue
mfeit [4:59 PM] @igarny If there's a bug, I want to fix it so you don't have to detect anything and restart the service.
igarny [5:00 PM] yes of course... I have no doubts about it...
[5:00] I will open a ticket with an observation of mine
[5:00] or should I just type it here
mfeit [5:00 PM] Paste this conversation in there, too, please.
igarny [5:00 PM
I'm going to assume this has been resolved since it's been over five years.
Problem: there are seldom cases, where locally initiated tasks are not being scheduled. Error message from CLI is: “never scheduled a run for the task.” (see below) The weirdness about it is that the task is a single participant with local node as source and is a non-blocking one.
pscheduler task --debug rtt --dest es.net