perfsonar / mesh-config

Centralized configuration framework for measurement points and GUIs
Apache License 2.0
2 stars 0 forks source link

latencybg tasks get forgotten = not started until next 24h period #113

Closed igarny closed 6 years ago

igarny commented 6 years ago

Hello Andy,

I am observing these for some time, but kind of now have found a way to report them.

From the pscheduler log: the problem:

@psmp-gn-bw-02-fra-de ~]$ grep "With powstream: latencybg --dest psmp-gn-owd-01.fra.de" pscheduler.log
Apr  8 03:25:32 psmp-gn-bw-02-fra-de runner INFO     625816: With powstream: latencybg --dest psmp-gn-owd-01.fra.de.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 6 --packet-interval 0.1 --duration PT86400S --packet-count 1200
Apr  8 03:25:32 psmp-gn-bw-02-fra-de runner INFO     626038: With powstream: latencybg --dest psmp-gn-owd-01.fra.de.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 4 --packet-interval 0.1 --duration PT86400S --packet-count 1200

the running:

@psmp-gn-bw-02-fra-de ~]$ grep "With powstream: latencybg --dest psmp-gn-owd-01.sof.bg" pscheduler.log
Apr  8 03:25:32 psmp-gn-bw-02-fra-de runner INFO     626178: With powstream: latencybg --dest psmp-gn-owd-01.sof.bg.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 4 --packet-interval 0.1 --duration PT86400S --packet-count 1200
Apr  9 03:25:32 psmp-gn-bw-02-fra-de runner INFO     666369: With powstream: latencybg --dest psmp-gn-owd-01.sof.bg.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 6 --packet-interval 0.1 --duration PT86400S --packet-count 1200
Apr  9 03:25:32 psmp-gn-bw-02-fra-de runner INFO     666658: With powstream: latencybg --dest psmp-gn-owd-01.sof.bg.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 4 --packet-interval 0.1 --duration PT86400S --packet-count 1200

No direct errors observed in:

@psmp-gn-bw-02-fra-de ~]$ less /var/log/perfsonar/owamp
owamp_bwctl.log              owamp_bwctl.log-20180318.gz  owamp_bwctl.log-20180325.gz  owamp_bwctl.log-20180401.gz  owamp_bwctl.log-20180408.gz  owampd_cleaner_error.log

Again the issue:

@psmp-gn-bw-02-fra-de ~]$ date
Mon Apr  9 09:48:05 UTC 2018
@psmp-gn-bw-02-fra-de ~]$ grep "With powstream: latencybg --dest psmp-gn-owd-01.fra.de" pscheduler.log
Apr  8 03:25:32 psmp-gn-bw-02-fra-de runner INFO     625816: With powstream: latencybg --dest psmp-gn-owd-01.fra.de.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 6 --packet-interval 0.1 --duration PT86400S --packet-count 1200
Apr  8 03:25:32 psmp-gn-bw-02-fra-de runner INFO     626038: With powstream: latencybg --dest psmp-gn-owd-01.fra.de.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 4 --packet-interval 0.1 --duration PT86400S --packet-count 1200

The meshconfig events:

2018/04/08 09:43:40 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 10:43:41 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 11:43:48 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 22 new tasks, and deleted 0 old tasks
2018/04/08 12:43:44 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 13:43:45 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 14:43:40 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 15:43:54 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 16:43:43 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 17:43:42 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 18:43:47 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 19:43:42 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 21:10:31 (52312) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test latencybg(psmp-gn-owd-02.fra.de.geant.net->psmp-gn-owd-01-mil2-it-v4.geant.net), continuing with rest of config: 500 Internal Server Error: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator,
 root@localhost and inform them of the time the error occurred,
and anything you might have done that may have
caused the error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>
2018/04/08 21:10:31 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 242 new tasks, and deleted 0 old tasks
2018/04/08 21:44:20 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 24 new tasks, and deleted 0 old tasks
2018/04/08 22:43:56 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/08 23:43:45 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 00:43:45 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 01:43:49 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 02:44:05 (52312) INFO> perfsonar_meshconfig_agent:438 main:: - Added 22 new tasks, and deleted 0 old tasks
2018/04/09 03:25:56 (54385) WARN> perfsonar_meshconfig_agent:430 main:: - Problem adding test trace(psmp-gn-owd-01-pra-cz-v6.geant.net->psmp-gn-owd-02.fra.de.geant.net), continuing with rest of config: 500 INTERNAL SERVER ERROR: Unable to determine participants: Neither the source nor destination is running pScheduler.

2018/04/09 03:25:56 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 04:25:41 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 05:25:40 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 06:25:42 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 07:25:49 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks
2018/04/09 08:26:01 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 21 new tasks, and deleted 0 old tasks
2018/04/09 09:25:49 (54385) INFO> perfsonar_meshconfig_agent:438 main:: - Added 20 new tasks, and deleted 0 old tasks

My hope is to have this automation in pSconfig, that would identify missing measurement runs of latencybg and attempt to restart them on the next configuration re-read.

Should I open a ticket in pSconfig?

igarny commented 6 years ago

I have found another example ... different from the above. Despite the successful submission of the task pscheduler fails with a strange error. This results to no measurements started, which again leads to the reported problem: "no additional attempts are made to sustain the latencybg tests running"

Apr  9 03:25:33 psmp-gn-bw-02-fra-de runner INFO     626037: With powstream: latencybg --dest psmp-gn-owd-01-ham-de-v4.geant.net --packet-padding 0 --source psmp-gn-owd-02.fra.de.geant.net --ip-version 4 --packet-interval 0.1 --duration PT86400S --packet-count 1200
Apr  9 03:25:33 psmp-gn-bw-02-fra-de runner ERROR    626037: Failed to set running state: Invalid transition between states (5 to 3).
arlake228 commented 6 years ago

Both of your comments are two separate issues. Probably need @mfeit for both of them and may need some pscheduler issues open instead of this one:

  1. I believe the task here is an example of one of them you listed in your first comment: https://psmp-gn-owd-02.fra.de.geant.net/pscheduler/tasks/3aa993ef-600d-49c0-a630-10bdf4e3cd6b?expanded&pretty&detail

mesh-config actually does try to detect when a latencybg fails and recreate it. latencybg tasks can fail a little differently than others, so one special thing it checks is that after the start time of a background-multi task plus some delta, if there is only 1 run, it assumes the task failed. Since it doesn't know anything else about how to read tests, like how frequent a task should be creating runs this is about the best it can do, but should capture most of the failure cases and correct things within about an hour as opposed to 24 hours. It determines this by looking at the task JSON's detail section's runs field. Notice that in the task I shared above runs is actually set to 2. What's strange is if you go to https://psmp-gn-owd-02.fra.de.geant.net/pscheduler/tasks/3aa993ef-600d-49c0-a630-10bdf4e3cd6b/runs, there is only 1 run listed. It seems like that the value provided in "runs" is incorrect. I'm wondering if maybe the runs are getting miscounted by pScheduler? I am actually able to see this is a couple ESnet tasks this morning as well, so its likely a more general issue.

  1. That state transition error is also specific to pScheduler. I suspect that can cause a condition like I describe in item 1 above, but @mfeit is a better one to say whether that is a valid transition and what might be causing it.
arlake228 commented 6 years ago

A helpful query for finding tasks in the state where "runs" says 2 but if you look at the runs list it only shows one is as follows: https://HOSTNAME/pscheduler/tasks?pretty&expanded&detail&json={%22test%22:{%22type%22:%22latencybg%22},"detail":{"runs":2}}

Example: https://lbl-pt1.es.net/pscheduler/tasks?pretty&expanded&detail&json={%22test%22:{%22type%22:%22latencybg%22},"detail":{"runs":2}}

arlake228 commented 6 years ago

Casey Russell is also seeing the run count issue: https://ps-psu-lt.perfsonar.kanren.net/pscheduler/tasks?pretty&expanded&detail&json={%22test%22:{%22type%22:%22latencybg%22},%22detail%22:{%22runs%22:2}}

arlake228 commented 6 years ago

Mark added a new field to help detect stuck processes so they don't hang around for 24 hours. I have tested pretty extensively and have seen this work in production so going to close for now. Fix is part of 4.1.