Closed rhclopes closed 2 months ago
The second attached file psconfig.txt
and the elmond
The host page is
I expected to see in it all hosts from the UK mesh, some 15 hosts.
This is because you're on nightly and hit an upgrade path normal users won't see while we were testing an upgrade to a new version of OpenSearch. 5.1.0 beta will be announced shortly, so you can either do a reinstall from staging or try the following commands to repair your install from nightly:
/usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pre.sh
systemctl restart opensearch
systemctl restart logstash
/usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pos.sh
Elmond and postgresql are not gone in 5.1.0. postgresql is fundamental to pscheduler and elmond is not getting removed in 5.1.0.
It is expected that psconfig remote will show both agents on a toolkit.
I have run the commands above.
Are they documented in the perfSONAR pages?
After running them (2h ago) I still see only two entries for test result summary in
Am I missing another config?
Looks like your archive is working the way it is supposed to. On the Host Info page I see that your psconfig-pscheduler-agent seems to be failing every hour (which would be when it downloads the JSON to run). Any error in /var/log/perfsonar/psconfig-pscheduler-agent.log? there may also be something in journalctl -xe
is it is dying for some other reason. I feel like maybe we debugged an issue like this before...
We had noticed that job stopping at every hour. UJnfortunately, the journal doesn't have anything
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Starting pSConfig PScheduler Agent...
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░
░░ A start job for unit psconfig-pscheduler-agent.service has begun execution.
░░
░░ The job identifier is 1501062.
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Started pSConfig PScheduler Agent.
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░
░░ A start job for unit psconfig-pscheduler-agent.service has finished successfully.
░░
░░ The job identifier is 1501062.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: Stopping pSConfig PScheduler Agent...
░░ Subject: A stop job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░
░░ A stop job for unit psconfig-pscheduler-agent.service has begun execution.
░░
░░ The job identifier is 1503422.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: psconfig-pscheduler-agent.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Starting pSConfig PScheduler Agent...
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░
░░ A start job for unit psconfig-pscheduler-agent.service has begun execution.
░░
░░ The job identifier is 1501062.
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Started pSConfig PScheduler Agent.
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░
░░ A start job for unit psconfig-pscheduler-agent.service has finished successfully.
░░
░░ The job identifier is 1501062.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: Stopping pSConfig PScheduler Agent...
░░ Subject: A stop job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░
░░ A stop job for unit psconfig-pscheduler-agent.service has begun execution.
░░
░░ The job identifier is 1503422.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: psconfig-pscheduler-agent.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
The /var/log/messages, however, has an error at the start of every hour
Apr 18 19:59:04 ps-slough-bw grafana[320146]: logger=context userId=0 orgId=1 uname= t=2024-04-18T19:59:04.236810414+01:00 level=info msg="Request Completed" method=GET path=/api/live/ws status=400 remote_addr=2a02:8010:65d4:0:97ec:14be:b952:5c1 time_ms=1 duration=1.336288ms size=12 referer= handler=/api/live/ws status_source=server
Apr 18 19:59:07 ps-slough-bw perfsonar_host_exporter[320133]: 127.0.0.1 - - [18/Apr/2024 19:59:07] "GET / HTTP/1.1" 200 -
Apr 18 19:59:10 ps-slough-bw python3[1829929]: begin_ts=1713463150, end_ts=None
Apr 18 19:59:10 ps-slough-bw python3[1829929]: begin=2024-04-18 17:59:10
Apr 18 19:59:10 ps-slough-bw python3[1829929]: filters: [{'range': {'pscheduler.start_time': {'gte': datetime.datetime(2024, 4, 18, 17, 59, 10)}}}, {'bool': {'should': [{'term': {'test.type': 'trace'}}]}}, {'bool': {'should': [{'term': {'meta.source.ip': '2001:630:3c:f803::a'}}]}}, {'bool': {'should': [{'term': {'meta.destination.ip': '2001:630:a5:1200::a:3'}}]}}]
Apr 18 19:59:10 ps-slough-bw python3[1829929]: 2024-04-18 19:59:10,692 - app - ERROR - Exception on / [GET]
Apr 18 19:59:10 ps-slough-bw python3[1829929]: Traceback (most recent call last):
This seems to be an issue in Elmond
Sorry, I should have attached that.
When I try to upgrade the perfSONAR 5.1 nightly to 5.1 beta1 I don't get the perfsonar-toolkit. I believe that the for the nightly is shadowing the beta version.
I have now upgrade from nightly to beta1.
I had corrected before upgrading that error with psconfig-pscheduler-agent that breaks at every hour. I believe that error comes from elmond.
AFter the upgrade to beta1 the error came back. So, I ran the command Andy recommended yesterday, and it spits an error
[root@ps-slough-bw perfsonar]# /usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pre.sh /usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pre.sh: line 43: [: ==: unary operator expected
It's easy to fix by using "$INSTALL" in that line instead of $INSTALL.
I use ShellCheck on any bash with more than 5 lines. If you run shellcheck on that scripts you get dozens of suggestions that can potentially improve it.
Maybe I am being naive, but I've seen big cloud provider devs praising ShellCheck.
Elmond is still an error.
[root@ps-slough-bw log]# systemctl status elmond
● elmond.service - perfSONAR Elmond
Loaded: loaded (/usr/lib/systemd/system/elmond.service; enabled; preset: disabled)
Active: active (running) since Fri 2024-04-19 19:28:31 BST; 22s ago
Main PID: 30010 (python3)
Tasks: 1 (limit: 201504)
Memory: 28.2M
CPU: 350ms
CGroup: /system.slice/elmond.service
└─30010 /usr/bin/python3 /usr/lib/perfsonar/elmond/app.py
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: File "/usr/lib/python3.9/site-packages/opensearchpy/client/__init__.py", line 1742, in search
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: return self.transport.perform_request(
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 446, in perform_request
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: raise e
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 409, in perform_request
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: status, headers_response, data = connection.perform_request(
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: File "/usr/lib/python3.9/site-packages/opensearchpy/connection/http_urllib3.py", line 276, in perform_request
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: raise SSLError("N/A", str(e), e)
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: opensearchpy.exceptions.SSLError: ConnectionError([Errno 13] Permission denied) caused by: SSLError([Errno 13] Per>
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: 2024-04-19 19:28:51,502 - werkzeug - INFO - 127.0.0.1 - - [19/Apr/2024 19:28:51] "GET /?source=ps-slough-lat.perf.>
WRT elmond try running:
/usr/lib/perfsonar/archive/perfsonar-scripts/elmond_configuration.sh
systemctl restart elmond
Again this is an artifact of being on nightly while we were testing opensearch upgrade. That should get certificates and credentials used by elmond back in-sync.
I am much more interested in the psconfig restart issue. psconfig has no interaction with elmond, so that should not be the issue. We did make one fix today that corrects a problem where it was not detecting all the possible hostnames of a host, so some tests would not get created, but nothing that we had observed that was causing it to fail. We have not done another tag, but will probably do so next week which will get that fix into staging. Maybe we can debug at our next meeting or you and I can find some time to hop on zoom and look at it closer.
Running that elmond configuration and restart didn't help.
[root@ps-slough-bw raul]# /usr/lib/perfsonar/archive/perfsonar-scripts/elmond_configuration.sh
[root@ps-slough-bw raul]# systemctl restart elmond
[root@ps-slough-bw raul]# systemctl status elmond
● elmond.service - perfSONAR Elmond
Loaded: loaded (/usr/lib/systemd/system/elmond.service; enabled; preset: disabled)
Active: active (running) since Sat 2024-04-20 10:58:08 BST; 4s ago
Main PID: 273635 (python3)
Tasks: 1 (limit: 201504)
Memory: 28.1M
CPU: 337ms
CGroup: /system.slice/elmond.service
└─273635 /usr/bin/python3 /usr/lib/perfsonar/elmond/app.py
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: File "/usr/lib/python3.9/site-packages/opensearchpy/client/__init__.py", line 1742, in search
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: return self.transport.perform_request(
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 446, in perform_request
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: raise e
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 409, in perform_request
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: status, headers_response, data = connection.perform_request(
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: File "/usr/lib/python3.9/site-packages/opensearchpy/connection/http_urllib3.py", line 276, in perform_request
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: raise SSLError("N/A", str(e), e)
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: opensearchpy.exceptions.SSLError: ConnectionError([Errno 13] Permission denied) caused by: SSLError([Errno 13] Pe>
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: 2024-04-20 10:58:09,135 - werkzeug - INFO - 127.0.0.1 - - [20/Apr/2024 10:58:09] "GET /?source=ps-slough-lat.perf>
The error is showing at every minute.
I am available for a debugging session on Monday or Tuesday. You choose the time.
I am interesting in installing 5.1 through update because that is the path most institutes in the WLCG will follow.
The two fixes made yesterday definitely corrected those two errors: elmond is clean, and psceduler-agent is not being downed at every hour.
Yet, opensearch doesn't seem to have recent data.
When I run
curl -k -u pscheduler_reader:xxxx https://localhost:9200/pscheduler*/_search |jq|tail
"schedule": {
"start": "2024-03-11T19:38:21Z",
"until": "2024-03-12T19:38:21Z"
},
"host": "127.0.0.1"
}
}
]
}
}
12 March is the data when 5.1 nightly was installed.
the logs for dnf show
2024-03-12T12:53:35+0000 DEBUG ---> Package libperfsonar-toolkit-perl.noarch 5.0.8-1.el9 will be upgraded
2024-03-12T12:53:35+0000 DEBUG ---> Package libperfsonar-toolkit-perl.noarch 5.1.0-0.20240304215449.el9 will be an upgrade
From that moment, the local opensearch doesn't seem to have and the central archive also seems to stop have data for tests starting in this host.
Am I reading in the wrong place?
And psconfig shows
[root@ps-slough-bw log]# psconfig remote list
=== pScheduler Agent ===
[
{
"url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-bw.perf.ja.net",
"configure-archives": true
},
{
"url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-lat.perf.ja.net",
"configure-archives": true
}
]
=== Grafana Agent ===
[
{
"url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-bw.perf.ja.net",
"configure-archives": true
},
{
"url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-lat.perf.ja.net",
"configure-archives": true
}
]
After update to 5.1b, all entriews in opensearch.log are like
[2024-04-25T18:15:06,839][WARN ][o.o.p.c.u.JsonConverter ] [ps-slough-bw.perf.ja.net] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])
That error doesn't exist in 5.08.
When I run
wget https://psconfig.opensciencegrid.org/pub/auto/ps-slough-lat.perf.ja.net
The file shows
"host-archive48": {
"archiver": "esmond",
"data": {
"url": "https://ps-slough-lat.perf.ja.net/esmond/perfsonar/archive",
"measurement-agent": "{% scheduled_by_address %}",
"schema": 1
}
},
Is that right? the host ps-slough is on 5.1b. It doesn't have Esmond, right?
I have been testing the perfsonar toolkit pre 5.1 for a few weeks. After the last nightly updafe, opensearch may be broken. I don't results published for tests starting in my host
https://monit-grafana-open.cern.ch/d/-SMRsAx7z/perfsonar-h2h-performance?orgId=16&var-src_bw=ps-slough-bw.perf.ja.net&var-src_lat=ps-slough-lat.perf.ja.net&var-dst_bw=ps-bw.tier2.hep.manchester.ac.uk&var-dst_lat=ps.tier2.hep.manchester.ac.uk&var-dst_bw1=perfsonar-bandwidth.esc.qmul.ac.uk&var-dst_lat1=perfsonar-latency.esc.qmul.ac.uk&var-dst_bw2=ps-london-bw.perf.ja.net&var-dst_lat2=ps-london-lat.perf.ja.net
The opensearch logs contain exclusively errors and warnings. Please see attached file.
ps52.-update.txt
I also noticed that almond is still installed and active together with postgresql. I expected them to gone.
Last, psconfig remote list shows two sets of urls: one for scheduler agent and one for Grafana agent. Expected? Please. see attached file