perfsonar / toolkit

perfSONAR Toolkit distribution environment scripts and GUI
Apache License 2.0
30 stars 5 forks source link

ps 5.1 tests -opensearch broken? #469

Open rhclopes opened 2 months ago

rhclopes commented 2 months ago

I have been testing the perfsonar toolkit pre 5.1 for a few weeks. After the last nightly updafe, opensearch may be broken. I don't results published for tests starting in my host

https://monit-grafana-open.cern.ch/d/-SMRsAx7z/perfsonar-h2h-performance?orgId=16&var-src_bw=ps-slough-bw.perf.ja.net&var-src_lat=ps-slough-lat.perf.ja.net&var-dst_bw=ps-bw.tier2.hep.manchester.ac.uk&var-dst_lat=ps.tier2.hep.manchester.ac.uk&var-dst_bw1=perfsonar-bandwidth.esc.qmul.ac.uk&var-dst_lat1=perfsonar-latency.esc.qmul.ac.uk&var-dst_bw2=ps-london-bw.perf.ja.net&var-dst_lat2=ps-london-lat.perf.ja.net

The opensearch logs contain exclusively errors and warnings. Please see attached file.

ps52.-update.txt

I also noticed that almond is still installed and active together with postgresql. I expected them to gone.

Last, psconfig remote list shows two sets of urls: one for scheduler agent and one for Grafana agent. Expected? Please. see attached file

rhclopes commented 2 months ago

The second attached file psconfig.txt

and the elmond

elmond.txt

The host page is

https://ps-slough-lat.perf.ja.net/grafana/d/ff439cb3-6042-43b5-a999-674681d80395/perfsonar-main?orgId=1

I expected to see in it all hosts from the UK mesh, some 15 hosts.

arlake228 commented 2 months ago

This is because you're on nightly and hit an upgrade path normal users won't see while we were testing an upgrade to a new version of OpenSearch. 5.1.0 beta will be announced shortly, so you can either do a reinstall from staging or try the following commands to repair your install from nightly:

/usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pre.sh
systemctl restart opensearch
systemctl restart logstash
/usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pos.sh

Elmond and postgresql are not gone in 5.1.0. postgresql is fundamental to pscheduler and elmond is not getting removed in 5.1.0.

It is expected that psconfig remote will show both agents on a toolkit.

rhclopes commented 2 months ago

I have run the commands above.

Are they documented in the perfSONAR pages?

After running them (2h ago) I still see only two entries for test result summary in

https://ps-slough-lat.perf.ja.net/grafana/d/ff439cb3-6042-43b5-a999-674681d80395/perfsonar-main?orgId=1

Am I missing another config?

arlake228 commented 2 months ago

Looks like your archive is working the way it is supposed to. On the Host Info page I see that your psconfig-pscheduler-agent seems to be failing every hour (which would be when it downloads the JSON to run). Any error in /var/log/perfsonar/psconfig-pscheduler-agent.log? there may also be something in journalctl -xe is it is dying for some other reason. I feel like maybe we debugged an issue like this before...

rhclopes commented 2 months ago

We had noticed that job stopping at every hour. UJnfortunately, the journal doesn't have anything

Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Starting pSConfig PScheduler Agent...
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░ 
░░ A start job for unit psconfig-pscheduler-agent.service has begun execution.
░░ 
░░ The job identifier is 1501062.
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Started pSConfig PScheduler Agent.
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░ 
░░ A start job for unit psconfig-pscheduler-agent.service has finished successfully.
░░ 
░░ The job identifier is 1501062.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: Stopping pSConfig PScheduler Agent...
░░ Subject: A stop job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░ 
░░ A stop job for unit psconfig-pscheduler-agent.service has begun execution.
░░ 
░░ The job identifier is 1503422.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: psconfig-pscheduler-agent.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Starting pSConfig PScheduler Agent...
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░ 
░░ A start job for unit psconfig-pscheduler-agent.service has begun execution.
░░ 
░░ The job identifier is 1501062.
Apr 18 18:00:07 ps-slough-bw.perf.ja.net systemd[1]: Started pSConfig PScheduler Agent.
░░ Subject: A start job for unit psconfig-pscheduler-agent.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░ 
░░ A start job for unit psconfig-pscheduler-agent.service has finished successfully.
░░ 
░░ The job identifier is 1501062.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: Stopping pSConfig PScheduler Agent...
░░ Subject: A stop job for unit psconfig-pscheduler-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
░░ 
░░ A stop job for unit psconfig-pscheduler-agent.service has begun execution.
░░ 
░░ The job identifier is 1503422.
Apr 18 19:00:02 ps-slough-bw.perf.ja.net systemd[1]: psconfig-pscheduler-agent.service: Deactivated successfully.
░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://access.redhat.com/support
rhclopes commented 2 months ago

The /var/log/messages, however, has an error at the start of every hour

Apr 18 19:59:04 ps-slough-bw grafana[320146]: logger=context userId=0 orgId=1 uname= t=2024-04-18T19:59:04.236810414+01:00 level=info msg="Request Completed" method=GET path=/api/live/ws status=400 remote_addr=2a02:8010:65d4:0:97ec:14be:b952:5c1 time_ms=1 duration=1.336288ms size=12 referer= handler=/api/live/ws status_source=server
Apr 18 19:59:07 ps-slough-bw perfsonar_host_exporter[320133]: 127.0.0.1 - - [18/Apr/2024 19:59:07] "GET / HTTP/1.1" 200 -
Apr 18 19:59:10 ps-slough-bw python3[1829929]: begin_ts=1713463150, end_ts=None
Apr 18 19:59:10 ps-slough-bw python3[1829929]: begin=2024-04-18 17:59:10
Apr 18 19:59:10 ps-slough-bw python3[1829929]: filters: [{'range': {'pscheduler.start_time': {'gte': datetime.datetime(2024, 4, 18, 17, 59, 10)}}}, {'bool': {'should': [{'term': {'test.type': 'trace'}}]}}, {'bool': {'should': [{'term': {'meta.source.ip': '2001:630:3c:f803::a'}}]}}, {'bool': {'should': [{'term': {'meta.destination.ip': '2001:630:a5:1200::a:3'}}]}}]
Apr 18 19:59:10 ps-slough-bw python3[1829929]: 2024-04-18 19:59:10,692 - app - ERROR - Exception on / [GET]
Apr 18 19:59:10 ps-slough-bw python3[1829929]: Traceback (most recent call last):

This seems to be an issue in Elmond

rhclopes commented 2 months ago

Sorry, I should have attached that.

rhclopes commented 2 months ago

When I try to upgrade the perfSONAR 5.1 nightly to 5.1 beta1 I don't get the perfsonar-toolkit. I believe that the for the nightly is shadowing the beta version.

rhclopes commented 2 months ago

I have now upgrade from nightly to beta1.

I had corrected before upgrading that error with psconfig-pscheduler-agent that breaks at every hour. I believe that error comes from elmond.

AFter the upgrade to beta1 the error came back. So, I ran the command Andy recommended yesterday, and it spits an error

[root@ps-slough-bw perfsonar]# /usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pre.sh /usr/lib/perfsonar/archive/perfsonar-scripts/pselastic_secure_pre.sh: line 43: [: ==: unary operator expected

It's easy to fix by using "$INSTALL" in that line instead of $INSTALL.

I use ShellCheck on any bash with more than 5 lines. If you run shellcheck on that scripts you get dozens of suggestions that can potentially improve it.

Maybe I am being naive, but I've seen big cloud provider devs praising ShellCheck.

rhclopes commented 2 months ago

Elmond is still an error.

[root@ps-slough-bw log]# systemctl status elmond
● elmond.service - perfSONAR Elmond
     Loaded: loaded (/usr/lib/systemd/system/elmond.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-04-19 19:28:31 BST; 22s ago
   Main PID: 30010 (python3)
      Tasks: 1 (limit: 201504)
     Memory: 28.2M
        CPU: 350ms
     CGroup: /system.slice/elmond.service
             └─30010 /usr/bin/python3 /usr/lib/perfsonar/elmond/app.py

Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:   File "/usr/lib/python3.9/site-packages/opensearchpy/client/__init__.py", line 1742, in search
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:     return self.transport.perform_request(
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:   File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 446, in perform_request
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:     raise e
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:   File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 409, in perform_request
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:     status, headers_response, data = connection.perform_request(
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:   File "/usr/lib/python3.9/site-packages/opensearchpy/connection/http_urllib3.py", line 276, in perform_request
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]:     raise SSLError("N/A", str(e), e)
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: opensearchpy.exceptions.SSLError: ConnectionError([Errno 13] Permission denied) caused by: SSLError([Errno 13] Per>
Apr 19 19:28:51 ps-slough-bw.perf.ja.net python3[30010]: 2024-04-19 19:28:51,502 - werkzeug - INFO - 127.0.0.1 - - [19/Apr/2024 19:28:51] "GET /?source=ps-slough-lat.perf.>
arlake228 commented 2 months ago

WRT elmond try running:

/usr/lib/perfsonar/archive/perfsonar-scripts/elmond_configuration.sh
systemctl restart elmond

Again this is an artifact of being on nightly while we were testing opensearch upgrade. That should get certificates and credentials used by elmond back in-sync.

I am much more interested in the psconfig restart issue. psconfig has no interaction with elmond, so that should not be the issue. We did make one fix today that corrects a problem where it was not detecting all the possible hostnames of a host, so some tests would not get created, but nothing that we had observed that was causing it to fail. We have not done another tag, but will probably do so next week which will get that fix into staging. Maybe we can debug at our next meeting or you and I can find some time to hop on zoom and look at it closer.

rhclopes commented 2 months ago

Running that elmond configuration and restart didn't help.

[root@ps-slough-bw raul]# /usr/lib/perfsonar/archive/perfsonar-scripts/elmond_configuration.sh
[root@ps-slough-bw raul]# systemctl restart elmond
[root@ps-slough-bw raul]# systemctl status elmond
● elmond.service - perfSONAR Elmond
     Loaded: loaded (/usr/lib/systemd/system/elmond.service; enabled; preset: disabled)
     Active: active (running) since Sat 2024-04-20 10:58:08 BST; 4s ago
   Main PID: 273635 (python3)
      Tasks: 1 (limit: 201504)
     Memory: 28.1M
        CPU: 337ms
     CGroup: /system.slice/elmond.service
             └─273635 /usr/bin/python3 /usr/lib/perfsonar/elmond/app.py

Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:   File "/usr/lib/python3.9/site-packages/opensearchpy/client/__init__.py", line 1742, in search
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:     return self.transport.perform_request(
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:   File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 446, in perform_request
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:     raise e
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:   File "/usr/lib/python3.9/site-packages/opensearchpy/transport.py", line 409, in perform_request
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:     status, headers_response, data = connection.perform_request(
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:   File "/usr/lib/python3.9/site-packages/opensearchpy/connection/http_urllib3.py", line 276, in perform_request
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]:     raise SSLError("N/A", str(e), e)
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: opensearchpy.exceptions.SSLError: ConnectionError([Errno 13] Permission denied) caused by: SSLError([Errno 13] Pe>
Apr 20 10:58:09 ps-slough-bw.perf.ja.net python3[273635]: 2024-04-20 10:58:09,135 - werkzeug - INFO - 127.0.0.1 - - [20/Apr/2024 10:58:09] "GET /?source=ps-slough-lat.perf>

The error is showing at every minute.

rhclopes commented 2 months ago

I am available for a debugging session on Monday or Tuesday. You choose the time.

I am interesting in installing 5.1 through update because that is the path most institutes in the WLCG will follow.

rhclopes commented 2 months ago

The two fixes made yesterday definitely corrected those two errors: elmond is clean, and psceduler-agent is not being downed at every hour.

Yet, opensearch doesn't seem to have recent data.

When I run

curl -k -u pscheduler_reader:xxxx https://localhost:9200/pscheduler*/_search |jq|tail

          "schedule": {
            "start": "2024-03-11T19:38:21Z",
            "until": "2024-03-12T19:38:21Z"
          },
          "host": "127.0.0.1"
        }
      }
    ]
  }
}

12 March is the data when 5.1 nightly was installed.

the logs for dnf show

2024-03-12T12:53:35+0000 DEBUG ---> Package libperfsonar-toolkit-perl.noarch 5.0.8-1.el9 will be upgraded
2024-03-12T12:53:35+0000 DEBUG ---> Package libperfsonar-toolkit-perl.noarch 5.1.0-0.20240304215449.el9 will be an upgrade

From that moment, the local opensearch doesn't seem to have and the central archive also seems to stop have data for tests starting in this host.

p-slough in monit-grafana

Am I reading in the wrong place?

rhclopes commented 2 months ago

And psconfig shows

[root@ps-slough-bw log]# psconfig remote list
=== pScheduler Agent ===
[
   {
      "url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-bw.perf.ja.net",
      "configure-archives": true
   },
   {
      "url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-lat.perf.ja.net",
      "configure-archives": true
   }
]

=== Grafana Agent ===
[
   {
      "url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-bw.perf.ja.net",
      "configure-archives": true
   },
   {
      "url": "https://psconfig.opensciencegrid.org/pub/auto/ps-slough-lat.perf.ja.net",
      "configure-archives": true
   }
]
rhclopes commented 2 months ago

After update to 5.1b, all entriews in opensearch.log are like

[2024-04-25T18:15:06,839][WARN ][o.o.p.c.u.JsonConverter  ] [ps-slough-bw.perf.ja.net] Json Mapping Error: Cannot invoke "java.lang.Long.longValue()" because "this.cacheMaxSize" is null (through reference chain: org.opensearch.performanceanalyzer.collectors.CacheConfigMetricsCollector$CacheMaxSizeStatus["Cache_MaxSize"])

That error doesn't exist in 5.08.

rhclopes commented 2 months ago

When I run

wget https://psconfig.opensciencegrid.org/pub/auto/ps-slough-lat.perf.ja.net

The file shows

    "host-archive48": {
      "archiver": "esmond",
      "data": {
        "url": "https://ps-slough-lat.perf.ja.net/esmond/perfsonar/archive",
        "measurement-agent": "{% scheduled_by_address %}",
        "schema": 1
      }
    },

Is that right? the host ps-slough is on 5.1b. It doesn't have Esmond, right?