progress / iceberg

A collection of code, utilities, and guides from real-world customer engagements.

Error Parsing HealthCheck Dataset when monitoring multiple PASOE instances? #25

Closed c3rberus closed 1 year ago

c3rberus commented 1 year ago

Looking at deploying the "monitoring" side of iceberg in production. The collector and monitoring application are deployed on the same machine (OE 12.2.12, AIX).

I can successfully deploy the Monitoring/Collector PASOE instance/database, and the Monitoring/Application code without issues, enabling metrics (pulse). Using default ports, I can get to the web UI and see monitoring data.

The moment I go to deploy Monitoring/Application to a 2nd PASOE instance (server has multiple PASOE instances), I get errors and there are no metrics being reported in the UI for the 2nd instance.

[2023-08-19T10:39:27.235-07:00] B.Intake | INFO - Health data received from production @ http://127.0.0.1:15605
[2023-08-19T10:39:27.236-07:00] B.Intake | ERROR - Error Parsing HealthCheck Dataset for production @ http://127.0.0.1:15605: Error parsing JSON: expected brace, but found bracket. (15358)
[2023-08-19T10:39:27.238-07:00] B.Intake | ERROR - Error Processing Health Data: Unable to read HealthCheck JSON into dataset for further processing.

The monitored PASOE instances are configured with identical settings other than the name/ports.

If I wipe everything and start over, this time deploying Monitoring/Application FIRST to the instance that was reporting errors processing health data, it works without any issues. Adding the 2nd instance (the one that previously worked when it was deployed first) then starts producing the same errors.

Seems like the moment we go to add a 2nd instance to be monitored, the 2nd instance reports errors and no metrics are generated.

With TRACE logging enabled, the monitor/temp folder fills with generated .json files that reference the FIRST instance that monitoring was deployed on, but none reference the SECOND instance that has errors.

Any idea what we could do to debug this?

I suppose one could create multiple collector/monitor instances to potentially work around this, but it seems like the solution was built to support pulling in metrics from multiple PASOE instances.
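The error text above suggests the parser expected the payload to open with a brace (`{`) and met a bracket (`[`) instead. As a minimal sketch for inspecting the .json files dumped under TRACE logging, the Python helper below reports the top-level shape of a JSON payload. The function name and the sample payloads are hypothetical, and this does not reproduce how iceberg itself reads the HealthCheck dataset.

```python
import json


def top_level_shape(payload: str) -> str:
    """Return 'object', 'array', or 'other' for the top-level JSON value."""
    value = json.loads(payload)
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "other"


if __name__ == "__main__":
    # Hypothetical payloads, not taken from a real HealthCheck response.
    for sample in ('{"status": "ok"}', '[{"status": "ok"}]'):
        print(top_level_shape(sample), "<-", sample)
```

A payload reported as `array` would be consistent with the "expected brace, but found bracket" message, though that alone would not explain why the payload took that shape.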

DustinGrau-PSC commented 1 year ago

The server (monitor) should be able to accept data from multiple instances, even from the same server. The data should be sent from 2 distinct ports since that is how PAS instances are distinguished internally.

To debug this, we're going to need a COMPLETE rundown of every command you issued to set up the instances--this includes deploying the application code which supports the advanced monitoring, enabling the health metrics, starting the instances, and most importantly starting the monitoring on each instance. Create 2 documents, one for each of your PAS instances, and collect the commands run and any output--we need to see if there are any obvious key differences between the enablement processes for these by diff'ing the 2 files.

If the enablement processes are good, then we can move on to the collection side. In all cases there are logging.config options which can be altered--changing the logging-level attributed to the monitoring (collection or parsing) code to TRACE will generate much more detail including dumping of the JSON files being transferred. That's a last-resort as it's obviously going to be quite noisy unless we know exactly what to look for.
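The diff'ing step suggested above can be sketched in Python with the standard `difflib` module. This is only an illustration: the helper names are hypothetical, and the masking rules (instance name and `port=` values) are an assumption about what is expected to differ between the two setups.

```python
import difflib
import re


def normalize(lines, instance):
    """Mask the instance name and any 'port=NNNN' value on each line."""
    masked = []
    for line in lines:
        line = line.replace(instance, "<instance>")
        line = re.sub(r"(port=)\d+", r"\1<port>", line)
        masked.append(line)
    return masked


def diff_setups(lines_a, name_a, lines_b, name_b):
    """Return unified-diff lines for two setup transcripts after masking."""
    a = normalize(lines_a, name_a)
    b = normalize(lines_b, name_b)
    return list(
        difflib.unified_diff(a, b, fromfile=name_a, tofile=name_b, lineterm="")
    )


if __name__ == "__main__":
    first = ["tcman.sh config psc.as.http.port=8860", "tcman.sh feature HTTPS=off"]
    second = ["tcman.sh config psc.as.http.port=8870", "tcman.sh feature HTTPS=on"]
    for line in diff_setups(first, "pasoe1", second, "pasoe2"):
        print(line)
```

An empty result means the two transcripts differ only in the masked values; anything printed is a candidate difference worth a closer look.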

c3rberus commented 1 year ago

I was able to re-create the issue, below is output of all the commands that ran.

  1. Created monitor PASOE instance and database with defaults.
  2. Created 2 PAS instances called "pasoe1" and "pasoe2", both with unique ports.
  3. Deployed monitoring application and started the PASOE instances.
  4. Enabled metric collection.
  5. Generated some APSV calls on PASOE instances.

There was no error in intake.log; however, the issue remains that the monitor UI for the 2nd instance (pasoe2) is not reporting any agent activity (screenshots included at the end).

` #proenv

DLC: /usr1/dlc WRKDIR: /usr1/wrk OEM: /usr1/oemgmt OEMWRKDIR: /usr1/wrk_oemgmt

Inserting /usr1/dlc/bin to beginning of path and setting the current directory to /usr1/wrk.

OpenEdge Release 12.2.12 as of Thu Apr 20 15:00:30 EDT 2023

#cd /usr/local/iceberg/PAS/Monitoring/Collector
proant create -Dwrk=/usr1/wrk

Buildfile: /usr/local/iceberg/PAS/Monitoring/Collector/build.xml

create: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [echo] [echo] Creating PAS instance 'monitor' at /usr1/wrk/monitor with ports 8850/8851/8852/8853... [echo] See file create_monitor.txt for details of PAS server creation. [echo] Deploying UI files to ROOT WebApp... [unzip] Expanding: /usr/local/iceberg/PAS/Monitoring/Collector/ui.zip into /usr1/wrk/monitor/webapps/ROOT [echo] Deploying ABL code to monitor ABLApp... [copy] Copying 7 files to /usr1/wrk/monitor/ablapps/monitor/openedge [echo] [echo] Creating DB instance at /usr1/wrk/monitor/db... [mkdir] Created dir: /usr1/wrk/monitor/db [PCTCreateBase] Generating pasmon structure [PCTCreateBase] Copying DB /usr1/dlc/empty8 to pasmon [PCTCreateBase] Loading /usr/local/iceberg/PAS/Monitoring/Collector/schema/pasmon.df in database [echo] Truncating BI: -C truncate bi -G 1 -bi 262128 -biblocksize 16 -cpinternal UTF-8 -cpcoll Basic [echo] BI Grow: -C bigrow 4 -r -cpinternal UTF-8 -cpcoll Basic [copy] Copying 1 file to /usr1/wrk/monitor/openedge [copy] Copying /usr/local/iceberg/PAS/Monitoring/Collector/openedge/logging.config to /usr1/wrk/monitor/openedge/logging.config [copy] Copying 4 files to /usr1/wrk/monitor/bin [copy] Copying 1 file to /usr1/wrk/monitor/openedge [echo] Adjusting logging-tomcat.properties to add %D token (elapsed time in ms)... [copy] Copying 1 file to /usr1/wrk/monitor/conf [copy] Copying /usr1/wrk/monitor/conf/logging-tomcat.properties to /usr1/wrk/monitor/conf/logging-tomcat.properties.bak [copy] Copying 1 file to /usr/local/iceberg/PAS/Monitoring/Collector/openedge [echo] Merging initial properties from 'merge_monitor.openedge.properties'.

BUILD SUCCESSFUL

# proant startup -Dwrk=/usr1/wrk

Buildfile: /usr/local/iceberg/PAS/Monitoring/Collector/build.xml

startup: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Starting stopped PASOE instance monitor [exec] info: Archiving previous log files [exec] info: Waiting for PASOE instance startup to complete... [exec] .............. [exec] [exec] info: Getting PASOE instance process ids for 19989002 ... [exec] info: Scanning for startup errors... [exec] info: Scanning ABL applications for errors: monitor [exec] info: No errors found in the PASOE instance /logs folder [exec] info: PASOE startup in 16 seconds [exec] { [exec] "start-action": "start", [exec] "initial-state": "stopped", [exec] "initial-processes": [ 0 ], [exec] "exit-state": "started", [exec] "exit-description": "Starting stopped PASOE instance monitor", [exec] "exit-processes": [ 19989002, 24576564 ], [exec] "exit-status": "0", [exec] "exit-errors": [ [exec] ] [exec] }

BUILD SUCCESSFUL Total time: 17 seconds

#pasman create -U root -G usr -m admin:password -Z prod -N pasoe1 /usr1/wrk/pasoe1

Regenerating ABLDomainRegistry.keystore for AIX OEDomainRegistryUtil v1.5.3 (01/30/2019) Sucessfully generated AIX version of /usr1/wrk/pasoe1/conf/ABLDomainRegistry.keystore server instance pasoe1 created at /usr1/wrk/pasoe1

#cd /usr1/wrk/pasoe1/bin/
oeprop.sh pasoe1.root.APSV.adapterEnabled=1
oeprop.sh pasoe1.root.APSV.statusEnabled=1
oeprop.sh pasoe1.root.APSV.allowRuntimeUpdates=1
oeprop.sh AppServer.SessMgr.defrdLogNumLines=10240
oeprop.sh AppServer.SessMgr.defrdLoggingLevel=4
oeprop.sh AppServer.SessMgr.defrdLogEntryTypes=4GLTrace
tcman.sh config psc.as.http.port=8860
tcman.sh config psc.as.https.port=8861
tcman.sh config psc.as.shut.port=8862
tcman.sh config psc.as.healthcheck.port=8863
tcman.sh config psc.as.ajp13.port=8864
tcman.sh feature SingleSignOn=off
tcman.sh feature HTTPS=off
tcman.sh feature HealthCheck=on
tcman.sh config psc.as.health.enabled=true

#cd /usr/local/iceberg/PAS/Monitoring/Application
proant deploy_metrics -Dwrk=/usr1/wrk -Dinstance=pasoe1

Buildfile: /usr/local/iceberg/PAS/Monitoring/Application/build.xml

deploy_metrics: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Result: 1 [echo] Creating backup of original openedge.properties file... [copy] Copying 1 file to /usr1/wrk/pasoe1/conf [copy] Copying /usr1/wrk/pasoe1/conf/openedge.properties to /usr1/wrk/pasoe1/conf/openedge.properties.bak [echo] Installing application metrics logic into /usr1/wrk/pasoe1... [copy] Copying 10 files to /usr1/wrk/pasoe1/bin [copy] Copying 1 file to /usr1/wrk/pasoe1/openedge [copy] Copying /usr/local/iceberg/PAS/Monitoring/Application/deploy/oe122/logging.config to /usr1/wrk/pasoe1/openedge/logging.config [copy] Copying 9 files to /usr1/wrk/pasoe1/openedge [echo] Adjusting logging-tomcat.properties to add %D token (elapsed time in ms)... [copy] Copying 1 file to /usr1/wrk/pasoe1/conf [copy] Copying /usr1/wrk/pasoe1/conf/logging-tomcat.properties to /usr1/wrk/pasoe1/conf/logging-tomcat.properties.bak

BUILD SUCCESSFUL Total time: 0 seconds

#pasman pasoestart -restart -I pasoe1

Starting stopped PASOE instance pasoe1 ..........

Start action: start Initial state: stopped Initial processes: 0 Exit state: started Exit description: Starting stopped PASOE instance pasoe1 Exit processes: 23986682 15860182 Exit status: 0 Exit errors:

#proant metrics -Dwrk=/usr1/wrk -Dinstance=pasoe1 -DlocalIP=127.0.0.1 -Dtype=pulse -Dmonitor=127.0.0.1 -Dport=8850

Buildfile: /usr/local/iceberg/PAS/Monitoring/Application/build.xml

metrics: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Result: 1 [PCTRun] PASOE Instance: /usr1/wrk/pasoe1 [PCTRun] Metrics Type: pulse (on) [PCTRun] [PCTRun] Agent PID 15860182: AVAILABLE [PCTRun] Query: {"O":"PASOE:type=OEManager,name=AgentManager", "M":["debugTest", "15860182", "LiveDiag", "http://127.0.0.1:8850/web/pdo/monitor/intake/liveMetrics", 20, "sessions,requests,calltrees,ablobjs|app=pasoe1|host=127.0.0.1|name=Metrics_2023-08-21T20:34:13.970-07:00|health=http://127.0.0.1:8850/web/pdo/monitor/intake/liveHealth"]} [PCTRun] Result: {"debugTest":{"ABLOutput":{"description":"sessions,requests,calltrees,ablobjs|app=pasoe1|host=127.0.0.1|name=Metrics_2023-08-21T20:34:13.970-07:00|health=http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveHealth","interval":20,"operation":"LiveDiag","target":"http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveMetrics"},"ABLReturnVal":true,"agentId":"LUbccencRYqCfxaGB0MxQw","pid":"15860182"}}

BUILD SUCCESSFUL Total time: 6 seconds

#cd /usr1/wrk/monitor/temp/intake
cat intake.log

[2023-08-21T20:34:20.570-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'
[2023-08-21T20:34:21.147-07:00] B.Intake | INFO - Health data received from pasoe1 @ http://127.0.0.1:8860
[2023-08-21T20:34:40.063-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'
[2023-08-21T20:34:40.416-07:00] B.Intake | INFO - Health data received from pasoe1 @ http://127.0.0.1:8860

#pasman create -U root -G usr -m admin:password -Z prod -N pasoe2 /usr1/wrk/pasoe2

Regenerating ABLDomainRegistry.keystore for AIX OEDomainRegistryUtil v1.5.3 (01/30/2019) Sucessfully generated AIX version of /usr1/wrk/pasoe2/conf/ABLDomainRegistry.keystore server instance pasoe2 created at /usr1/wrk/pasoe2

#cd /usr1/wrk/pasoe2/bin/
oeprop.sh pasoe2.root.APSV.adapterEnabled=1
oeprop.sh pasoe2.root.APSV.statusEnabled=1
oeprop.sh pasoe2.root.APSV.allowRuntimeUpdates=1
oeprop.sh AppServer.SessMgr.defrdLogNumLines=10240
oeprop.sh AppServer.SessMgr.defrdLoggingLevel=4
oeprop.sh AppServer.SessMgr.defrdLogEntryTypes=4GLTrace
tcman.sh config psc.as.http.port=8870
tcman.sh config psc.as.https.port=8871
tcman.sh config psc.as.shut.port=8872
tcman.sh config psc.as.healthcheck.port=8873
tcman.sh config psc.as.ajp13.port=8874
tcman.sh feature SingleSignOn=off
tcman.sh feature HTTPS=off
tcman.sh feature HealthCheck=on
tcman.sh config psc.as.health.enabled=true

#cd /usr/local/iceberg/PAS/Monitoring/Application
proant deploy_metrics -Dwrk=/usr1/wrk -Dinstance=pasoe2

Buildfile: /usr/local/iceberg/PAS/Monitoring/Application/build.xml

deploy_metrics: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Result: 1 [echo] Creating backup of original openedge.properties file... [copy] Copying 1 file to /usr1/wrk/pasoe2/conf [copy] Copying /usr1/wrk/pasoe2/conf/openedge.properties to /usr1/wrk/pasoe2/conf/openedge.properties.bak [echo] Installing application metrics logic into /usr1/wrk/pasoe2... [copy] Copying 10 files to /usr1/wrk/pasoe2/bin [copy] Copying 1 file to /usr1/wrk/pasoe2/openedge [copy] Copying /usr/local/iceberg/PAS/Monitoring/Application/deploy/oe122/logging.config to /usr1/wrk/pasoe2/openedge/logging.config [copy] Copying 9 files to /usr1/wrk/pasoe2/openedge [echo] Adjusting logging-tomcat.properties to add %D token (elapsed time in ms)... [copy] Copying 1 file to /usr1/wrk/pasoe2/conf [copy] Copying /usr1/wrk/pasoe2/conf/logging-tomcat.properties to /usr1/wrk/pasoe2/conf/logging-tomcat.properties.bak

BUILD SUCCESSFUL Total time: 0 seconds

#pasman pasoestart -restart -I pasoe2

Starting stopped PASOE instance pasoe2 ..........

Start action: start Initial state: stopped Initial processes: 0 Exit state: started Exit description: Starting stopped PASOE instance pasoe2 Exit processes: 26280388 23200492 Exit status: 0 Exit errors:

#proant metrics -Dwrk=/usr1/wrk -Dinstance=pasoe2 -DlocalIP=127.0.0.1 -Dtype=pulse -Dmonitor=127.0.0.1 -Dport=8850

Buildfile: /usr/local/iceberg/PAS/Monitoring/Application/build.xml

metrics: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Result: 1 [PCTRun] PASOE Instance: /usr1/wrk/pasoe2 [PCTRun] Metrics Type: pulse (on) [PCTRun] [PCTRun] Agent PID 23200492: AVAILABLE [PCTRun] Query: {"O":"PASOE:type=OEManager,name=AgentManager", "M":["debugTest", "23200492", "LiveDiag", "http://127.0.0.1:8850/web/pdo/monitor/intake/liveMetrics", 20, "sessions,requests,calltrees,ablobjs|app=pasoe2|host=127.0.0.1|name=Metrics_2023-08-21T20:40:25.998-07:00|health=http://127.0.0.1:8850/web/pdo/monitor/intake/liveHealth"]} [PCTRun] Result: {"debugTest":{"ABLOutput":{"description":"sessions,requests,calltrees,ablobjs|app=pasoe2|host=127.0.0.1|name=Metrics_2023-08-21T20:40:25.998-07:00|health=http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveHealth","interval":20,"operation":"LiveDiag","target":"http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveMetrics"},"ABLReturnVal":true,"agentId":"MGEMpWT3Skm47EuO8FB6hQ","pid":"23200492"}}

BUILD SUCCESSFUL Total time: 5 seconds

#cd /usr1/wrk/monitor/temp/intake
cat intake.log

[2023-08-21T20:41:21.786-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'
[2023-08-21T20:41:22.121-07:00] B.Intake | INFO - Health data received from pasoe1 @ http://127.0.0.1:8860
[2023-08-21T20:41:31.809-07:00] B.Intake | INFO - Metrics Received from 'pasoe2' @ 'http://127.0.0.1:8870'
[2023-08-21T20:41:32.150-07:00] B.Intake | INFO - Health data received from pasoe2 @ http://127.0.0.1:8870
[2023-08-21T20:41:41.965-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'
[2023-08-21T20:41:42.299-07:00] B.Intake | INFO - Health data received from pasoe1 @ http://127.0.0.1:8860
[2023-08-21T20:41:51.895-07:00] B.Intake | INFO - Metrics Received from 'pasoe2' @ 'http://127.0.0.1:8870'
[2023-08-21T20:41:52.235-07:00] B.Intake | INFO - Health data received from pasoe2 @ http://127.0.0.1:8870 `

Links to UI screenshots for the pasoe1 instance and the pasoe2 instance.

Just to rule out a browser issue, I tried Chrome and Edge (with and without Incognito), clearing all cache, etc.
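To confirm from intake.log that the collector is receiving data for both instances, the entries can be tallied per instance. The following Python sketch assumes the two INFO message formats shown in the log output above; the regular expression and function name are illustrative and not part of iceberg.

```python
import re
from collections import Counter

# Matches both message styles seen in intake.log:
#   Metrics Received from 'pasoe1' @ 'http://...'
#   Health data received from pasoe1 @ http://...
LINE = re.compile(
    r"INFO - (?P<kind>Metrics Received|Health data received) from "
    r"'?(?P<app>[^'\s]+)'? @"
)


def count_by_instance(log_text):
    """Count metrics and health entries per instance name."""
    counts = Counter()
    for match in LINE.finditer(log_text):
        kind = "metrics" if match.group("kind").startswith("Metrics") else "health"
        counts[(match.group("app"), kind)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[2023-08-21T20:41:21.786-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'\n"
        "[2023-08-21T20:41:32.150-07:00] B.Intake | INFO - Health data received from pasoe2 @ http://127.0.0.1:8870\n"
    )
    for key, total in sorted(count_by_instance(sample).items()):
        print(key, total)
```

If an instance is missing from the tally, the problem is on the collection side; if both appear, as in the log above, the UI is the more likely place to look.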

c3rberus commented 1 year ago

I also tried restarting the two PASOE instances and re-enabling metrics, same issue.

` #pasman pasoestart -restart -I pasoe1

Stop running PASOE instance pasoe1 NOTE: Picked up JDK_JAVA_OPTIONS: --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED NOTE: Picked up JDK_JAVA_OPTIONS: --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED . Restarting the running PASOE instance pasoe1 ..........

Start action: restart Initial state: started Initial processes: 23986682 15860182 Exit state: started Exit description: Restarting the running PASOE instance pasoe1 Exit processes: 25362768 23920996 Exit status: 0 Exit errors:

#pasman pasoestart -restart -I pasoe2

Stop running PASOE instance pasoe2 NOTE: Picked up JDK_JAVA_OPTIONS: --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED NOTE: Picked up JDK_JAVA_OPTIONS: --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.rmi/sun.rmi.transport=ALL-UNNAMED . Restarting the running PASOE instance pasoe2 ..........

Start action: restart Initial state: started Initial processes: 26280388 23200492 Exit state: started Exit description: Restarting the running PASOE instance pasoe2 Exit processes: 17826430 26280392 Exit status: 0 Exit errors:

#cd /usr/local/iceberg/PAS/Monitoring/Application
proant metrics -Dwrk=/usr1/wrk -Dinstance=pasoe1 -DlocalIP=127.0.0.1 -Dtype=pulse -Dmonitor=127.0.0.1 -Dport=8850

Buildfile: /usr/local/iceberg/PAS/Monitoring/Application/build.xml

metrics: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Result: 1 [PCTRun] PASOE Instance: /usr1/wrk/pasoe1 [PCTRun] Metrics Type: pulse (on) [PCTRun] [PCTRun] Agent PID 23920996: AVAILABLE [PCTRun] Query: {"O":"PASOE:type=OEManager,name=AgentManager", "M":["debugTest", "23920996", "LiveDiag", "http://127.0.0.1:8850/web/pdo/monitor/intake/liveMetrics", 20, "sessions,requests,calltrees,ablobjs|app=pasoe1|host=127.0.0.1|name=Metrics_2023-08-21T21:22:10.151-07:00|health=http://127.0.0.1:8850/web/pdo/monitor/intake/liveHealth"]} [PCTRun] Result: {"debugTest":{"ABLOutput":{"description":"sessions,requests,calltrees,ablobjs|app=pasoe1|host=127.0.0.1|name=Metrics_2023-08-21T21:22:10.151-07:00|health=http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveHealth","interval":20,"operation":"LiveDiag","target":"http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveMetrics"},"ABLReturnVal":true,"agentId":"nuJ4mP_wSbmcRgSSXfGiqQ","pid":"23920996"}}

BUILD SUCCESSFUL Total time: 6 seconds

#proant metrics -Dwrk=/usr1/wrk -Dinstance=pasoe2 -DlocalIP=127.0.0.1 -Dtype=pulse -Dmonitor=127.0.0.1 -Dport=8850

Buildfile: /usr/local/iceberg/PAS/Monitoring/Application/build.xml

metrics: [echo] DLC Home: /usr1/dlc [echo] OpenEdge Version: 12.2.12 [exec] Result: 1 [PCTRun] PASOE Instance: /usr1/wrk/pasoe2 [PCTRun] Metrics Type: pulse (on) [PCTRun] [PCTRun] Agent PID 26280392: AVAILABLE [PCTRun] Query: {"O":"PASOE:type=OEManager,name=AgentManager", "M":["debugTest", "26280392", "LiveDiag", "http://127.0.0.1:8850/web/pdo/monitor/intake/liveMetrics", 20, "sessions,requests,calltrees,ablobjs|app=pasoe2|host=127.0.0.1|name=Metrics_2023-08-21T21:22:18.325-07:00|health=http://127.0.0.1:8850/web/pdo/monitor/intake/liveHealth"]} [PCTRun] Result: {"debugTest":{"ABLOutput":{"description":"sessions,requests,calltrees,ablobjs|app=pasoe2|host=127.0.0.1|name=Metrics_2023-08-21T21:22:18.325-07:00|health=http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveHealth","interval":20,"operation":"LiveDiag","target":"http:\/\/127.0.0.1:8850\/web\/pdo\/monitor\/intake\/liveMetrics"},"ABLReturnVal":true,"agentId":"nqZqv_qXQUq2YQ2wKuluLA","pid":"26280392"}}

BUILD SUCCESSFUL Total time: 6 seconds `

DustinGrau-PSC commented 1 year ago

Thank you so much for all the legwork on this, I agree that all looks expected. And the screenshots may have tipped me off to the problem. When you switch between PAS instances which have been collected (as confirmed by the intake.log) you have different Agent PIDs and sessions, and sometimes the UI can get confused as to which was the last one it viewed. Here's what I think happened: when you switched between the PAS instances, the UI wasn't able to default to any specific PID and so it couldn't retrieve the necessary data. Go back to the Overview tab and select one of the agents (or better, a specific agent-session) and see if the context loads correctly for the remaining tabs for that PAS instance.

c3rberus commented 1 year ago

Hoping we can use the monitoring UI as it would fill a very critical gap in being able to have trended metrics of the PASOE stack.

I can confirm that the Overview tab of both pasoe1 and pasoe2 shows data (agents/sessions). It does seem to be a bug with switching between the PASOE instances. I was able to reproduce the blank agent activity on both pasoe1 and pasoe2.

Here is a screen recording link.

DustinGrau-PSC commented 1 year ago

Gotcha, and that's very helpful. That confirms that the UI gets confused when switching between instances, as it's either not able to recall the last PID that was selected or it has wiped out that info by switching the server instance. I've run into this before but never so clearly reproduced the conditions. The workaround for the moment would be to select a PID when you first change servers, and the tell-tale sign of that is seeing "Agent -" before any of the graph titles--there should always be an "Agent ##### -" when a PID is in context.

c3rberus commented 1 year ago

I see what you mean with the Agent ####, and after selecting the session from the Overview page, it still shows the bird's-eye view of agent activity, which is great. I will use the workaround and keep an eye out on future commits for a fix :)

Are there any concerns (stability/performance) with running the LiveDiag debug feature in production? I see it is referred to as an experimental feature, and that deploy_metrics deploys a set of R-code files that override some of the product-supplied ABL code; it also states that the code may become part of the standard product at some point.

My goal is to use this in production when we switch from legacy to PASOE to have visibility into the PASOE stack, it is way better than fiddling with JConsole.

DustinGrau-PSC commented 1 year ago

Overall this solution is much more performant (and safer) for production use than anything implemented prior. The collection of data should not impede normal operation of the agent, though we have not tested this at scale with super-high-transaction environments. Yes, we've pushed server instances internally and we have customers using this in the field, but we don't actively monitor how they use it or what kind of experiences they have. Usually if there's a problem then it would come back to me here in this repo, though most issues reported have been minor.

Most of the internal data gets purged between pulses, so there's going to be a tradeoff between pushing out the data (to clear the constructs and free memory internally) and how frequently you push that data out (as pushing every 1-2 seconds can have a bad effect). The default time used for the pulse interval at the moment is a good compromise on those points. The other factor is what you collect--the calltrees will enable a portion of the profiler code and so there could be a timing impact from that as well as retaining a lot of data that may never be used. At the minimum, reporting on the sessions and requests will give good insight but not overwhelm the collection system.
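One way to check the pulse interval actually in effect is to measure the gap between consecutive metrics posts for each instance in intake.log. The Python sketch below assumes the timestamp and message format seen in the log output earlier in this thread; the function name is hypothetical.

```python
import re
from datetime import datetime

ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\] B\.Intake \| INFO - Metrics Received from '(?P<app>[^']+)'"
)


def pulse_gaps(log_text):
    """Map each instance to the seconds between its consecutive metrics posts."""
    last_seen = {}
    gaps = {}
    for match in ENTRY.finditer(log_text):
        stamp = datetime.fromisoformat(match.group("ts"))
        app = match.group("app")
        if app in last_seen:
            delta = (stamp - last_seen[app]).total_seconds()
            gaps.setdefault(app, []).append(delta)
        last_seen[app] = stamp
    return gaps


if __name__ == "__main__":
    sample = (
        "[2023-08-21T20:34:20.570-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'\n"
        "[2023-08-21T20:34:40.063-07:00] B.Intake | INFO - Metrics Received from 'pasoe1' @ 'http://127.0.0.1:8860'\n"
    )
    print(pulse_gaps(sample))
```

For the two pasoe1 entries above the gap comes out just under 20 seconds, which lines up with the interval value of 20 passed in the LiveDiag query shown in the metrics output.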

c3rberus commented 1 year ago

Interesting insight, thanks for that.

I am primarily after agent/session requests and memory metrics, not so much the calltree for now. Is there a way to exclude calltree and will the UI work with that data not being reported?

DustinGrau-PSC commented 1 year ago

If you look at the "pulse_on" scripts, the default settings are "OPTIONS=sessions,requests,calltrees,ablobjs" which sends quite a bit of information as this is assumed to be used in development primarily. For production use the recommended settings are "OPTIONS=sessions,requests" which will provide comprehensive session data, including tracking of requests which helps determine any spikes in response times. If necessary you can scale back to just "sessions" which is very much bare-bones and the absolute minimum.

https://github.com/progress/iceberg/blob/main/PAS/Monitoring/Application/deploy/oe122/bin/pulse_on.sh
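Judging only from the Query output earlier in this thread, the last LiveDiag argument appears to be a comma-separated option list followed by `|key=value` fields. The Python sketch below splits such a string and rebuilds it with a reduced option set. The format is inferred from that log output rather than from documentation, and the helper names are hypothetical; in practice the change is made by editing OPTIONS in the pulse_on script.

```python
def parse_description(description):
    """Split 'opt1,opt2|key=value|...' into (options, fields)."""
    head, *rest = description.split("|")
    options = [opt for opt in head.split(",") if opt]
    fields = dict(part.split("=", 1) for part in rest)
    return options, fields


def with_options(description, keep):
    """Rebuild the string, keeping only the options named in `keep`."""
    options, fields = parse_description(description)
    kept = [opt for opt in options if opt in keep]
    tail = "|".join(f"{key}={value}" for key, value in fields.items())
    return ",".join(kept) + ("|" + tail if tail else "")


if __name__ == "__main__":
    original = "sessions,requests,calltrees,ablobjs|app=pasoe1|host=127.0.0.1"
    print(with_options(original, {"sessions", "requests"}))
```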

c3rberus commented 1 year ago

Perfect, thanks Dustin for all your help and an invaluable tool.

DustinGrau-PSC commented 1 year ago

If you care to try out some development code, I've made some modifications on a new branch. The main change was to the ui.zip file in the Monitoring/Collector folder. You can either unzip that into your ROOT/static folder or just re-deploy the Monitor instance.

https://github.com/progress/iceberg/tree/Sept23

c3rberus commented 1 year ago

Dustin,

I have some good news :) I pulled the development code and re-created my test case, and it works now. I have not been able to reproduce the original issue where Agent Activity was not refreshing when switching between PASOE instances.

DustinGrau-PSC commented 1 year ago

Great! Then I've submitted the code for review and publication to the main branch, and will consider this issue closed.