progress / iceberg

A collection of code, utilities, and guides from real-world customer engagements.
Other
12 stars 7 forks source link

Agent memory metrics reported by proant status do not match what the OS is reporting for the _mproaps process? #11

Closed c3rberus closed 1 year ago

c3rberus commented 1 year ago

Hello,

I am using iceberg to manage our PAS instance, when using the proant status command, the various memory metrics reported do not match what the OS level process is consuming.

In the example below, I have an OEPAS APSV ABL session running that is consuming large amounts of memory, but none of the metrics inside the tool add up to the total memory being consumed by the process.

Buildfile: /usr/local/iceberg/build.xml

status:
     [echo] Code Release: 2022-07-07T11:56:15.191
     [echo] DLC Home: /usr3/dlc
     [echo] OpenEdge Version: 12.2.10
   [PCTRun] Starting output to file: status_training_2022-10-27T09_51_03.891-07_00.txt ...
   [PCTRun]
   [PCTRun] Utility Runtime: 12.2.10.0.1386
   [PCTRun] Report Executed: 2022-10-27T09:51:03.894-07:00
   [PCTRun]  PASOE Instance: http://localhost:17128
   [PCTRun]
   [PCTRun] ABL Application Information [training - v12.2.10 ( 2022-06-10 )]
   [PCTRun]     WebApp: ROOT
   [PCTRun]       APSV: ENABLED
   [PCTRun]       SOAP: DISABLED
   [PCTRun]       REST: DISABLED
   [PCTRun]        WEB: DISABLED
   [PCTRun]
   [PCTRun] Manager Properties
   [PCTRun]             Maximum Agents:           1
   [PCTRun]             Minimum Agents:           1
   [PCTRun]             Initial Agents:           1
   [PCTRun]     Max. Connections/Agent:         300
   [PCTRun]     Max. ABLSessions/Agent:         300
   [PCTRun]         Idle Conn. Timeout:     300,000 ms (00:05:00.000)
   [PCTRun]       Idle Session Timeout:   1,800,000 ms (00:30:00.000)
   [PCTRun]         Idle Agent Timeout:   1,800,000 ms (00:30:00.000)
   [PCTRun]      Idle Resource Timeout:           0 ms (00:00:00.000)
   [PCTRun]         Conn. Wait Timeout:       3,000 ms (00:00:03.000)
   [PCTRun]       Request Wait Timeout:      15,000 ms (00:00:15.000)
   [PCTRun]     Initial Sessions/Agent:          60
   [PCTRun]     Min. Avail. Sess/Agent:           1
   [PCTRun]
   [PCTRun] > Agent PID 14484034: AVAILABLE
   [PCTRun]     Est. Agent Lifetime: 01:09:08.549
   [PCTRun]     DynMax ABL Sessions:          300
   [PCTRun]      Total ABL Sessions:           33
   [PCTRun]      Avail ABL Sessions:           32
   [PCTRun]        Open Connections:            2
   [PCTRun]         Overhead Memory: 866,162 KB
   [PCTRun]
   [PCTRun]     SESSION ID      STATE           STARTED                         LIFETIME        SESS. MEMORY    BOUND/ACTIVE CLIENT SESSION
   [PCTRun]                4    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       7,449 KB      -
   [PCTRun]                7    ACTIVE          10/27/2022 08:41:55.463+07:00   01:09:08.549       5,434 KB      -
   [PCTRun]                8    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]                9    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               10    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               11    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               12    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               13    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               14    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               15    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               16    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               17    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               18    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               19    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               20    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               22    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               23    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               26    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               29    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               31    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               32    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               34    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               35    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               38    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               42    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               45    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               49    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               50    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               53    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               56    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               62    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               63    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]               65    IDLE            10/27/2022 08:41:55.463+07:00   01:09:08.549       3,521 KB      -
   [PCTRun]       Active Agent-Sessions: 1 of 33 (3% Busy)
   [PCTRun]     Utilized Agent-Sessions: 2 of 33 (>3,522 KB)
   [PCTRun]        Approx. Agent Memory: 988,192 KB
   [PCTRun]
   [PCTRun] Session Manager Metrics (Count-Based)
   [PCTRun]            # Requests to Session:          201
   [PCTRun]           # Agent Responses Read:          473 (0 Errors)
   [PCTRun]         # Agent Requests Written:          404 (0 Errors)
   [PCTRun]     Concurrent Connected Clients:            1 (Max: 6)
   [PCTRun]       # Reserve ABLSession Waits:            0
   [PCTRun]     # Reserve ABLSession Timeout:            0
   [PCTRun]
   [PCTRun] Client HTTP Sessions: 1
   [PCTRun]     STATE     SESS STATE  BOUND     LAST ACCESS / STARTED           ELAPSED TIME  SESSION MODEL    ADAPTER   SESSION ID                                                                                                                                                                                              REQUEST ID
   [PCTRun]
   [PCTRun]     RUNNING   RESERVED    NO        2022-10-27T09:50:59.569-0700    00:00:04.498  SESSION_FREE     APSV      7F9C029DBBBA14BFD1AC1E82A2B0499FD2753E3EB092.training                                                                                                                                                 ROOT:a:0000020e
   [PCTRun]     |- ClientConn: 172.16.104.254   2022-10-27T09:50:59.569-0700    00:00:04.498  Proc: mytest123.p
   [PCTRun]     |-- AgentConn: PID 14484034  RESERVED  Agent: xxx-xxxxx/172.16.4.16:17129  Local: /172.16.4.16:37852
   [PCTRun]

BUILD SUCCESSFUL
Total time: 0 seconds

Looking at the OS level topas utility, it reports 26GB for the _mproapsv process.

Name           PID  CPU%  PgSp Owner   
_mproaps   14484034  2.0 26.3G prosvr  
java       16843502  1.8 39.9M root    
java        5505790  0.1  130M prosvr  
perl5.28   22086076  0.1 5.43M lpar2rrd
reaffin      459030  0.0  640K root    
java       21234068  0.0  124M prosvra 
bash       17105498  0.0 1.05M root    
topas      15270370  0.0 9.75M mirde   
ipsec_lo   22020564  0.0 1020K root    
_mproaps   22348164  0.0 23.2M prosvr  
_mprosrv   23527926  0.0 2.41M prodb
java       19268342  0.0  148M prosvr
_mprosrv   23462254  0.0 2.39M prodb
vtiol        655644  0.0 6.31M root
java       20251242  0.0  148M prosvra
nvme_kpr    5570882  0.0 1.25M root
swapper         520  0.0  448K root
_mprshut   20840770  0.0 2.07M prodb
Exiting from topas1  0.0  872K root

Is there a way to report on what ABL agent session is using up the memory? In this case, my test case program (mytest123.p) is written to allocate a bunch of memory on purpose for testing, but that is not being reported inside the tool.

Is there a way for the tool to report on what agent session is holding onto this memory? In this case, it should be session id 7.

DustinGrau-PSC commented 1 year ago

Before I attempt an answer, it may be worthwhile to ask if you could provide the mentioned code which is “allocating memory” to understand what you’re attempting to illustrate. Also, an update has just been published for this tooling if you wish to re-run your tests with the latest version.

c3rberus commented 1 year ago

Hi Dustin,

Using a pro session to run client.p, this connects to the PASOE APSV instance and runs PasMem.p procedure.

Code for client.p is below...

def var h as handle no-undo.

/* store APSV connection in an array */
CREATE SERVER h.
h:CONNECT( "-URL http://xxx-xxxxx:12345/apsv -sessionModel Session-Free", "", "~
", "" ).

IF h:CONNECTED() then
do:
/* run something*/
RUN PasMem.p ON SERVER h.

end.

The PasMem.p procedure has:

def var myLongChar0 as longchar.
def var myLongChar1 as longchar.
def var myLongChar2 as longchar.
def var myLongChar3 as longchar.
def var myLongChar4 as longchar.
def var myLongChar5 as longchar.
def var myLongChar6 as longchar.
def var myLongChar7 as longchar.
def var myLongChar8 as longchar.
def var myLongChar9 as longchar.

//do while length(myLongChar) < 67127 * 31991:
myLongChar0 = fill("a",31991).
do while length(myLongChar1) < 62000 * 31991:
  myLongChar1 = myLongChar1 + myLongChar0.
end.

myLongChar0 = fill("b",31991).
do while length(myLongChar2) < 62000 * 31991:
  myLongChar2 = myLongChar2 + myLongChar0.
end.

myLongChar0 = fill("c",31991).
do while length(myLongChar3) < 62000 * 31991:
  myLongChar3 = myLongChar3 + myLongChar0.
end.

myLongChar0 = fill("d",31991).
do while length(myLongChar4) < 62000 * 31991:
  myLongChar4 = myLongChar4 + myLongChar0.
end.

myLongChar0 = fill("e",31991).
do while length(myLongChar5) < 62000 * 31991:
  myLongChar5 = myLongChar5 + myLongChar0.
end.

myLongChar0 = fill("f",31991).
do while length(myLongChar6) < 62000 * 31991:
  myLongChar6 = myLongChar6 + myLongChar0.
end.

myLongChar0 = fill("g",31991).
do while length(myLongChar7) < 62000 * 31991:
  myLongChar7 = myLongChar7 + myLongChar0.
end.

myLongChar0 = fill("h",31991).
do while length(myLongChar8) < 62000 * 31991:
  myLongChar8 = myLongChar8 + myLongChar0.
end.

myLongChar0 = fill("i",31991).
do while length(myLongChar9) < 62000 * 31991:
  myLongChar9 = myLongChar9 + myLongChar0.
end.

//pause 30 no-message.

/* deallocate */
myLongChar0 = ?.
myLongChar1 = ?.
myLongChar2 = ?.
myLongChar3 = ?.
myLongChar4 = ?.
myLongChar5 = ?.
myLongChar6 = ?.
myLongChar7 = ?.
myLongChar8 = ?.
myLongChar9 = ?.

I am basically calling a procedure that allocates a bunch of text into a longchar and as expected the _mproapsv process memory increases significantly. In the system the _mproaps process grows very large (30gb+), when I run proant status the memory metrics reported do not match the multi-session agent process memory.

I was under the impression that this would be reflected somewhere in the output, either by adding up all the ABL session memory, or the overhead memory, or approx. agent memory - but they do not accurately report the same metrics as OS level tools (nmon, topas, ps).

My goal is to be able to see more detailed memory usage inside the multi-session agent, per ABL session; how much memory is each active/idle session using? I thought this was SESS.MEMORY but it does not add up.

DustinGrau-PSC commented 1 year ago

Thank you for the sample code, that was extremely helpful for allowing me to see the memory changes on my local environment for context.

Let me first respond to your primary question "Is there a way for the tool to report on what agent session is holding onto this memory?" with a simple "that's precisely what it's doing, but not in the way you were expecting". What you see reported is the overhead memory of the MSAgent process (_mproapsv) which is anything NOT tied to an ABL Session, while the memory reported by each ABL Session is what is currently allocated/needed by that session. Depending on how well you clean up after your own code (eg. DELETE OBJECT, dereference memptr's, deallocate longchar's, etc.) affects how much memory may be consumed during operation by the end of a request. The reported "Approximate Agent Memory" is exactly that--it's the total of the overhead memory of the agent PLUS each ABL Session's current tracked memory, and was distinctly given the prefix "approximate" because it will never be exact to what the OS reports.

Second, this tooling cannot be changed to provide what you're asking as it is only reporting on what is made available from the PAS instance itself. It utilizes the OEManager endpoint which communicates with the PAS instance to obtain information via MBeans. It's the same background source used by the OEJMX utility, and would be the same reported if you used JConsole or a similar tool to talk directly to the Tomcat instance to obtain this memory information. If there is a change to be made, it's at the product level and your query should be sent through the usual channels to request a product enhancement.

Lastly, what you get from the internal counters is what the server sees of itself and is potentially different from what the OS may see--and that's an important distinction. The MSAgent can allocate and deallocate its allotted memory from the OS to ABL Sessions during operation, but how that parent process (_mproapsv) is treated by the OS is different. This analogy may not be precise, but think of system memory like a balloon--as the agent needs more memory from the OS its balloon inflates. But as each ABL Session performs proper cleanup it may no longer need memory, but the MSAgent can't just give a portion back to the OS, and so the balloon remains inflated. The OS has effectively allocated an amount of space for that balloon regardless of how the MSAgent utilizes the internal space--and so what the MSAgent knows of its own needs is different from what the OS sees, and thus a potential discrepancy may exist.

c3rberus commented 1 year ago

Hi Dustin,

Thanks for the very detailed response, that helps with understanding the tool and _mproapsv memory allocation in general and ballooning.

One question I had for you, you mention "_What you see reported is the overhead memory of the MSAgent process (mproapsv) which is anything NOT tied to an ABL Session" and this now makes sense for what the metrics behind overhead memory are reporting.

You also mention that "memory reported by each ABL Session is what is currently allocated/needed by that session", going back to my code example of longchar, when I run the procedure, the ABL code would allocate gigs of memory and this is confirmed by looking at the _mproapsv process grow using OS utilities, however when I look at this using proant status, the ABL session memory metrics are nowhere near the memory metrics reported by the OS.

The sample code when executed should spike the ABL session memory to about 15gig as confirmed by monitoring the growth of the _mproapsv process, but SESS. MEMORY only reports a few megs at most.

When you ran this in your local environment, on a newly spun up PAS instance with nothing else running, if you run the procedure you should see _mproapsv spike into gigs of memory, and if you run a "proant status" while the procedure is executing, the ABL session memory metrics would be heavily under reported (megs vs. gigs).

While slightly off topic and probably more reserved for Progress TS, to make things worse if I re-run the same piece of code against the same _mproapsv process, it continues to balloon _mproapsv agent process memory even though we explicitly deallocate the longchar variables, seems very contradictory to the analogy that you provided. While the analogy is on-point and how I also understood and confirmed things to be working when running with legacy AppServers broker/agent architecture.

DustinGrau-PSC commented 1 year ago

I am not knowledgeable enough to accurately describe how the ABL Session memory is accounted at runtime, or how it may track that information (whether it is a real-time report, or only produces results at the boundary of a completed request). My best guess is that the timing of when per-session memory is reported may be subject to any explicit destruction of objects, handles, etc. and/or garbage collection, which may negate/reduce the reported total once the code has completed execution.

It sounds as though you may need to direct your questions to the Progress Communities forum or more directly open a request to TS for clarity on your particular example. For the purposes of this ticket as it was opened, regarding this tooling for accessing the server metrics and summing the available data, the feature works as expected given the available parameters.

DustinGrau-PSC commented 1 year ago

"...if you run the procedure you should see _mproapsv spike into gigs of memory, and if you run a "proant status" while the procedure is executing, the ABL session memory metrics would be heavily under reported (megs vs. gigs)."

As I read this again, this point may be the source of the misunderstanding. The internal metrics may not be counting every bit of memory allocated (whether by design or by availability of the necessary data), meaning my statement that the metrics track what the ABL Session needs was incorrect. What the OS sees will always be correct, period, because the OS is in charge of all running processes. There are indeed objects and structures which are tracked within each ABL Session, and we can use this data to identify certain leaks in ABL code, but there may be other factors that are preventing your particular use-case from exhibiting the results you are expecting to see. I maintain that your best option is to present a question to tech support which may result in a KB article that could be of use to you and others.

c3rberus commented 1 year ago

Thanks for the feedback Dustin, appreciated! I will close this issue now and take it up with Progress TS since some of my questions are out of scope of the management/monitoring tool.