Killed without a trace - Githubissues

Naheel-Azawy commented 1 year ago

Checklist

[X] I checked other issues already, but found no answer/solution
[X] I checked the documentation and wiki, but found no answer/solution
[ ] I am running the latest version and the issue still occurs
[ ] I am sure that this issue is about SteVe (and not about the charging station software or something unrelated to SteVe)

Specifications

SteVe Version     : 3.4.9
Operating system  : Arch Linux, kernel 5.19.12-arch1-1
JDK               : OpenJDK 64-Bit Server VM (build 18.0.2+0, mixed mode)
Database          : mariadb  Ver 15.1 Distrib 10.9.3-MariaDB, for Linux (x86_64) using readline 5.1

Expected Behavior

The server not getting killed for no known reason

Actual Behavior

The server gets killed with no clear reason to be found in the logs.

I have been running steve since October last year for some experiments. Most of the experiments were done with limited amount of time and on experimental custom built setups. Around 2 weeks ago, I connected a commercial charger to the server, Kostad CPC50, which is running 24/7. Around a week ago (rough estimate as I don't have any records for that), I noticed that steve is dead. I started it again without overthinking. Today, I noticed the same thing happening again.

Steps to Reproduce the Problem

Run steve for long enough
Connect from a Kostad CPC50. Maybe that's unrelated, not sure
Wait for week(s)
Check if it gets killed

Additional context

I have steve running on tmux, it can be seen that it says "Killed":

$ java -jar target/steve.jar
Log file: /home/me/logs/steve.log
Starting............................... Done!
Hint: You can stop the application by pressing CTRL+C

Access the web interface using
- http://some.ip:8180/steve/manager/home
- http://some.other.ip:8180/steve/manager/home
- https://some.ip:8443/steve/manager/home
- https://some.other.ip:8443/steve/manager/home
SOAP endpoint for OCPP
- http://some.ip:8180/steve/services/CentralSystemService
- http://some.other.ip:8180/steve/services/CentralSystemService
- https://some.ip:8443/steve/services/CentralSystemService
- https://some.other.ip:8443/steve/services/CentralSystemService
WebSocket/JSON endpoint for OCPP
- ws://some.ip:8180/steve/websocket/CentralSystemService/(chargeBoxId)
- ws://some.other.ip:8180/steve/websocket/CentralSystemService/(chargeBoxId)
- wss://some.ip:8443/steve/websocket/CentralSystemService/(chargeBoxId)
- wss://some.other.ip:8443/steve/websocket/CentralSystemService/(chargeBoxId)

Killed
$

Tailing the log:

[INFO ] 2023-05-15 02:31:12,913 de.rwth.idsg.steve.ocpp.ws.WebSocketLogger - [chargeBoxId=SOMEID, sessionId=2761c2ed-82d5-73e8-1396-bd75d558c76b] Received: [2,"5C986B29-F13D-5F9C-001C-AC6417A65E93","Heartbeat",{}]
[INFO ] 2023-05-15 02:31:12,924 de.rwth.idsg.steve.ocpp.ws.WebSocketLogger - [chargeBoxId=SOMEID, sessionId=2761c2ed-82d5-73e8-1396-bd75d558c76b] Sending: [3,"5C986B29-F13D-5F9C-001C-AC6417A65E93",{"currentTime":"2023-05-15T02:31:12.916Z"}]
[INFO ] 2023-05-15 02:36:13,270 de.rwth.idsg.steve.ocpp.ws.WebSocketLogger - [chargeBoxId=SOMEID, sessionId=2761c2ed-82d5-73e8-1396-bd75d558c76b] Received: [2,"743D072E-413D-5F9C-001C-AC6417A65E93","Heartbeat",{}]
[INFO ] 2023-05-15 02:36:13,292 de.rwth.idsg.steve.ocpp.ws.WebSocketLogger - [chargeBoxId=SOMEID, sessionId=2761c2ed-82d5-73e8-1396-bd75d558c76b] Sending: [3,"743D072E-413D-5F9C-001C-AC6417A65E93",{"currentTime":"2023-05-15T02:36:13.272Z"}]

I tried grepping for "Killed" under steve's source and couldn't find anything. Is it possible that it has received a SIGKILL somehow? I'm the only person accessing this server and I doubt anyone would be interesting in putting effort to hack into my boring experiments.

My quick dirty solution for now is to run steve as follows and pray that it won't get killed while someone is charging.

while :; do date; java -jar target/steve.jar; done

Please excuse my outdated versions, I'm a bit worried of ruining my work. I also haven't updated steve before posting this issue because reproducing it would probably take around another week. I'm not sure if this is a bug or if I'm doing something wrong. Any help would be appreciated.

Thanks

juherr commented 1 year ago

It could be a memory issue.

You won't find anything in Steve logs as the process was killed outside.

What are saying system logs like /var/log/messages?

Naheel-Azawy commented 1 year ago

Indeed, it was a memory issue. I don't have /var/log/messages, but here's what journal has to say (output trimmed):

May 15 02:38:07 somehost kernel: systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
May 15 02:38:07 somehost kernel: CPU: 0 PID: 1 Comm: systemd Not tainted 5.19.12-arch1-1 #1 2183db5e2ff49b915549bc42a3e56ec968f6996b
May 15 02:38:07 somehost kernel: Hardware name: Linode Compute Instance, BIOS Not Specified
May 15 02:38:07 somehost kernel: Call Trace:
May 15 02:38:07 somehost kernel:  <TASK>
May 15 02:38:07 somehost kernel:  dump_stack_lvl+0x48/0x60
May 15 02:38:07 somehost kernel:  dump_header+0x4a/0x1ff
May 15 02:38:07 somehost kernel:  oom_kill_process.cold+0xb/0x10
May 15 02:38:07 somehost kernel:  out_of_memory+0x27e/0x520
...
May 15 02:38:07 somehost kernel: Tasks state (memory values in pages):
May 15 02:38:07 somehost kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
...
May 15 02:38:07 somehost kernel: [2817714]  1000 2817714  1002391   425559  4227072    57903             0 java
...
May 15 02:38:07 somehost kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-12.scope,task=j>
May 15 02:38:07 somehost kernel: Out of memory: Killed process 2817714 (java) total-vm:4009564kB, anon-rss:1702224kB, file-rss:0kB, shmem-rss:12kB, UID:1000 pgtables:4128kB oom_score_adj:0
May 15 02:38:07 somehost systemd[1]: session-12.scope: A process of this unit has been killed by the OOM killer.

I tried checking looking at it now

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
me       3042185  0.6 53.6 3185848 1086580 pts/1 Sl+  May16   8:01 java -jar target/steve.jar

I restarted steve and monitored for a few seconds (header added for readability):

$ while :; do echo "$(date): $(ps aux | grep 'java .*steve' | head -n1)"; sleep 10; done
                                 USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
Wed May 17 09:48:15 AM UTC 2023: me       3055261 39.7 16.0 2676612 325028 pts/1  Sl+  09:47   0:25 java -jar target/steve.jar
Wed May 17 09:48:25 AM UTC 2023: me       3055261 34.5 16.4 2684804 333272 pts/1  Sl+  09:47   0:25 java -jar target/steve.jar
Wed May 17 09:48:35 AM UTC 2023: me       3055261 30.5 16.8 2684804 341320 pts/1  Sl+  09:47   0:25 java -jar target/steve.jar
Wed May 17 09:48:45 AM UTC 2023: me       3055261 27.4 17.2 2684804 349668 pts/1  Sl+  09:47   0:25 java -jar target/steve.jar
Wed May 17 09:48:55 AM UTC 2023: me       3055261 24.8 17.6 2684804 357912 pts/1  Sl+  09:47   0:25 java -jar target/steve.jar
Wed May 17 09:49:05 AM UTC 2023: me       3055261 22.7 18.0 2692996 366156 pts/1  Sl+  09:47   0:25 java -jar target/steve.jar
Wed May 17 09:49:15 AM UTC 2023: me       3055261 21.0 18.0 2692996 366156 pts/1  Sl+  09:47   0:26 java -jar target/steve.jar
Wed May 17 09:49:25 AM UTC 2023: me       3055261 19.5 18.4 2692996 374400 pts/1  Sl+  09:47   0:26 java -jar target/steve.jar
Wed May 17 09:49:35 AM UTC 2023: me       3055261 18.2 18.9 2758532 382672 pts/1  Sl+  09:47   0:26 java -jar target/steve.jar
Wed May 17 09:49:45 AM UTC 2023: me       3055261 17.1 19.3 2766724 390652 pts/1  Sl+  09:47   0:26 java -jar target/steve.jar
Wed May 17 09:49:55 AM UTC 2023: me       3055261 16.1 19.7 2766724 398860 pts/1  Sl+  09:47   0:26 java -jar target/steve.jar
Wed May 17 09:50:05 AM UTC 2023: me       3055261 15.2 20.0 2766724 406832 pts/1  Sl+  09:47   0:26 java -jar target/steve.jar

I haven't checked the consumption for long enough but this looks like a memory leak. Maybe related to #509 ?

juherr commented 1 year ago

Difficult to say if there is memory leak without profiling the service after many garbage collections. Check if you can go deeper on your side. https://www.baeldung.com/java-memory-leaks could help.

Naheel-Azawy commented 1 year ago

I wrote an ugly little script that restarts steve everyday 12am and collects memory usage every 30 minutes. This way I know steve will be killed before it eats the entire memory and I can keep track of what's happening, to some extent. As my current work is not at production level, I'll live with this as a workaround and close this issue in favor of #509. But there clearly is something wrong and it would be great if someone can afford time to resolve this.

For the record, I'll share my observations.

Virtual memory of steve for 4 days

plot

Virtual and resident memory of steve for the 18th of May

CSV

```csv 00:31:09,2795428,597672 01:01:09,2854820,760352 01:31:10,2937764,838300 02:01:10,2995108,798044 02:31:10,2945956,833160 03:01:10,3060644,923920 03:31:10,2937764,801712 04:01:10,2937764,826280 04:31:10,3052452,883744 05:01:10,3093412,931396 05:31:10,2986916,858144 06:01:10,3052452,936080 06:31:10,3068836,981660 07:01:10,2995108,938916 07:31:10,3167140,1067460 08:01:10,3052452,938404 08:31:10,3011232,955936 09:01:10,2986656,939844 09:31:10,3035808,985044 10:01:10,3117728,1007412 10:31:10,3117728,1002440 11:01:10,3068576,1018212 11:31:10,3125920,1067508 12:01:10,3052192,963520 12:31:10,3256992,1153996 13:01:10,3134112,971864 13:31:10,3060384,969848 14:01:10,3117728,961728 14:31:10,3068576,962184 15:01:10,3199648,1084772 15:31:10,3199648,1060132 16:01:10,3183264,1079308 16:31:10,3134112,1068940 17:01:10,3199648,1028268 17:31:10,3183264,1068724 18:01:10,3134112,1020372 18:31:10,3117728,1006344 19:01:10,3273376,1192060 19:31:10,3232416,1118924 20:01:10,3142304,1073384 20:31:10,3117728,1040016 21:01:10,3330720,1269136 21:31:10,3199648,1114104 22:01:10,3117728,1096992 22:31:10,3183264,1064612 23:01:10,3389088,1222660 23:31:10,3258016,1219940 ```

Script used

Script

```sh #!/bin/sh outdir=./steve-watch mkdir -p "$outdir" start=$(date) run=1 { while [ "$run" = 1 ]; do java -jar target/steve.jar & p=$! # shellcheck disable=SC2064 trap "kill $p" INT echo ">>> STEVE STARTED PID=$p" while [ -d /proc/$p ]; do echo ">>> MEM: $(ps -hup $p)" sleep 30m done & while [ -d /proc/$p ]; do h=$(date +%H) m=$(date +%M) if [ "$h" = 00 ] && [ "$m" = 00 ]; then kill $p echo ">>> KILLED STEVE PID=$p, waiting for a minute..." sleep 61 break fi sleep 10 done fg echo ">>> STEVE END PID=$p" done 2>&1 } | ts | tee "$outdir/$start" ```

juherr commented 1 year ago

@Naheel-Azawy That could be nice to have a view on the requests made by the station. Steve doesn't store them. It will help to reproduce the exact ocpp workflow in a JMH test.

steve-community / steve

Killed without a trace #1147

Checklist

Specifications

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Additional context

Virtual memory of steve for 4 days

Virtual and resident memory of steve for the 18th of May

Script used