Naheel-Azawy closed this issue 1 year ago
It could be a memory issue.
You won't find anything in Steve's logs, as the process was killed from outside. What do the system logs, e.g. /var/log/messages, say?
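If the kernel's OOM killer was involved, it should show up in the kernel messages kept by the journal, e.g. something along these lines (adjust the time range to when the process died):
$ journalctl -k --since "2 days ago" | grep -iE 'out of memory|oom-killer|killed process'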
Indeed, it was a memory issue. I don't have /var/log/messages, but here's what the journal has to say (output trimmed):
May 15 02:38:07 somehost kernel: systemd invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
May 15 02:38:07 somehost kernel: CPU: 0 PID: 1 Comm: systemd Not tainted 5.19.12-arch1-1 #1 2183db5e2ff49b915549bc42a3e56ec968f6996b
May 15 02:38:07 somehost kernel: Hardware name: Linode Compute Instance, BIOS Not Specified
May 15 02:38:07 somehost kernel: Call Trace:
May 15 02:38:07 somehost kernel: <TASK>
May 15 02:38:07 somehost kernel: dump_stack_lvl+0x48/0x60
May 15 02:38:07 somehost kernel: dump_header+0x4a/0x1ff
May 15 02:38:07 somehost kernel: oom_kill_process.cold+0xb/0x10
May 15 02:38:07 somehost kernel: out_of_memory+0x27e/0x520
...
May 15 02:38:07 somehost kernel: Tasks state (memory values in pages):
May 15 02:38:07 somehost kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
May 15 02:38:07 somehost kernel: [2817714] 1000 2817714 1002391 425559 4227072 57903 0 java
...
May 15 02:38:07 somehost kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-12.scope,task=j>
May 15 02:38:07 somehost kernel: Out of memory: Killed process 2817714 (java) total-vm:4009564kB, anon-rss:1702224kB, file-rss:0kB, shmem-rss:12kB, UID:1000 pgtables:4128kB oom_score_adj:0
May 15 02:38:07 somehost systemd[1]: session-12.scope: A process of this unit has been killed by the OOM killer.
I tried checking on it now:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
me 3042185 0.6 53.6 3185848 1086580 pts/1 Sl+ May16 8:01 java -jar target/steve.jar
I restarted steve and monitored it for a couple of minutes, sampling every 10 seconds (header added for readability):
$ while :; do echo "$(date): $(ps aux | grep 'java .*steve' | head -n1)"; sleep 10; done
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
Wed May 17 09:48:15 AM UTC 2023: me 3055261 39.7 16.0 2676612 325028 pts/1 Sl+ 09:47 0:25 java -jar target/steve.jar
Wed May 17 09:48:25 AM UTC 2023: me 3055261 34.5 16.4 2684804 333272 pts/1 Sl+ 09:47 0:25 java -jar target/steve.jar
Wed May 17 09:48:35 AM UTC 2023: me 3055261 30.5 16.8 2684804 341320 pts/1 Sl+ 09:47 0:25 java -jar target/steve.jar
Wed May 17 09:48:45 AM UTC 2023: me 3055261 27.4 17.2 2684804 349668 pts/1 Sl+ 09:47 0:25 java -jar target/steve.jar
Wed May 17 09:48:55 AM UTC 2023: me 3055261 24.8 17.6 2684804 357912 pts/1 Sl+ 09:47 0:25 java -jar target/steve.jar
Wed May 17 09:49:05 AM UTC 2023: me 3055261 22.7 18.0 2692996 366156 pts/1 Sl+ 09:47 0:25 java -jar target/steve.jar
Wed May 17 09:49:15 AM UTC 2023: me 3055261 21.0 18.0 2692996 366156 pts/1 Sl+ 09:47 0:26 java -jar target/steve.jar
Wed May 17 09:49:25 AM UTC 2023: me 3055261 19.5 18.4 2692996 374400 pts/1 Sl+ 09:47 0:26 java -jar target/steve.jar
Wed May 17 09:49:35 AM UTC 2023: me 3055261 18.2 18.9 2758532 382672 pts/1 Sl+ 09:47 0:26 java -jar target/steve.jar
Wed May 17 09:49:45 AM UTC 2023: me 3055261 17.1 19.3 2766724 390652 pts/1 Sl+ 09:47 0:26 java -jar target/steve.jar
Wed May 17 09:49:55 AM UTC 2023: me 3055261 16.1 19.7 2766724 398860 pts/1 Sl+ 09:47 0:26 java -jar target/steve.jar
Wed May 17 09:50:05 AM UTC 2023: me 3055261 15.2 20.0 2766724 406832 pts/1 Sl+ 09:47 0:26 java -jar target/steve.jar
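If anyone wants to repeat this, a pgrep-based variant of the loop avoids matching the grep process itself (a small sketch, not exactly what I ran):
$ while :; do pid=$(pgrep -f 'steve.jar' | head -n1); [ -n "$pid" ] && echo "$(date -u): $(ps -o pid=,pmem=,vsz=,rss=,etime= -p "$pid")"; sleep 10; done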
I haven't watched the consumption for long enough, but this looks like a memory leak. Maybe related to #509?
It's difficult to say whether there is a memory leak without profiling the service after many garbage collections. Check if you can dig deeper on your side. https://www.baeldung.com/java-memory-leaks could help.
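If the JDK tools are available on the host, a rough first check (before reaching for a full profiler) is to compare heap occupancy and live-object histograms across forced GCs, roughly like this on a reasonably recent JDK (a sketch, not a definitive recipe):
$ pid=$(pgrep -f 'steve.jar')
$ jcmd "$pid" GC.run                     # request a full GC
$ jcmd "$pid" GC.heap_info               # heap occupancy after the GC
$ jmap -histo:live "$pid" | head -n 30   # top classes by live instances; repeat later and compare
If the numbers keep climbing across several such rounds, a heap dump analyzed in a tool like Eclipse MAT is the next step.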
I wrote an ugly little script that restarts steve every day at 12 am and collects memory usage every 30 minutes. This way I know steve will be killed before it eats the entire memory, and I can keep track of what's happening, to some extent. As my current work is not at production level, I'll live with this as a workaround and close this issue in favor of #509. But there clearly is something wrong, and it would be great if someone could spare the time to resolve this.
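For reference, the restart-and-sample job can be driven from cron roughly like this (a sketch with placeholder paths, not the exact script):
#!/bin/sh
# steve-watch.sh -- sketch of a restart-and-sample wrapper (placeholder paths)
# crontab:
#   0 0 * * *     /opt/steve-watch.sh restart   # restart steve at midnight
#   */30 * * * *  /opt/steve-watch.sh sample    # log memory usage every 30 minutes
case "$1" in
  restart)
    pkill -f 'java -jar target/steve.jar'
    cd /path/to/steve && nohup java -jar target/steve.jar > steve.out 2>&1 &
    ;;
  sample)
    pid=$(pgrep -f 'java -jar target/steve.jar' | head -n1)
    [ -n "$pid" ] && echo "$(date -u): rss_kb=$(ps -o rss= -p "$pid")" >> "$HOME/steve-mem.log"
    ;;
esac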
For the record, I'll share my observations.
@Naheel-Azawy It would be nice to have a view of the requests made by the station. Steve doesn't store them. That would help reproduce the exact OCPP workflow in a JMH test.
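One low-effort way to get that view is a packet capture of the charger's OCPP-J traffic on the server, e.g. (8180 is only an assumption here; use whatever port your steve instance actually listens on):
$ sudo tcpdump -i any -w ocpp.pcap 'tcp port 8180'
The resulting pcap can then be inspected in Wireshark to extract the WebSocket frames.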
Checklist
Specifications
Expected Behavior
The server does not get killed without a known reason.
Actual Behavior
The server gets killed with no clear reason to be found in the logs.
I have been running steve since October last year for some experiments. Most of the experiments were done for a limited amount of time and on experimental, custom-built setups. Around two weeks ago, I connected a commercial charger, a Kostad CPC50, to the server, and it has been running 24/7. Around a week ago (a rough estimate, as I don't have any records of that), I noticed that steve was dead. I started it again without overthinking it. Today, I noticed the same thing happening again.
Steps to Reproduce the Problem
Additional context
I have steve running in tmux, and it can be seen that it says "Killed":
Tailing the log:
I tried grepping for "Killed" in steve's source and couldn't find anything. Is it possible that it received a SIGKILL somehow? I'm the only person accessing this server, and I doubt anyone would be interested in putting effort into hacking my boring experiments.
My quick and dirty solution for now is to run steve as follows and pray that it won't get killed while someone is charging.
Please excuse my outdated versions; I'm a bit worried about ruining my work. I also haven't updated steve before posting this issue, because reproducing the problem would probably take around another week. I'm not sure if this is a bug or if I'm doing something wrong. Any help would be appreciated.
Thanks