owntracks / recorder

Store and access data published by OwnTracks apps

Slow Frontend Performance and ot-recorder 100% CPU #497

Closed · Jachimo closed this issue 1 month ago

Jachimo commented 2 months ago

I'll preface this by saying that I'm not entirely sure if this is a frontend issue or a recorder issue, but I'm leaning towards recorder so I'm reporting it here.

Behavior: When using the OwnTracks Frontend, the UI stays on "Loading data, please wait…" for many minutes at a time. While this is happening, ot-recorder on the server shows very high CPU usage, typically pegged at 100%. After several minutes, the "Loading Data" UI element disappears from the client browser, but no new data is shown on the Frontend map; it seems the request is timing out.

Expected Behavior: The Frontend will load data from the server and display it, and the recorder backend will retrieve and serve data via the API quickly enough to avoid a timeout.

Test Setup: a VPS with 2 processors, 2 GB RAM, and SSD-backed storage (from RackNerd; a pretty standard KVM Linux VPS), running Debian 12 and used exclusively for OwnTracks, with very little else installed. OwnTracks was installed and configured using the quicksetup Ansible scripts, and it's running recorder 0.9.8 and frontend 2.15.3. The data is about one week's worth of location information from 3 devices; two devices were in Significant mode most of the time, one was in Move mode most of the time. Ping times from the client to the server are about 215 ms (min/avg/max/stddev = 205.382/215.504/276.041/20.332 ms).

Data seems to be correctly recorded to the backend files—running tail -f on an active device's .rec file (inside /var/spool/owntracks/recorder/rec/username/devicename/) shows entries being added periodically by devices. The device with the most recorded data (the one in Move mode most of the time) has approx 141k lines in its .rec file.
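
Concretely, the checks above amount to something like this (username, devicename, and the monthly .rec file name are placeholders; adjust for your own layout under the quicksetup defaults):

```sh
# Count recorded lines per device (paths follow the quicksetup defaults).
wc -l /var/spool/owntracks/recorder/rec/*/*/*.rec

# Watch new entries arrive for one device; username, devicename, and the
# monthly YYYY-MM.rec file name are placeholders for your own values.
tail -f /var/spool/owntracks/recorder/rec/username/devicename/2024-09.rec
```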

The 'ot-recorder' systemd service is reported (via systemctl status) as "active (running)" with an uptime of 6 days, 514M of memory, and 1 task. The mosquitto service is also running; uptime 6 days, 37M memory, 1 task. And nginx is running, but with a shorter active time (only about 15 hours) and 4.6M of memory. I'm unclear whether the 514M of memory usage by ot-recorder is normal or represents some sort of memory leak.
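
(For anyone comparing numbers: those figures come from systemctl status; the same values can also be read straight from the service cgroup, assuming memory accounting is enabled, which recent systemd enables by default.)

```sh
# Summary of the three services involved (state, uptime, memory, tasks).
systemctl status ot-recorder mosquitto nginx --no-pager

# Read the memory and task counts directly from the service cgroup.
systemctl show ot-recorder -p MemoryCurrent -p TasksCurrent
```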

I don't think this is a particularly slow or underpowered VPS. As a rough benchmark, it runs tar -cvzf test.tgz quicksetup (compressing the quicksetup repository checkout) in 0m0.065s real. Happy to run some better benchmarks if raw compute horsepower is potentially the problem.

[Screenshot attached: Screen Shot 2024-09-17 at 7 54 40 PM]

jpmens commented 2 months ago

> approx 141k lines in its .rec file

that's quite a lot...

If you use Frontend to select a short time span, do you likewise experience the high CPU load on the Recorder machine?

Jachimo commented 2 months ago

@jpmens Yes, although (as one would expect) for less time.

But there was a threshold of data below which it would generally work fine and above which it would fail, and that threshold turned out to be right around 10 MB of JSON returned by the locations API.

My current working theory is that this was due to the crummy cellular network I was using. I suspect that it was injecting RSTs after 10MB of transfer (slimy of them), thus causing the Frontend to request, and Recorder to process and send, the same data over and over until it timed out.
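
A cheap way to check for that repeat-request pattern is to count identical locations requests in the nginx access log; the log path below is the Debian default, and /api/0/locations is the endpoint the Frontend calls for location history:

```sh
# Count how many times each locations query was requested.
# /var/log/nginx/access.log is the Debian default path; adjust if needed.
grep '/api/0/locations' /var/log/nginx/access.log \
  | awk '{print $7}' | sort | uniq -c | sort -rn | head
```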

On a more reliable terrestrial connection, it doesn't seem to be reproducible. A request for a full day of data from a device in Move mode takes about 20 sec for the locations API call (Load 229 in the linked profile): 16 s "Waiting" (during which there is 100% CPU load on the Recorder) and 3.5 s "Receiving" (consistent with the speed of the connection) to deliver 14.3 MB of JSON to the Frontend. Total page load time is about 55-60 seconds, of which about 20 sec is load time (profiled with Firefox).
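
For anyone who wants to time the same call without the Frontend in the way, curl can report the equivalent numbers; the host, user, device, and date range below are placeholders:

```sh
# Time a one-day locations query directly against the Recorder API.
# owntracks.example.org, username, devicename, and the dates are placeholders;
# time_starttransfer corresponds to the "Waiting" phase, time_total to the whole transfer.
curl -s -o /tmp/locations.json \
  -w 'size=%{size_download}B  ttfb=%{time_starttransfer}s  total=%{time_total}s\n' \
  'https://owntracks.example.org/api/0/locations?user=username&device=devicename&from=2024-09-16&to=2024-09-17'
```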

Is that basically normal/expected performance?

jpmens commented 2 months ago

Normal and expected depend on a lot of factors; you might also like to study this thread in which we performed some measurements of the API.

Jachimo commented 1 month ago

Noted, thanks for the help!