ryansch / docker-unifi-rpi

Unifi Controller for Raspberry Pi 3+ and x86
https://hub.docker.com/r/ryansch/unifi-rpi/

64-bit versions don't work on Raspberry Pi 3B+ with 64-bit Raspbian #100

Closed: johnboiles closed this issue 8 months ago

johnboiles commented 1 year ago

Looks like the versions after 7.0 now specify -Xmx1024M, which is more heap than a Pi 3B (1 GB of total RAM) can actually spare.

So are Pi 3Bs no longer powerful enough to run more recent UniFi Controllers? If so, it might be worth mentioning somewhere. I re-flashed my Pi 3B with a 64-bit OS so I could pull the latest docker-unifi-rpi.

Looks like 512MB was originally adopted as a default here: https://github.com/ryansch/docker-unifi-rpi/issues/41

johnboiles commented 1 year ago

Ah, I see the compose.yml that I copy/pasted from the setup instructions includes a variable for this: JAVA_OPTS: -Xmx1024M. I set that to 512M and ran sudo systemctl restart unifi.
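
For reference, the override goes under the service's environment block in compose.yml; a minimal sketch (service name and image tag assumed to match the README's setup, adjust to yours):

services:
  unifi:
    image: ryansch/unifi-rpi:latest
    environment:
      # cap the controller's JVM heap well below the Pi 3B's 1GB of total RAM
      JAVA_OPTS: -Xmx512M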

And it still died (I can reliably trigger the Pi locking up by attempting to download a 7-day backup).

Strangely, though, I see java still using > 512M. Hmm.

[screenshot]
johnboiles commented 1 year ago

Just downgraded to 7.0.23 and restored my pre-64-bit backup (6.5.55) and it seems to be correctly capping at 512MB. I can even download full backups :)

Is 512MB too little for the 7.3 series? Is there a step other than sudo systemctl restart unifi needed to pick up a changed JAVA_OPTS environment variable in compose.yml?
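
One possibility (a sketch, assuming the systemd unit only restarts the existing container rather than recreating it): docker compose doesn't apply a changed environment value until the container is recreated, so something like the following should confirm what the JVM actually received.

cd /opt/unifi
docker compose up -d --force-recreate unifi   # recreate so the new JAVA_OPTS takes effect
docker compose top | grep -o 'Xmx[0-9]*[MG]'  # check the heap flag on the running java process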

johnboiles commented 1 year ago

7.0.23 still froze overnight, hmm. Going back to 6.5.55 (but now on a 64-bit OS) as a control, to make sure it's not some quirk of 64-bit Raspbian Lite. I guess if all else fails I'll go back to my rock-solid 32-bit install.

ryansch commented 1 year ago

@johnboiles I've actually been wondering that myself. I'm only running one small workload on a Pi 3; everything else in my home lab is a Pi 4 now.

If you get it working on just 512MB RAM, please post about it. If you can't and give up, please let me know that too.

I really wish we could have better support for 32-bit pis (arm32v7) but I can't get a version of mongo that will work!

johnboiles commented 1 year ago

I'm definitely regretting not getting an extra Pi 4 before the shortages! I'd prefer arm64 even on a Pi 3 if possible. Nearly all my computing devices in the house are arm64 these days.

I'm back on 6.5.55 now. I'll report back if it's stable. If it's not, then there's something deeper going on, since I ran the 32-bit 6.5.55 without issue since you released it in 2021 (and even ran several other workloads on that Pi at the same time).

johnboiles commented 1 year ago

The system requirements say 2GB of RAM, which doesn't make me very hopeful: https://help.ui.com/hc/en-us/articles/360012282453-UniFi-Network-Self-Hosting-your-UniFi-Network-Without-a-Console-Advanced-

johnboiles commented 1 year ago

Some unifi settings we might try tweaking in system.properties:

See also:

ryansch commented 1 year ago

Oh that's excellent work! If you get it working, I'd love to add some docs around this use case.

johnboiles commented 1 year ago

6.5.55 locked up (though it took a couple of days this time). Looking back in /var/log/messages, one of the last messages I see is Workqueue: events_freezable mmc_rescan. So it's possible my SD card (Samsung Endurance Pro 32GB) picked this moment to fail (maybe the extra writes from installing the new OS?), and I'm possibly conflating issues here. I have a new SD card on order. The existing one is 4+ years old, so it's put in its time I guess.

My understanding is that Linux should kill processes (via the OOM killer) when it runs out of RAM rather than lock up, so the SD card explanation maybe makes more sense than OOM.
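
A quick way to tell the two failure modes apart after a lockup (assuming a persistent kernel log; on stock Raspbian Lite, /var/log/kern.log or /var/log/messages serves the same purpose as the journal):

# OOM kills vs. SD-card/mmc errors from the previous boot
journalctl -k -b -1 | grep -iE 'out of memory|oom-killer|mmc'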

If it is my SD card, then possibly I'm hitting it quicker with the 7.x versions because of #97.

While I wait for the new card I'm trying this in system.properties. Perhaps nojournal will lower the disk IO.

inform.max_keep_alive_requests=50
inform.num_thread=100
unifi.G1GC.enabled=true
unifi.db.extraargs=--quiet
unifi.db.nojournal=false
unifi.xms=256
unifi.xmx=512
johnboiles commented 1 year ago

So far so good on 6.5.55 with the old SD card still. I also set up Telegraf to report to InfluxDB. Any other stats you think would be relevant to collect/visualize to help debug this?
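
A couple of extra Telegraf inputs that might be worth adding (a sketch; these are stock Telegraf plugins, nothing specific to this image):

[[inputs.procstat]]
  pattern = "java|mongod"   # per-process RSS, CPU and file-descriptor counts

[[inputs.diskio]]           # per-device read/write bytes and io_time

[[inputs.kernel_vmstat]]    # paging activity, useful for spotting swap/page-cache thrash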

[screenshot]
johnboiles commented 1 year ago

Yeah, blue is iowait CPU usage, so the lockups are probably my SD card dying.

[screenshot]

I don't love that RAM usage is creeping ever upward, but maybe it's leveling off.

[screenshot]
johnboiles commented 1 year ago

OK, it just died. Looks like Mongo dialed up its disk read rate (the write rate is pretty constant).

[screenshot]

I'm also seeing mongo threads bump up in the same timeframe.

[screenshot]

I visited the controller web interface around the time this began spiking, though I'm not sure if it's causal. I'll swap SD cards and see how it goes!

johnboiles commented 1 year ago

Yeah, I'm not seeing nearly as much iowait CPU usage on the new card. Probably I just got unlucky. If this is stable for a day or two I'll move up to 7.3 and see if I can get it stable with the lower memory settings.

[screenshot]
johnboiles commented 1 year ago

Huh, no: I woke up this morning to a locked-up Pi again. Looks like at ~midnight mongo just started hammering the disk with a bunch of reads and hasn't stopped:

[screenshots]

I do have auto backup on at midnight but only on Monday. I can run a backup manually just fine.

[screenshot]
johnboiles commented 1 year ago

OK, for my next test I'm going to use the 32-bit linux/arm/v7 6.5.55 image (which I've been using for a year), except on my 64-bit Pi OS, and see if that solves it:

pi@unifi:/opt/unifi $ cat compose.yml
services:
  unifi:
    image: ryansch/unifi-rpi:6.5.55
    platform: linux/arm/v7
    ...same as before...
pi@unifi:/opt/unifi $ for i in `docker ps --format "{{.Image}}"` ; do docker image inspect $i --format "$i -> {{.Architecture}} : {{.Os}}" ;done
ryansch/unifi-rpi:6.5.55 -> arm : linux
instantlinux/nut-upsd:latest -> arm64 : linux

Edit: Hmm mongo doesn't even start up using the linux/arm/v7 image. Strange.

johnboiles commented 1 year ago

Well, I just YOLO'd back up to 7.3 and the same thing happened. It worked for a while, then mongo and Java disk reads spiked for a while, then the system died.

[screenshot]

At this point I'm not sure how to debug further. I set the unifi logs to info but am not seeing anything; I guess I could turn them up to debug and look. The mongo log doesn't show much when the heavy disk access starts. It looks like this:

2023-04-01T06:48:45.646+0000 I COMMAND  [conn24] command ace.device command: find { find: "device", filter: { mac: "fc:ec:da:44:84:10" }, projection: {}, limit: 1, singleBatch: true, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: IXSCAN { mac: 1, site_id: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:2 nreturned:1 reslen:1937 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 24103ms
2023-04-01T06:48:45.646+0000 I WRITE    [conn12] update ace.user command: { q: { _id: ObjectId('5d1ef4b6e56755000192f71a') }, u: { _id: ObjectId('5d1ef4b6e56755000192f71a'), dev_id_override: 2813, fingerprint_override: true, first_seen: 1562309814, fixed_ip: "10.0.0.2", hostname: "house-pi", is_guest: false, is_wired: true, last_seen: 1680330572, mac: "dc:a6:32:02:73:58", name: "Home Assistant", network_id: "5abd38dd72d1f20001663fb9", noted: true, oui: "Raspberry Pi Trading Ltd", site_id: "5abd38d872d1f20001663faf", use_fixedip: true, usergroup_id: "", disconnect_timestamp: 1680331690 }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:2 keysDeleted:2 numYields:2 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 21698ms
2023-04-01T06:48:45.646+0000 I COMMAND  [conn4] command ace.alarm command: count { count: "alarm", query: { site_id: "5abd38d872d1f20001663faf", archived: false }, limit: 3000, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: COUNT_SCAN { site_id: 1, archived: 1 } keysExamined:2 docsExamined:0 numYields:2 reslen:44 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 24103ms
2023-04-01T06:48:54.521+0000 I COMMAND  [conn12] command ace.$cmd command: update { update: "user", ordered: true, $db: "ace" } numYields:0 reslen:59 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } protocol:op_query 32862ms
2023-04-01T06:48:55.956+0000 I COMMAND  [conn20] command ace_stat.stat_daily command: find { find: "stat_daily", filter: { site_id: "5b0fa38f5bd6e200017ec3fb", o: { $type: 2 }, oid: { $type: 2 } }, sort: { datetime: -1 }, projection: { datetime: 1, _id: 1 }, limit: 1, singleBatch: true, $readPreference: { mode: "secondaryPreferred" }, $db: "ace_stat" } planSummary: IXSCAN { site_id: 1, datetime: 1 } keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:1 nreturned:0 reslen:107 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 34905ms
2023-04-01T06:49:06.464+0000 I COMMAND  [conn18] command ace.networkconf command: find { find: "networkconf", filter: { site_id: "5abd38d872d1f20001663faf", purpose: "wan" }, sort: {}, projection: {}, limit: 3000, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: IXSCAN { site_id: 1 } keysExamined:4 docsExamined:4 cursorExhausted:1 numYields:5 nreturned:1 reslen:673 locks:{ Global: { acquireCount: { r: 12 } }, Database: { acquireCount: { r: 6 } }, Collection: { acquireCount: { r: 6 } } } protocol:op_query 42995ms
2023-04-01T06:49:17.329+0000 I COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 171ms
2023-04-01T06:49:21.667+0000 I COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 292ms
2023-04-01T06:49:25.516+0000 I COMMAND  [PeriodicTaskRunner] task: UnusedLockCleaner took: 129ms
2023-04-01T06:49:26.879+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 1585, after asserts: 3422, after backgroundFlushing: 3828, after connections: 4031, after dur: 4119, after extra_info: 4886, after globalLock: 5017, after locks: 5380, after logicalSessionRecordCache: 5719, after network: 5981, after opLatencies: 6390, after opReadConcernCounters: 7990, after opcounters: 10045, after opcountersRepl: 12521, after oplogTruncation: 15956, after repl: 17666, after security: 19490, after storageEngine: 22360, after tcmalloc: 24662, after transactions: 26182, after transportSecurity: 27837, after wiredTiger: 31602, at end: 35880 }
2023-04-01T06:49:45.964+0000 I WRITE    [conn22] remove ace_stat.stat_dpi command: { q: { site_id: "5abd38d872d1f20001663faf", o: "dpi", oid: "60:8b:0e:6b:46:df" }, limit: 0 } planSummary: IXSCAN { site_id: 1, o: 1, oid: 1, datetime: 1, app: 1, cat: 1 } keysExamined:10 docsExamined:10 ndeleted:10 keysDeleted:20 numYields:11 locks:{ Global: { acquireCount: { r: 12, w: 12 } }, Database: { acquireCount: { w: 12 } }, Collection: { acquireCount: { w: 12 } } } 178528ms
2023-04-01T06:49:55.067+0000 I COMMAND  [conn22] command ace_stat.$cmd command: delete { delete: "stat_dpi", ordered: false, $db: "ace_stat" } numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 12, w: 12 } }, Database: { acquireCount: { w: 12 } }, Collection: { acquireCount: { w: 12 } } } protocol:op_query 189280ms
2023-04-01T06:50:04.049+0000 I COMMAND  [conn20] command ace.rogueknown command: find { find: "rogueknown", filter: { site_id: "5abd38d872d1f20001663faf" }, sort: {}, projection: {}, limit: 3000, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: IXSCAN { site_id: 1 } keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:1 nreturned:0 reslen:102 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 20542ms
2023-04-01T06:50:04.103+0000 I WRITE    [conn12] update ace_stat.stat_life command: { q: { site_id: "5abd38d872d1f20001663faf", o: "user", oid: "dc:a6:32:02:73:58" }, u: { $set: { datetime: new Date(1680331769027) }, $inc: { duration: 402186, tx_bytes: 0, tx_packets: 0, wifi_tx_attempts: 0, rx_bytes: 0, rx_packets: 0, bytes: 0 } }, multi: true, upsert: false } planSummary: IXSCAN { site_id: 1, o: 1, oid: 1, datetime: 1 } keysExamined:0 docsExamined:0 nMatched:0 nModified:0 numYields:2 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 19348ms
2023-04-01T06:50:07.551+0000 I COMMAND  [conn15] command ace.site command: find { find: "site", filter: { _id: ObjectId('5abd38d872d1f20001663faf') }, projection: {}, limit: 1, singleBatch: true, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:2 nreturned:1 reslen:261 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 23018ms
2023-04-01T06:50:07.551+0000 I COMMAND  [conn16] command ace.setting command: find { find: "setting", filter: { key: "locale", site_id: "5abd38d872d1f20001663faf" }, projection: {}, limit: 1, singleBatch: true, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: IXSCAN { key: 1, site_id: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:2 nreturned:1 reslen:212 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 23018ms
2023-04-01T06:50:07.551+0000 I COMMAND  [conn18] command ace.wlanconf command: find { find: "wlanconf", filter: { _id: ObjectId('5ad035631a9eb4000135970f'), site_id: "5abd38d872d1f20001663faf" }, projection: {}, limit: 1, singleBatch: true, $readPreference: { mode: "secondaryPreferred" }, $db: "ace" } planSummary: IXSCAN { _id: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:2 nreturned:1 reslen:1397 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 23018ms
2023-04-01T06:50:09.949+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 968, after asserts: 2480, after backgroundFlushing: 3761, after connections: 5464, after dur: 6667, after extra_info: 6702, after globalLock: 6897, after locks: 8491, after logicalSessionRecordCache: 9605, after network: 12154, after opLatencies: 13469, after opReadConcernCounters: 14197, after opcounters: 14396, after opcountersRepl: 14396, after oplogTruncation: 15330, after repl: 16091, after security: 16195, after storageEngine: 17994, after tcmalloc: 19459, after transactions: 20207, after transportSecurity: 20562, after wiredTiger: 21248, at end: 29958 }
2023-04-01T06:50:10.355+0000 I COMMAND  [conn12] command ace_stat.$cmd command: update { update: "stat_life", ordered: false, $db: "ace_stat" } numYields:0 reslen:59 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } protocol:op_query 28071ms
2023-04-01T06:50:12.323+0000 I COMMAND  [conn24] command ace_stat.stat_hourly command: find { find: "stat_hourly", filter: { site_id: "5b0fa38f5bd6e200017ec3fb", o: { $type: 2 }, oid: { $type: 2 }, datetime: { $gte: new Date(1680220800000), $lt: new Date(1680307200000) } }, sort: {}, projection: {}, limit: 3000, $readPreference: { mode: "secondaryPreferred" }, $db: "ace_stat" } planSummary: IXSCAN { datetime: 1, o: 1, site_id: 1, oid: 1 } keysExamined:1 docsExamined:0 cursorExhausted:1 numYields:2 nreturned:0 reslen:108 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 28594ms
2023-04-01T06:50:29.930+0000 I COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 283ms
2023-04-01T06:50:29.932+0000 I NETWORK  [conn22] end connection 127.0.0.1:50068 (16 connections now open)
2023-04-01T06:50:29.932+0000 I NETWORK  [conn21] end connection 127.0.0.1:41956 (17 connections now open)
2023-04-01T06:50:34.510+0000 I COMMAND  [PeriodicTaskRunner] task: DBConnectionPool-cleaner took: 134ms
2023-04-01T06:50:37.989+0000 I COMMAND  [PeriodicTaskRunner] task: UnusedLockCleaner took: 430ms
2023-04-01T06:50:41.037+0000 I WRITE    [conn25] update ace.device command: { q: { site_id: "5abd38d872d1f20001663faf", mac: "b4:fb:e4:80:eb:c9" }, u: { $set: { inform_url: "http://10.0.0.6:8080/inform" } }, multi: false, upsert: false } planSummary: IXSCAN { site_id: 1 } keysExamined:5 docsExamined:5 nMatched:1 nModified:1 numYields:8 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 } }, Collection: { acquireCount: { w: 9 } } } 57027ms
2023-04-01T06:50:43.370+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 1331, after asserts: 2379, after backgroundFlushing: 2618, after connections: 2826, after dur: 2826, after extra_info: 3331, after globalLock: 3857, after locks: 4988, after logicalSessionRecordCache: 6127, after network: 7713, after opLatencies: 9705, after opReadConcernCounters: 11537, after opcounters: 12007, after opcountersRepl: 12007, after oplogTruncation: 13657, after repl: 14790, after security: 15191, after storageEngine: 16524, after tcmalloc: 17180, after transactions: 17705, after transportSecurity: 17908, after wiredTiger: 19563, at end: 25736 }
2023-04-01T06:50:46.763+0000 I COMMAND  [conn25] command ace.$cmd command: update { update: "device", ordered: false, $db: "ace" } numYields:0 reslen:59 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 } }, Collection: { acquireCount: { w: 9 } } } protocol:op_query 64089ms

That last entry before it died shows an operation taking 64 seconds.
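
A rough way to pull the worst offenders out of the mongod log (the path below is the usual UniFi location; it may differ depending on how the data volume is mounted):

# show any mongod operation that took 10 seconds or longer
grep -E '[0-9]{5,}ms$' /usr/lib/unifi/logs/mongod.log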

johnboiles commented 1 year ago

I'm close to giving up here and going back to 32-bit and being forever stuck on 6.5.55 unless you have any ideas.

johnboiles commented 1 year ago

Interestingly, on 32-bit Raspbian 6.5.55 (the data on the right) mongo uses more memory but java uses significantly less. Not sure what that means. One hypothesis: perhaps if Mongo is starved for memory it reads the disk a ton trying to fill a cache it can never fill? Not sure why it would be different between 64-bit and 32-bit OS though.

[screenshot]

Edit: I guess 64-bit pointers are twice as big as 32-bit pointers, which would explain Java's decrease in memory usage moving back to 32-bit. But it's a mystery to me why Mongo would be using more memory under 32-bit 🤷
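
If the starved-cache theory is right, one knob that might be worth trying is pinning mongod's WiredTiger cache explicitly through the extraargs property used earlier (a sketch, assuming extraargs is passed straight through to the mongod command line; 0.25GB is the smallest value mongod accepts, and whether it helps here is untested):

unifi.db.extraargs=--quiet --wiredTigerCacheSizeGB 0.25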

johnboiles commented 1 year ago

Maybe you could run 32-bit UniFi in one container and 64-bit Mongo in another. Not the most elegant solution, perhaps, but it might work. At the very least it should free up ~200MB of RAM on the java side.
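
A rough sketch of what that split could look like in compose (heavily hedged: the mongo tag is hypothetical and would need to be an arm64 build this controller version supports, and it assumes the controller can be pointed at an external mongo via the usual system.properties keys such as db.mongo.local=false and db.mongo.uri, which I haven't verified against this image):

services:
  unifi:
    image: ryansch/unifi-rpi:latest
    platform: linux/arm/v7    # 32-bit controller, smaller java footprint
    depends_on:
      - mongo
  mongo:
    image: mongo:3.6          # hypothetical tag; needs an arm64 build that runs on the Pi
    platform: linux/arm64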

johnboiles commented 1 year ago

24hrs in and my Pi is very happy running as 32-bit, even on the original 4-year-old Samsung Endurance card. This is definitely a happy Pi:

[screenshot]
spmason commented 1 year ago

I get random lockups when backing up on my Pi 4 too. It's a new Pi and a new SD card, so it's not related to that; it seems to be memory-related.

Running with these docker flags "helps" it not to lock up:

--env JAVA_OPTS="-Xmx512M" --memory 728m

It stops the lockups because it controls the container's runaway memory usage, which left unchecked causes the OS to start swapping like crazy and "lock up".

It also means that nightly backups don't work, as I suspect it's those that are causing the high memory usage in the first place.
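
For anyone running this from the README's compose.yml rather than raw docker flags, the rough equivalent would be (a sketch; mem_limit is the compose key docker compose honors outside swarm mode):

services:
  unifi:
    environment:
      JAVA_OPTS: -Xmx512M
    mem_limit: 728m   # hard cap so runaway usage hits the container, not the whole Pi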

ryansch commented 1 year ago

Note: The newly released 7.5.x officially discontinues support for 32-bit deployments.

ryansch commented 8 months ago

I don't see a way to support older Pis at this point.

johnboiles commented 8 months ago

Makes sense to me. Less of an issue now that you can buy new Pis again!