pducharme / UniFi-Video-Controller

Docker for Unifi-Video Controller (Ubiquiti Networks)
199 stars 105 forks source link

Cameras become "disconnected" after upgrading from 3.9.7 to 3.9.8 #106

Closed jameslikeslinux closed 6 years ago

jameslikeslinux commented 6 years ago

I acknowledge that the image containing 3.9.8 is tagged 'beta,' but I thought I'd give it a try. After restarting my container with the new image, all of my cameras became disconnected. It seems to be related to an ems service failure noted in the system log (DEBUG=1):

1535477171.640 2018-08-28 13:26:11.640/EDT: ERROR  ems has quit with rc=139 in ems-service
1535477172.861 2018-08-28 13:26:12.861/EDT: ERROR  ems has quit with rc=139 in ems-service
1535477172.884 2018-08-28 13:26:12.884/EDT: ERROR  [EmsApiWebsocketClient] connect - FAILED connection to uri:wss://127.0.0.1:7440/ - DeploymentException - The HTTP request to initiate the WebSocket connection failed in EmsApiWebsocketClient-Connection
1535477174.888 2018-08-28 13:26:14.888/EDT: ERROR  [EmsApiWebsocketClient] connect - FAILED connection to uri:wss://127.0.0.1:7440/ - DeploymentException - The HTTP request to initiate the WebSocket connection failed in EmsApiWebsocketClient-Connection
1535477180.891 2018-08-28 13:26:20.891/EDT: ERROR  [EmsApiWebsocketClient] connect - FAILED connection to uri:wss://127.0.0.1:7440/ - DeploymentException - The HTTP request to initiate the WebSocket connection failed in EmsApiWebsocketClient-Connection
1535477182.864 2018-08-28 13:26:22.864/EDT: ERROR  Last-chance exception in ems-service
com.ubnt.airvision.service.ems.O0OO: Timeout executing request: 1
    at com.ubnt.airvision.service.ems.request.EmsRequest.getResponseDataAs(Unknown Source) ~[airvision.jar:?]
    at com.ubnt.airvision.service.ems.request.ShutdownServer.getResponse(Unknown Source) ~[airvision.jar:?]
    at com.ubnt.airvision.service.ems.o0OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO.void(Unknown Source) ~[airvision.jar:?]
    at com.ubnt.airvision.service.ems.C.run(Unknown Source) [airvision.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276) ~[guava-14.0.1.jar:?]
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) ~[guava-14.0.1.jar:?]
    ... 5 more
1535477198.895 2018-08-28 13:26:38.895/EDT: ERROR  [EmsApiWebsocketClient] connect - FAILED connection to uri:wss://127.0.0.1:7440/ - DeploymentException - The HTTP request to initiate the WebSocket connection failed in EmsApiWebsocketClient-Connection
1535477202.872 2018-08-28 13:26:42.872/EDT: ERROR  Cannot send EMS CLI Request {"command":"shutdownServer","parameters":{"_messageId":1}}: Unable to establish websocket connection to send message in EmsCliApi-Executor
1535477241.387 2018-08-28 13:27:21.387/EDT: ERROR  HouseKeepingTask Error: Timeout executing request: 2 in StreamManagementService-HouseKeeper
1535477252.899 2018-08-28 13:27:32.899/EDT: ERROR  [EmsApiWebsocketClient] connect - FAILED connection to uri:wss://127.0.0.1:7440/ - DeploymentException - The HTTP request to initiate the WebSocket connection failed in EmsApiWebsocketClient-Connection
1535477261.389 2018-08-28 13:27:41.389/EDT: ERROR  Cannot send EMS CLI Request {"command":"listConfig","parameters":{"_messageId":2}}: Unable to establish websocket connection to send message in EmsCliApi-Executor
1535477271.386 2018-08-28 13:27:51.386/EDT: ERROR  HouseKeepingTask Error: Timeout executing request: 3 in StreamManagementService-HouseKeeper
1535477291.390 2018-08-28 13:28:11.390/EDT: ERROR  Cannot send EMS CLI Request {"command":"listConfig","parameters":{"_messageId":3}}: Unable to establish websocket connection to send message in EmsCliApi-Executor
1535477301.387 2018-08-28 13:28:21.387/EDT: ERROR  HouseKeepingTask Error: Timeout executing request: 4 in StreamManagementService-HouseKeeper

Indeed, the service is not running:

root@52d54376c6d7:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 13:25 ?        00:00:00 /bin/bash /run.sh
root        98     1  0 13:25 ?        00:00:00 unifi-video -cwd /usr/lib/unifi-
unifi-v+    99    98 29 13:25 ?        00:00:38 unifi-video -cwd /usr/lib/unifi-
root       208     0  0 13:26 pts/0    00:00:00 /bin/bash
unifi-v+   255    99  1 13:26 ?        00:00:02 bin/mongod --nohttpinterface --b
root       498     1  0 13:27 ?        00:00:00 sleep 1
root       499   208  0 13:28 pts/0    00:00:00 ps -ef

I am able to roll back to a snapshot of my data and video volumes and launch the container using the 3.9.7 image to restore operation. I did not test a straight downgrade. The 3.9.7 container shows the extra "evostreamms" process that died on the new version:

root@dce1d00fdde5:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 13:32 ?        00:00:00 /bin/bash /run.sh
root        95     1  0 13:32 ?        00:00:00 unifi-video -cwd /usr/lib/unifi-
unifi-v+    96    95 20 13:32 ?        00:00:47 unifi-video -cwd /usr/lib/unifi-
root       194     0  0 13:32 pts/0    00:00:00 /bin/bash
unifi-v+   256    96  1 13:32 ?        00:00:02 bin/mongod --nohttpinterface --b
unifi-v+   288    96  1 13:32 ?        00:00:03 bin/evostreamms /usr/lib/unifi-v
root       681     1  0 13:36 ?        00:00:00 sleep 1
root       682   194  0 13:36 pts/0    00:00:00 ps -ef
Allram commented 6 years ago

I manually updated the FW on my cameras from 4.4.6 to 4.4.7 which is required to be able to communicate with controller 3.9.8 and they showed up as normal again.

FW can be found here: https://community.ubnt.com/t5/UniFi-Video-Blog/UniFi-Video-3-9-8-Release/ba-p/2466987 and you just enter the IP of the camera in your browser and update the FW from there.

jameslikeslinux commented 6 years ago

Thanks for the suggestion.

I have updated all of my cameras' firmware to 4.4.7 both before and after the upgrade and they remain in the "disconnected" state. I have observed that something about the evostreamms process is triggering a fault that kills the process:

> dmesg
...
[786805.741985] traps: evostreamms[22700] general protection ip:557705d4afd8 sp:7ffc29d542f0 error:0 in evostreamms[557705a66000+63a000]
[786807.020994] traps: evostreamms[23294] general protection ip:55e2b68fcfd8 sp:7fff6ca035d0 error:0 in evostreamms[55e2b6618000+63a000]
...

If I run the program manually after starting the container, with:

> docker exec -d unifi-video /usr/lib/unifi-video/bin/evostreamms --uid=unifi-video --gid=unifi-video /usr/lib/unifi-video/conf/evostream/config.lua

then the cameras spring back to life and everything works; but that is a hack, not a fix.

I tried adding launching the container with the --privileged option in case there were new security limitations, but it did not help.

I am launching the container as you have it documented, with the only deviation being that I run mine on a macvlan network adapter, so I don't deal with any port mapping. https://github.com/iamjamestl/puppet-nest/blob/62b6642616fc14814d59d2054ca5d02d80ab285b/manifests/unifi_video.pp#L45

My docker and host software are up-to-date.

> docker info
Containers: 8
 Running: 7
 Paused: 0
 Stopped: 1
Images: 10
Server Version: 18.03.1-ce
Storage Driver: zfs
 Zpool: falcon
 Zpool Health: ONLINE
 Parent Dataset: falcon/docker
 Space Used By Parent: 1771573760
 Space Available: 17958087168
 Parent Quota: no
 Compression: lz4
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489 (expected: 773c489c9c1b21a6d78b5c538cd395416ec50f88)
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: v0.16.1 (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.52-gentoo
Operating System: Gentoo/Linux
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 62.86GiB
Name: falcon
ID: 4ZDE:CDHF:EUUK:GEN2:NTHP:BZVZ:JLRH:NNGA:W5MT:XXEV:LFDW:NI35
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

My problem seems to exactly mirror a post about the upgrade to 3.92 that appears to have been resolved:

https://community.ubnt.com/t5/UniFi-Video/Problem-upgrading-to-3-92/td-p/2256402

Maybe a regression?

jameslikeslinux commented 6 years ago

I rebuilt this image on top of phusion/baseimage:0.10.2 and it is working perfectly now.

fryfrog commented 6 years ago

So it was an issue in baseimage 0.11? :/

fryfrog commented 6 years ago

And is the macvlan so that camera discovery works?

jameslikeslinux commented 6 years ago

@fryfrog, Indeed there does seem to be some incompatibility with baseimage 0.11 and my system. I will try to do some more digging.

I use the macvlan network adapter because I run UniFi Video on a different VLAN from the Docker host and it gives my container a real IP address on my network which keeps video traffic off of my router, eliminates port conflicts with other containers on the host, and, as you said, enables camera discovery.

ruimarinho commented 6 years ago

No issues with 3.9.8/base image 0.11 here.

fryfrog commented 6 years ago

Yeah, same here which is why I haven't downgraded it. :/

fryfrog commented 6 years ago

@iamjamestl: Did you ever figure this out? Or are you just going to fork and stay on base image 0.10 forever? :/

jameslikeslinux commented 6 years ago

@fryfrog, This is turning out to be really hard to diagnose. The general protection fault is literally a fault at the instruction level on the CPU, and my attempts to observe the problem are causing it not to be triggered. For example, I wrapped the faulty process with valgrind and, while it spits out some potential problems, the process doesn't trigger the general protection trap and it seems to run fine.

This is what valgrind spits out:

==361== Memcheck, a memory error detector
==361== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==361== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==361== Command: /usr/lib/unifi-video/bin/evostreamms.real /usr/lib/unifi-video/conf/evostream/config.lua
==361==
==361== Invalid read of size 4
==361==    at 0x3F36F4: ??? (in /usr/lib/unifi-video/bin/evostreamms.real)
==361==    by 0x220AC6: ??? (in /usr/lib/unifi-video/bin/evostreamms.real)
==361==    by 0x2249D9: ??? (in /usr/lib/unifi-video/bin/evostreamms.real)
==361==    by 0x22CEDB: ??? (in /usr/lib/unifi-video/bin/evostreamms.real)
==361==    by 0x1A86FD: ??? (in /usr/lib/unifi-video/bin/evostreamms.real)
==361==    by 0x1A6A7D: ??? (in /usr/lib/unifi-video/bin/evostreamms.real)
==361==    by 0x5BC4B96: (below main) (libc-start.c:310)
==361==  Address 0x6041b50 is 32 bytes before a block of size 48 in arena "client"
==361==

Without debugging symbols I can't do much more. I suspect an incompatibility with the glibc on the container vs my specific kernel or processor which is why this isn't affecting other people. I am going to try some more things and maybe keep this ticket open for a few more days in case others are affected, but I will close it later if I don't find anything else out.

jameslikeslinux commented 6 years ago

Issue still present in 3.9.9. I will just continue building my own image on top of baseimage 0.10.2 (Ubuntu 16.04) until UniFi Video supports Ubuntu 18.04.