jitsi / jitsi-videobridge

Jitsi Videobridge is a WebRTC-compatible video router, or SFU, that lets you build highly scalable video conferencing infrastructure (i.e., up to hundreds of conferences per server).
https://jitsi.org/jitsi-videobridge
Apache License 2.0

jvb leaking file descriptors #245

Closed: gardenia closed this issue 7 years ago

gardenia commented 8 years ago

Attachment: jvb_leak.txt

I am running jitsi-videobridge (726-1) on Debian Jessie with openjdk-8 (8u72-b15-1~bpo8+1).

I am seeing a problem with file descriptor leakage: the process eventually maxes out its ulimit (4096) and becomes wedged (needs to be restarted).

it seems to leak approx 16 FDs every 10 seconds or so.
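As a rough sanity check on those numbers (my back-of-the-envelope arithmetic, not a precise measurement): ~16 FDs every 10 seconds is ~96 FDs per minute, so a 4096 ulimit would be exhausted in roughly 4096 / 16 × 10 s ≈ 43 minutes if none of them are ever released, which is consistent with the bridge eventually wedging.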

I establish this by running: lsof -p $(pgrep -u jvb)

from which I can see an ever-growing list of these:

java 1499 jvb 2337u sock 0,7 0t0 1514853 can't identify protocol
java 1499 jvb 2338u sock 0,7 0t0 1529373 can't identify protocol
java 1499 jvb 2339u sock 0,7 0t0 1514860 can't identify protocol
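For anyone wanting to reproduce the measurement, a minimal loop along these lines is effectively what I am doing (it assumes exactly one process owned by the jvb user, and greps for the unidentified sockets shown above):

#!/bin/bash
# count the jvb process's unidentified socket FDs every 10 seconds
# (assumes exactly one process owned by the jvb user)
PID=$(pgrep -u jvb)
while true; do
    COUNT=$(lsof -p "$PID" 2>/dev/null | grep -c "can't identify protocol")
    echo "$(date) unidentified sockets: $COUNT"
    sleep 10
done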

are there any known issues in this area?

in the jvb logs I see the attached repeated pattern of activity that seems to correlate with the leakage.

is this a normal pattern? I'm pretty sure that there are actually no active conferences.

why would I keep seeing this logging line?

JVB 2016-05-18 14:21:26.815 INFO: [226258] org.jitsi.videobridge.IceUdpTransportManager.info() Failed to connect IceUdpTransportManager: net.java.sip.communicator.service.protocol.OperationFailedException: TransportManager closed

this seems like a potential source of FD allocation.

if you can suggest extra logging I can enable, extra information I can provide, or things I can dig into, I'm happy to do so.

gardenia commented 8 years ago

From @bgrozev

Thank you for reporting this! It is not normal, and not known AFAIK. I checked some of our currently running machines, and I don't see any significant number of FDs for the bridge process. However, in the logs from our performance test environment I see consistently many (over 9000) open FDs at the end of the test.

I suspect that health-checks are what triggers the problem in your case (without any conferences). On a test machine, I observe exactly 16 new FDs right after I run a health-check against a bridge[0], and jicofo sends health-check requests once every 10 seconds by default, so it seems consistent with your observations. But in my case they are eventually closed after 5-6 minutes.

You can disable health checks from jicofo by adding this to /etc/jitsi/jicofo/sip-communicator.properties: org.jitsi.jicofo.HEALTH_CHECK_INTERVAL=-1
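For reference, the resulting entry in the file would simply be (file path as given above; -1 disables the periodic checks):

# /etc/jitsi/jicofo/sip-communicator.properties
# disable jicofo's periodic health checks against the bridge
org.jitsi.jicofo.HEALTH_CHECK_INTERVAL=-1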

gardenia commented 8 years ago

I might have figured out why I'm having the issue on this particular machine. In the logfile snippet I sent you (initial issue description) I noticed that there were 2 distinct IPs referenced in the "org.ice4j.ice.Agent.createComponent" logging lines. It turns out that particular machine has a second virtual IP (eth0:1) which explains the second IP (i.e. the eth0 and eth0:1 public IPs both go to the same physical machine).

I contrasted this with the machine not experiencing the leak which only has a single NIC (eth0) and I wondered if the leak could be due to some circular problem relating to the 2 IPs reaching the same endpoint.

anyway, since I am not actively using the eth0:1 IP on the problem machine I disabled it and now I don't believe I am seeing the same leakage.
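For the record, toggling the alias while testing was just the following (interface name from my setup; the address placeholder is whatever the alias normally carries, and the commands assume the classic net-tools ifconfig):

# take the virtual IP down for the test
sudo ifconfig eth0:1 down
# ... run the health-check loop and watch lsof ...
# bring the alias back afterwards
sudo ifconfig eth0:1 <address> netmask 255.255.255.0 up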

I will continue to monitor it and report back in a day or so if it continues to be stable. I'm not sure if this is a bug or not; it seems like it could happen to anyone on a machine with multiple NICs, and in that regard it does not seem like user error. I'm not sure of the nuts and bolts of why the secondary IP is problematic. perhaps it would be good if the IPs which are used by JVB could be overridden. I'm not sure.

gardenia commented 8 years ago

I ran the following script overnight (no jicofo just using curl to hit the heartbeat REST API).

#!/bin/bash
# hit the bridge's REST health check endpoint and log the jvb user's FD count on each pass
for i in `seq 1 10000`; do
    echo "[$(date)] Iteration $i [FDS: $(lsof -u jvb 2>/dev/null | wc -l)]"
    curl http://localhost:8080/about/health
    sleep 3
done

attached is the jvb.log; the FD low-watermark seemed to be in the 700s when the script completed.

Attachments: jvb.log.gz lsof.txt

bgrozev commented 8 years ago

In my environment I don't see the same. I'm running health checks every 1s, and lsof every 30s, and the results so far (after about 2 hours) are here: https://docs.google.com/spreadsheets/d/1j1q2MH_9yqy5wbbU-eZsyeSGTYGwo6ruMV9pHni6rHk/edit?usp=sharing

I'll let it run longer and also run it on java 8.

gardenia commented 8 years ago

I just tried your approach of 1-second pauses between health checks:

#!/bin/bash
# same loop as before, but with a 1-second pause between health checks
for i in `seq 1 10000`; do
    echo "[$(date)] Iteration $i [FDS: $(lsof -u jvb 2>/dev/null | wc -l)]"
    curl http://localhost:8080/about/health
    sleep 1
done

I just hit max FDs.

resulting jvb.log: jvb2.log.gz

lsof: lsof.txt

FDs sampled roughly once per second: fds_sample.txt

any ideas?

I'm wondering about this piece of logging:

INFO: [189614] org.jitsi.videobridge.IceUdpTransportManager.info() Failed to connect IceUdpTransportManager: net.java.sip.communicator.service.protocol.OperationFailedException: TransportManager closed

I'm wondering if each instance of this could result in leaked resources. it may also be totally unrelated. I really wish I were more skilled with Java profiling tools. I played around with heap dumps but I can't really zero in on where the problem could be.
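In case it helps anyone else digging into this, the commands I have been poking at are roughly these (standard JDK and Linux tools, nothing jitsi-specific; run as the jvb user or root, and the single-PID assumption is about my own box):

# quick FD count without lsof
ls /proc/"$(pgrep -u jvb)"/fd | wc -l
# heap dump of live objects for offline analysis (requires a JDK; open with VisualVM, MAT, etc.)
jmap -dump:live,format=b,file=/tmp/jvb.hprof "$(pgrep -u jvb)"
# thread dump, in case something is stuck holding sockets open
jstack "$(pgrep -u jvb)" > /tmp/jvb-threads.txt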

bgrozev commented 8 years ago

any ideas?

Not really. It must be related to some difference between the environments, but I have nothing concrete.

I'm wondering about this piece of logging: INFO: [189614] org.jitsi.videobridge.IceUdpTransportManager.info() Failed to connect IceUdpTransportManager: net.java.sip.communicator.service.protocol.OperationFailedException: TransportManager closed

I see the same message many times in my logs, too.

gardenia commented 8 years ago

@bgrozev it looks (from your sysctl output) as if you only have one interface (eth0) on the machine you are running this on. when I compare this against the machine I was doing the REST-based health checks on, I have multiple interfaces. I'm wondering if having multiple interfaces involved in each health check somehow exaggerates the symptoms I am seeing compared to what you are seeing.

FWIW, the machine I am running this particular test on is Ubuntu 14.04 and it has 5 interfaces since I have both virtualbox and vmware on it (eth0, vboxnet0, virbr0, vmnet1 and vmnet8).

I will see if I can modify the jvb code to not include some of those interfaces in the health check, to see if it changes the outcome. if you do have a multiple-interface machine, it would help if you could try it on that.

bgrozev commented 8 years ago

True, the machine I used has one interface (although it also has a virtual eth0:1, and it also advertises its public address found through the AWS API).

You can limit the interfaces used by jitsi-videobridge with a property, see here: https://github.com/jitsi/jitsi-videobridge/blob/master/doc/tcp.md#orgice4jiceharvestallowed_interfaces
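For example, something like this in the bridge's properties file (the /etc/jitsi/videobridge/ path is my assumption based on the Debian package layout; adjust to wherever your sip-communicator.properties lives):

# /etc/jitsi/videobridge/sip-communicator.properties
# restrict ice4j candidate harvesting to a single interface
org.ice4j.ice.harvest.ALLOWED_INTERFACES=eth0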

gardenia commented 8 years ago

thanks. with this: org.ice4j.ice.harvest.ALLOWED_INTERFACES=eth0

the FD growth is much slower but still seems to be apparent.

if you don't mind telling me: what type of AWS instance are you running? i.e., which AMI and what instance size? I will perhaps try to duplicate it on the same setup to see if I can reconcile why the problem doesn't seem to be apparent there.

I'm wondering if AWS is somehow a special case because all the physical machines I have tried have exhibited the problem.

I guess the multiple-interface situation could be reproduced on AWS by attaching a bunch of Elastic IPs to the same instance as virtual IPs (eth0:1, eth0:2, etc.).

bgrozev commented 8 years ago

One of the machines is a t2.micro; I don't know the AMI ID, but it is running Ubuntu 14.04.

gardenia commented 8 years ago

I guess it doesn't matter hugely, but it would be good to know the exact AMI ID just for uniformity. if you don't mind, you could get it by running the following on the instance:

wget -q -O - http://instance-data/latest/meta-data/ami-id

Thanks.

bgrozev commented 8 years ago

ami-4d8b2626

bgrozev commented 7 years ago

These particular leaks have been fixed.