tsduck / tsduck

MPEG Transport Stream Toolkit
https://tsduck.io

ip output continuity loss #610

Open · tarelda opened this issue 4 years ago

tarelda commented 4 years ago

I am experiencing frame loss in my stream while processing it on machine A. I usually develop my solutions on machine B, which yields a packet-loss-free stream with the same pipeline. These are similar Supermicro servers connected to exactly the same brand and model of switch. The differences are in the CPUs (E5-2620v2 vs E5-2650) and RAM (32GB vs 64GB). Both run Ubuntu 18.04 with the latest stable Docker. In my setup, TSDuck runs in a container whose interface is bound via macvlan to a VLAN interface on the physical interface (ethx.vid).

How did I find out about the continuity loss? I ran the continuity plugin on a client machine and it reported that every few seconds a few packets are lost, mostly in the video PID but also in the audio. To debug further, I added the continuity plugin into my pipeline just before -O ip; it shows no discontinuity there (see the pipeline sketch after the counter dumps below). I checked the interface counters on Linux for any errors, discards, overflows, etc. I used ethtool -S, netstat -i udp, /proc/net/snmp and /proc/[PID]/net/snmp. Nothing found there. Interestingly though, the counters for the switch port that machine A uses show some discards, but nowhere near the severity of the discontinuities:

Packets Received Without Error................. 7511703
Packets Received With Error.................... 0
Broadcast Packets Received..................... 0
Receive Packets Discarded...................... 226
Packets Transmitted Without Errors............. 13168206
Transmit Packets Discarded..................... 0
Transmit Packet Errors......................... 0
Collision Frames............................... 0
Number of link down events..................... 0
Load Interval.................................. 5
Bits Per Second Received....................... 28330432
Bits Per Second Transmitted.................... 46793368
Packets Per Second Received.................... 2593
Packets Per Second Transmitted................. 4284
Time Since Counters Last Cleared............... 0 day 0 hr 52 min 37 sec

This switch's uplink is connected to another switch, whose port counters interestingly show some input errors:

GigabitEthernet1/0/26 is up, line protocol is up (connected) 
  Hardware is Gigabit Ethernet, address is 1cde.a7a4.3f9a (bia 1cde.a7a4.3f9a)
  Description: 
  MTU 1500 bytes, BW 1000000 Kbit/sec, DLY 10 usec, 
     reliability 255/255, txload 15/255, rxload 6/255
  Encapsulation ARPA, loopback not set
  Keepalive not set
  Full-duplex, 1000Mb/s, link type is auto, media type is 1000BaseLX SFP
  input flow-control is off, output flow-control is unsupported 
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 00:00:08, output hang never
  Last clearing of "show interface" counters 00:18:27
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 26402000 bits/sec, 2409 packets/sec
  5 minute output rate 61383000 bits/sec, 5613 packets/sec
     2687022 packets input, 3670212357 bytes, 0 no buffer
     Received 2687022 broadcasts (2687022 multicasts)
     0 runts, 0 giants, 0 throttles 
     90 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 2687022 multicast, 0 pause input
     0 input packets with dribble condition detected
     6222078 packets output, 8496007142 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
     0 unknown protocol drops
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 pause output
     0 output buffer failures, 0 output buffers swapped out
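
For reference, a minimal sketch of the pipeline layout described above, with the continuity plugin placed just before the ip output. The input side, multicast addresses and port are hypothetical placeholders, not the real production values:

# receive the stream over UDP, verify continuity counters, then re-send over UDP
tsp -I ip 239.1.1.1:5004 \
    -P continuity \
    -O ip 239.2.2.2:5004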

I also tried to: a) increase the backlog, UDP memory buffers and net buffers, with no effect; b) play with the ip output burst settings, but that only made the problem worse.
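
For context, this is roughly the kind of tuning attempted; the values are illustrative examples, not the ones actually used, and which tsp options count as "burst settings" here is my assumption:

# kernel backlog and socket/UDP buffer limits (example values only)
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.rmem_max=26214400
sysctl -w net.core.wmem_max=26214400
sysctl -w net.ipv4.udp_mem="262144 327680 393216"

# ip output burst options in tsp: pack 7 TS packets per UDP datagram
tsp ... -O ip 239.2.2.2:5004 --packet-burst 7 --enforce-burst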

Any ideas what I should look for?

tarelda commented 4 years ago

I deployed machine C, which is basically machine A from a different vendor. It has no other services running and does not exhibit the mentioned issue, all without sysctl tuning, just stock Ubuntu 18.04. To further verify that the issue exists, I tested stream continuity on the switch that machine A is connected to, and the issue is present there too.
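
For what it's worth, checking continuity at a tap point can be done with something like this (address and port are placeholders):

# receive the multicast stream, report continuity errors, discard the packets
tsp -I ip 239.1.1.1:5004 -P continuity -O drop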

tarelda commented 4 years ago

Small follow-up. I recently redeployed another Docker-based machine which is also used for a heavy-traffic role (monitoring), and it started suffering from similar issues. Interestingly, apart from upgrading xenial to bionic, the only change I made was switching from OVS-based networking to macvlan. At the moment my working hypothesis is that the macvlan driver is the culprit in deployments with heavy traffic.
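
For reference, a minimal sketch of how such a macvlan network on top of a VLAN sub-interface is typically created with Docker. The parent interface, VLAN id, subnet and image name below are hypothetical, not my actual configuration:

# macvlan Docker network bound to VLAN 100 on eth0
docker network create -d macvlan \
    --subnet=192.168.100.0/24 --gateway=192.168.100.1 \
    -o parent=eth0.100 mvlan_net

# run the processing container attached to that network
docker run --network mvlan_net --ip 192.168.100.50 tsduck-image tsp ...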