troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License
194 stars 86 forks source link

Some VIF related issue "Invalid argument" #178

Closed LouisAtGH closed 3 years ago

LouisAtGH commented 3 years ago

I just installed the latest build on pFsense (FreeBSD).

PIMD working, however I immediately noticed an abnormal high memory uses. So I looked at the logs and saw thousands of vif argument related ("identical") messages.

A couple of examples below.

I checked syslog. The messages are occuring since I did updated half an hour ago to latest pfSense build using the same PIMD-build (yesterdays master branch) as yesterday.

So something is wrong, do not know what.

Let me know if I have to check or test something.

Louis

Example messages:

Nov   5 08:23:34 pfSense pimd[35519]: Failed adding MFC entry src  192.168.116.33 grp 239.255.255.250 from ix1.116 to lagg0.16,  register_vif0: Invalid argument
--
Nov   5 08:23:34 pfSense pimd[35519]: Failed adding MFC entry src  192.168.1.3 grp 239.255.255.250 from lagg0.16 to ix1.116, register_vif0:  Invalid argument
Nov   5 08:23:34 pfSense pimd[35519]: Failed adding MFC entry src  192.168.1.2 grp 239.255.255.250 from lagg0.16 to ix1.116, register_vif0:  Invalid argument
Nov   5 08:23:34 pfSense pimd[35519]: Failed adding MFC entry src  192.168.1.3 grp 239.255.255.250 from lagg0.16 to ix1.116, register_vif0:  Invalid argument
Nov   5 08:23:34 pfSense pimd[35519]: Failed adding MFC entry src  192.168.1.2 grp 239.255.255.250 from lagg0.16 to ix1.116, register_vif0:  Invalid argument
Nov  5 08:23:34 pfSense root[38078]: Bootup complete

Nov  5 08:50:09 pfSense pimd[35519]: Failed removing MFC entry src 192.168.1.36, grp 239.255.255.250: Invalid argument
--
Nov  5 08:50:09 pfSense pimd[35519]: Failed removing MFC entry src 192.168.1.3, grp 239.255.255.250: Invalid argument
Nov  5 08:50:09 pfSense pimd[35519]: Failed removing MFC entry src 192.168.1.2, grp 239.255.255.250: Invalid argument
Nov  5 08:50:09 pfSense pimd[35519]: pimd version 3.0-beta1 exiting.
LouisAtGH commented 3 years ago

I did select some more alarm ^types^ from the log (Below).

Note that:

Here the selected syslog messages

Nov 5 08:23:26 pfSense pimd[23546]: Failed adding MFC entry src 192.168.116.33 grp 239.255.255.250 from ix1.116 to register_vif0: Invalid argument
Nov 5 08:23:26 pfSense pimd[23546]: Failed adding MFC entry src 192.168.116.33 grp 239.255.255.250 from ix1.116 to lagg0.26, register_vif0: Invalid argument
Nov 5 08:23:29 pfSense pimd[23546]: Failed adding MFC entry src 192.168.116.33 grp 239.255.255.250 from ix1.116 to lagg0.26, lagg0.16, ix0.14, register_vif0: Invalid argument
Nov 5 08:23:31 pfSense pimd[23546]: pimd version 3.0-beta1 exiting.
Nov 5 08:23:31 pfSense pimd[23546]: Failed removing MFC entry src 192.168.1.2, grp 239.255.255.250: Invalid argument
Nov 5 08:23:31 pfSense pimd[23546]: Failed removing MFC entry src 192.168.1.3, grp 239.255.255.250: Invalid argument
Nov 5 08:23:31 pfSense pimd[23546]: Failed removing MFC entry src 192.168.116.33, grp 239.255.255.250: Invalid argument

Nov 5 08:23:51 pfSense pimd[35519]: Failed adding MFC entry src 192.168.1.3 grp 239.255.255.250 from lagg0.16 to lagg0.26, ix0.14, ix1.116, register_vif0: Invalid argument
Nov 5 08:23:49 pfSense pimd[35519]: Failed adding MFC entry src 192.168.1.2 grp 239.255.255.250 from lagg0.16 to lagg0.26, ix0.14, ix1.116, register_vif0: Invalid argument
Nov 5 08:25:15 pfSense pimd[35519]: Failed reading kernel count (SIOCGETSGCNT) for (S,G) on (192.168.1.2, 239.255.255.250): Can't assign requested address
Nov 5 08:25:15 pfSense pimd[35519]: Failed reading kernel count (SIOCGETSGCNT) for (S,G) on (192.168.1.36, 239.255.255.250): Can't assign requested address
Nov 5 08:27:10 pfSense pimd[35519]: Failed removing MFC entry src 192.168.116.33, grp 239.255.255.250: Invalid argument
Nov 5 08:33:05 pfSense pimd[35519]: Failed adding MFC entry src 192.168.14.15 grp 239.255.255.250 from ix0.14 to lagg0.26, lagg0.16, ix1.116, register_vif0: Invalid argument
troglobit commented 3 years ago

Have really no idea what's going on here. Maybe pimd starts too soon before interfaces are up properly? I noticed that pfSense root[38078]: Bootup complete is printed after a set of pimd log entries, possibly an ordering problem in pfSense. In mrouted we added a command line options for cases like this; -w, --startup-delay=SEC Startup delay before forwarding, which prevents the daemon from attempting to create VIFs before a certain (system/site specific) delay has passed. I've been meaning to add something like that to pimd as well, but haven't gotten around to it yet.

Could you please share your pimd.conf (if you have one) and the output from ifconfig when pimd starts. I'd like to rule out problems with configuration or system first.

LouisAtGH commented 3 years ago

There seems to be a path issue:

I did copy the correct pimd.conf to /usr/local/etc and started pimd again. Does not seem to solve the problem

Have to think about this.

LouisAtGH commented 3 years ago

I did not change the interface configuration or the pimd.conf for month. But of course

Updating pfSense release from a couple of weeks back to the actual version, did provoke this problem. Not saying it caused this problem.

I will rebuild the pimd related pfSense package just to be sure (I do not expect that to help, but just to be sure) Update: As expected that did not change/solve anything Update: PIMD has been running all day ("working" however generating errors all the time). and not working properly I think

Here the requested files.

Louis 20201106 pimd.conf.txt 20201106 ifconfig.txt

Adding part of syslog

20201106 part_syslog.txt

LouisAtGH commented 3 years ago

For info actual pimctl "default" report. Not sure the multicast routes are really there

PIMD Routes

VIF  Local Address    Subnet              Thresh  Flags      Neighbors
===============================================================================
  0  192.168.2.1      192.168.2                1  DR NO-NBR
  1  192.168.13.1     192.168.13               1  DR NO-NBR
  2  192.168.1.1      192.168.1                1  DR NO-NBR
  3  192.168.14.1     192.168.14               1  DR NO-NBR
  4  192.168.116.1    192.168.116              1  DR NO-NBR
  5  192.168.2.1      register_vif0            1 

Source           Group            RP Address       Flags              (*,G)
===============================================================================
*                239.255.255.250  192.168.14.15    WC RP
Joined   oifs: ......              
Pruned   oifs: ......              
Leaves   oifs: l.lll.              
Asserted oifs: ......              
Outgoing oifs: o.ooo.              
Incoming     : ...I..              

TIMERS:  Entry    JP    RS  Assert VIFS:  0  1  2  3  4  5
             0    35     0       0        0  0  0  0  0  0

Source           Group            RP Address       Flags              (S,G)
===============================================================================
192.168.1.3      239.255.255.250  192.168.14.15    SPT CACHE SG
Joined   oifs: .....j              
Pruned   oifs: ......              
Leaves   oifs: l.lll.              
Asserted oifs: ......              
Outgoing oifs: o..ooo              
Incoming     : ..I...              

TIMERS:  Entry    JP    RS  Assert VIFS:  0  1  2  3  4  5
           185    60     0       0        0  0  0  0  0  0

Source           Group            RP Address       Flags              (S,G)
===============================================================================
192.168.1.36     239.255.255.250  192.168.14.15    SPT CACHE SG
Joined   oifs: .....j              
Pruned   oifs: ......              
Leaves   oifs: l.lll.              
Asserted oifs: ......              
Outgoing oifs: o..ooo              
Incoming     : ..I...              

TIMERS:  Entry    JP    RS  Assert VIFS:  0  1  2  3  4  5
           170    25     0       0        0  0  0  0  0  0

Source           Group            RP Address       Flags              (S,G)
===============================================================================
192.168.1.42     239.255.255.250  192.168.14.15    SPT CACHE SG
Joined   oifs: .....j              
Pruned   oifs: ......              
Leaves   oifs: l.lll.              
Asserted oifs: ......              
Outgoing oifs: o..ooo              
Incoming     : ..I...              

TIMERS:  Entry    JP    RS  Assert VIFS:  0  1  2  3  4  5
           185    45     0       0        0  0  0  0  0  0

Source           Group            RP Address       Flags              (S,G)
===============================================================================
192.168.14.15    239.255.255.250  192.168.14.15    CACHE SG
Joined   oifs: .....j              
Pruned   oifs: ......              
Leaves   oifs: l.lll.              
Asserted oifs: ......              
Outgoing oifs: o.o.oo              
Incoming     : ...I..              

TIMERS:  Entry    JP    RS  Assert VIFS:  0  1  2  3  4  5
           185    50     0       0        0  0  0  0  0  0
Number of Groups: 1
Number of Cache MIRRORs: 4

RP address       Incoming  Group Prefix        Priority  Holdtime
===============================================================================
192.168.14.15    3                                       Static
                           239/8               1         Forever
169.254.0.1      32                                      Static
                           232/8               1         Forever

Current BSR address: 0.0.0.0
LouisAtGH commented 3 years ago

Perhaps I am terribly wrong, but I noticed that all alarms are refering to vif0, and vif0 is special I think, but the involved devices are not related to vif0 at all .....

As said perhaps very stupid remark ....

Louis

troglobit commented 3 years ago

No clue really what's going on. I'll have to see if I can set up something similar to see if I can reproduce. Just for information: the vif0 you mention is actually register_vif0, which is an interface pimd creates. It is used for the register tunnel towards the Rendez-vous Point (RP), so something fails when pimd tries to create routing entries to forward multicast to the RP.

LouisAtGH commented 3 years ago

If help full I can run a pimd instance with extra debug lines added ....

troglobit commented 3 years ago

Thanks, maybe later. Meanwhile, could you verify a theory for me? I suspect at least parts of the log messages you see are bogus due to the incoming multicast traffic having TTL=1. (No routing entry is actually set in kernel, and when trying to remove it we get these bogus messages trying to remove something that isn't there). You can check the TTL of the 239.255.255.250 group, on the incoming interface, using tcpdump with the -vvv flag, e.g.,

tcpdump -lni ix0.14 -vvv

The 239.255.255.250 group is used by Windows machines for their Simple Service Discovery Protocol (SSDP), which is a link-local (only) protocol similar to (but worse than) mDNS.

LouisAtGH commented 3 years ago

Here the dump (captured form commandline) 20201107 tcpdmup_lni_ix014_vvv.txt

Note that the connection towards media-server are not working, despite that pimd is doing something

troglobit commented 3 years ago

Weird, there was nothing about 239.255.255.250 in that dump. Could you run a tcpdump while starting up pimd? When it starts up it sends out a lot of packets, including an IGMP query, that should trigger responses from end devices.

LouisAtGH commented 3 years ago

Joachim,

I made two traces

debugfile_lagg0_16.txt debugfile_ix0_14.txt

LouisAtGH commented 3 years ago

For info

[2.5.0-DEVELOPMENT][admin@pfSense.lan]/root: netstat -g

IPv4 Virtual Interface Table Vif Thresh Local-Address Remote-Address Pkts-In Pkts-Out 0 240 0.248.255.255 453448649065698 67108859 2 188 0.248.255.255 1310722 18446741875912781824 3 125 0.254.255.255 18446735277684210752 18446744071578268000 5 184 0.254.255.255 18446741874691258560 18446741874691258544 7 212 255.255.255.255 0.0.0.0 207876617823 0

IPv4 Multicast Forwarding Table is empty

IPv6 Multicast Interface Table is empty

IPv6 Multicast Forwarding Table is empty

troglobit commented 3 years ago

OK, now there's relevant information in the tcpdump logfiles. Unfortunately my theory was wrong, the TTL is 4 ... :-/

As you can imagine, it's close to impossible to debug something as complex as this over email/github. So I set out to create a setup of my own, which looks something like this:

                   .--------.
                   | switch |
                   '--+--+--'
              .-------'  '-------.
              | 192.168.175.0/24 |
         .----+----.        .----+-----.
         | FreeBSD |        |  Alpine  |
         |   12.1  |        |   Linux  |
         '----+----'        '-----+----'
  10.0.0.0/24 |                   | 10.0.1.0/24 
         .----+----.         .----+----.
         |   PC0   |         |   PC1   |
         '---------'         '---------'

I use my own tool, mcjoin, to send and receive plain UDP multicast. I've tried sending multicast from PC0 --> PC1 and also switching the sender to verify that it works in the other direction as well, from PC1 --> PC0, and everything works fine. (Only had some initial problems with the Alpine Linux box, until i remembered the classic rp_filter setting in sysctl.)

Don't know what more I can do now to help you. Sorry.

LouisAtGH commented 3 years ago

Earlier today I raised the TTL from 1 (default) to 3 just to test, that did not change any thing. I just revert to 1.

If needed I can provide wireshark traces, if usefull Apart form pfSense I have other computers as well,

I do not have an opinion about the intended test setup, apart from the fact that it is 12.1 where pfSense and my development machine is on 12.2

Note that pfSense show [2.5.0-DEVELOPMENT][admin@pfSense.lan]/root: freebsd-version -kru 12.2-STABLE 12.2-STABLE 12.2-STABLE No Idea which release that is :)

My development machine is three times 12.2-RELEASE

troglobit commented 3 years ago

To route anything (unicast/multicast) the TTL must be >1, otherwise the kernel stack will not forward it. I was interested in seeing if the incoming multicast was eligible for routing or not, wanted to rule out some fault scenarios.

OK, I'll upgrade my FreeBSD machine to 12.2 and test again. That's a useful exercise anyway since I use if for other purposes as well. But if I still cannot reproduce, then it's not much I can do.

LouisAtGH commented 3 years ago

The TTL in the gui was three you see 4 so I think that 1 is two :) further on I agree

I noted something strange in the logging ..... <<<

  • pimd starting
  • "two" messages
  • pim exiting
  • many many messages following ..... hue it had been exiting !!??

Have a look also in the attach log system.zip

Nov 7 12:18:21 pfSense pimd[67371]: Failed adding MFC entry src 192.168.14.15 grp 239.255.255.250 from ix0.14 to lagg0.26, lagg0.16, ix1.116, register_vif0: Invalid argument Nov 7 12:18:21 pfSense pimd[67371]: Failed adding MFC entry src 192.168.14.15 grp 239.255.255.250 from ix0.14 to lagg0.26, lagg0.16, ix1.116, register_vif0: Invalid argument Nov 7 12:18:22 pfSense php-fpm[74587]: /index.php: Successful login for user 'admin' from: 192.168.116.33 (Local Database) Nov 7 12:18:29 pfSense pimd[67371]: Failed adding MFC entry src 192.168.1.3 grp 239.255.255.250 from lagg0.16 to lagg0.26, ix0.14, ix1.116, register_vif0: Invalid argument Nov 7 12:18:29 pfSense pimd[67371]: Failed adding MFC entry src 192.168.1.3 grp 239.255.255.250 from lagg0.16 to lagg0.26, ix0.14, ix1.116, register_vif0: Invalid argument Nov 7 12:18:30 pfSense pkg-static[68637]: pfSense-repo upgraded: 2.5.0.a.20201106.0650 -> 2.5.0.a.20201106.2150 Nov 7 12:18:34 pfSense pimd[67371]: Failed adding MFC entry src 192.168.1.3 grp 239.255.255.250 from lagg0.16 to lagg0.26, ix0.14, ix1.116, register_vif0: Invalid argument Nov 7 12:18:34 pfSense pimd[67371]: Failed adding MFC entry src 192.168.1.3 grp 239.255.255.250 from lagg0.16 to lagg0.26, ix0.14, ix1.116, register_vif0: Invalid argument Nov 7 12:18:38 pfSense check_reload_status[551]: Syncing firewall Nov 7 12:18:38 pfSense pimd[67371]: pimd version 3.0-beta1 exiting. Nov 7 12:18:38 pfSense pimd[67371]: Failed removing MFC entry src 192.168.1.3, grp 239.255.255.250: Invalid argument Nov 7 12:18:38 pfSense pimd[67371]: Failed removing MFC entry src 192.168.14.15, grp 239.255.255.250: Invalid argument Nov 7 12:18:41 pfSense pimd[86252]: pimd version 3.0-beta1 starting ... system.zip

troglobit commented 3 years ago

Have upgraded my FreeBSD machine to 12.2 now, and everything still works as expected. I'm suspecting the problems you see are related to the integration with pfSense, somehow. I have no clue as to what or why.

LouisAtGH commented 3 years ago

What do you think of my finding above?

I would like to know:

LouisAtGH commented 3 years ago

I will add debug statements to check why it is exiting

troglobit commented 3 years ago

The log message "pimd version 3.0-beta1 exiting" is not the last message, it is printed when pimd has received SIGTERM. So very normal shutdown sequence. It's likely pfSense that sends SIGTERM to restart pimd.

LouisAtGH commented 3 years ago

Perhaps I know, I did change the TTL, and I asume that pfSense is restarting after each configuration change :)

LouisAtGH commented 3 years ago

We had an other short discussion on max no of VIFS

--- just copied from the pfSense activity list ---- Hi Louis,

We applied the change on FreeBSD-src repository changing #define MAXVIFS 32 to 64. Did it break pimd? Xavier mentioned a new PR would be submitted for pimd part

9

Updated by Renato Botelho 12 days ago


Could that causing this ? Any how I will try to force max vif no to 64 and have a look what happens

LouisAtGH commented 3 years ago

I forced pimd to support 64 VIFS by changing configure.ac line 137/138 to 64. That works for git and ports.

However ..... that did not change the behavior / nor solve the problem

troglobit commented 3 years ago

That is very possible. I'd like to remind you of our discussion in #172 that increasing MAXVIFS to something >32 has only ever been tested on Linux.

Also, after trying a few times more, I realized I get a kernel crash with FreeBSD 12.2, which I never saw on 12.1:

Nov  7 18:40:24 beasdie kernel: Fatal trap 12: page fault while in kernel mode
Nov  7 18:40:24 beasdie kernel: cpuid = 0; apic id = 00
Nov  7 18:40:24 beasdie kernel: fault virtual address   = 0x0
Nov  7 18:40:24 beasdie kernel: fault code      = supervisor write data, page not present
Nov  7 18:40:24 beasdie kernel: instruction pointer = 0x20:0xffffffff80cba775
Nov  7 18:40:24 beasdie kernel: stack pointer           = 0x28:0xfffffe0000487700
Nov  7 18:40:24 beasdie kernel: frame pointer           = 0x28:0xfffffe0000487700
Nov  7 18:40:24 beasdie kernel: code segment        = base 0x0, limit 0xfffff, type 0x1b
Nov  7 18:40:24 beasdie kernel:             = DPL 0, pres 1, long 1, def32 0, gran 1
Nov  7 18:40:24 beasdie kernel: processor eflags    = interrupt enabled, resume, IOPL = 0
Nov  7 18:40:24 beasdie kernel: current process     = 12 (irq257: virtio_pci0)
Nov  7 18:40:24 beasdie kernel: trap number     = 12
Nov  7 18:40:24 beasdie kernel: panic: page fault
Nov  7 18:40:24 beasdie kernel: cpuid = 0
Nov  7 18:40:24 beasdie kernel: time = 1604770786
Nov  7 18:40:24 beasdie kernel: KDB: stack backtrace:
Nov  7 18:40:24 beasdie kernel: #0 0xffffffff80c0a8f5 at kdb_backtrace+0x65
Nov  7 18:40:24 beasdie kernel: #1 0xffffffff80bbeb1b at vpanic+0x17b
Nov  7 18:40:24 beasdie kernel: #2 0xffffffff80bbe993 at panic+0x43
Nov  7 18:40:24 beasdie kernel: #3 0xffffffff8108f911 at trap_fatal+0x391
Nov  7 18:40:24 beasdie kernel: #4 0xffffffff8108f96f at trap_pfault+0x4f
Nov  7 18:40:24 beasdie kernel: #5 0xffffffff8108efb6 at trap+0x286
Nov  7 18:40:24 beasdie kernel: #6 0xffffffff81066f38 at calltrap+0x8
Nov  7 18:40:24 beasdie kernel: #7 0xffffffff80cc6abb at if_simloop+0xdb
Nov  7 18:40:24 beasdie kernel: #8 0xffffffff8247f8e9 at pim_input+0x409
Nov  7 18:40:24 beasdie kernel: #9 0xffffffff80d461c1 at encap_input+0xd1
Nov  7 18:40:24 beasdie kernel: #10 0xffffffff80d460b8 at encap4_input+0x28
Nov  7 18:40:24 beasdie kernel: #11 0xffffffff80d48ab8 at ip_input+0x138
Nov  7 18:40:24 beasdie kernel: #12 0xffffffff80cdfbba at netisr_dispatch_src+0xca
Nov  7 18:40:24 beasdie kernel: #13 0xffffffff80cc2d98 at ether_demux+0x138
Nov  7 18:40:24 beasdie kernel: #14 0xffffffff80cc404b at ether_nh_input+0x33b
Nov  7 18:40:24 beasdie kernel: #15 0xffffffff80cdfbba at netisr_dispatch_src+0xca
Nov  7 18:40:24 beasdie kernel: #16 0xffffffff80cc31bb at ether_input+0x4b
Nov  7 18:40:24 beasdie kernel: #17 0xffffffff80a13a85 at vtnet_rxq_eof+0x7a5
Nov  7 18:40:24 beasdie kernel: Uptime: 40s
Nov  7 18:40:24 beasdie kernel: Dumping 83 out of 474 MB:..20%..39%..58%..77%..96%
Nov  7 18:40:24 beasdie kernel: Dump complete
Nov  7 18:40:24 beasdie kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Nov  7 18:40:24 beasdie kernel: Rebooting...

Can you explain this? Shouldn't 12.2 be a stable release? (Rhetoric, don't answer that.)

Counting only today, I've spent 10h of my spare time over this weekend working for free on your issue. I've gotten nowhere, and my home network is now in worse shape than before. I've actually tried to reinstall FreeBSD, using12.2, since I thought the freebsd-update tool was somehow to blame for the instabilities, to no avail.

My takeaway from all this is that I should not waste more of my precious spare time on platforms I have no personal interest in. Unless someone steps up to pay me, of course, but that seems very unlikely given the 10+ years I've spent on this project so far. So unfortunately for you, and the rest of the UNIX world. I'm closing this bug report and will not be accepting new reports from non-Linux operating systems unless they show, without a doubt, the problem they have is directly related to pimd.