troglobit / pimd

PIM-SM/SSM multicast routing for UNIX and Linux
http://troglobit.com/projects/pimd/
BSD 3-Clause "New" or "Revised" License
194 stars 86 forks source link

PIMD causes random kernel panics: needs to validate interfaces before attempted use #218

Open MrPeteH opened 2 years ago

MrPeteH commented 2 years ago

My pfSense started reliably crashing (random but always within 1-2 hr of boot) with kernel panics pointing to pimd.

After much sleuthing, I discovered the root cause:

However, pimd doesn't ignore the invalid interfaces. It apparently attempts to use them or at least do something with them.

Most of the time, this doesn't cause an issue (I did run for about a year like this!) But sometimes it does... leading to a kernel panic.

I think I have backup copies of various config files if necessary.

(Note: I am still eventually hoping to track down #171 )

troglobit commented 2 years ago

I'm sorry to hear that, but kernel crashes are not really the responsibility of userland applications. The kernel should handle such issues. You have the option of course of disabling pimd on select interface (or disable all, and then selectively enable), which could be a way to work around this problem.

MrPeteH commented 2 years ago

Hmmm. That makes sense. (BTW, I do disable all and selectively enable. It's a bad *.conf file containing references to invalid interfaces due to having deleted some VLANs. Thus, pimd isn't validating interface references in the conf file? )

Yet, here's the strange thing I noted: this is a kernel panic in the "pim" area of the kernel.

Do I understand correctly that pimd is modifying various parameters of the built-in "pim" support?

On further reflection, presumably:

Does that make sense? Or am I simply out to lunch on this? ;)

troglobit commented 2 years ago

Yes correct. You are spot on. Regardless what little old pimd does, it shouldn't crash the kernel.

MrPeteH commented 2 years ago

So I will report this up-chain 😏

MrPeteH commented 2 years ago

On further reflection... it's TRUE that userland "shouldn't" crash the kernel. At the same time, defense-in-depth design suggests that all layers are well served by handling things well. Can't be much overhead to validate interfaces at startup time and remove (or complain about) any that are invalid.

troglobit commented 2 years ago

Of course. And at startup pimd (master) does this by calling getifaddrs(), comparing that with the interfaces enabled (from command line and) in pimd.conf. There is of course a certain window where interfaces may go missing between this initial probe and the milliseconds in between pimd actuall registers VIFs with the kernel to be used for multicast routing. At runtime, however, there is no check if interfaces suddenly go disappear. Here it really is up to the kernel to clean up any VIFs registered with phyiscal interfaces, and return EIO or similar for any outstanding recvfrom(), setsockopt(), or similar that have a socket open on any of these interfaces.

MrPeteH commented 2 years ago

In my case, we're not dealing with dynamic change. Six months ago I removed some VLANs, and thus the associated interfaces. For whatever reason, now having those in pimd.conf leads to kernel panics within seconds to hours of startup.

I hear you saying you already validate the interfaces. Should an invalid interface in pimd.conf...

Something strange certainly happens.

troglobit commented 2 years ago

You know, I'm not really sure what it is you expect I should do for you here in this bug report. You've not declared how you start pimd, or what your pimd.conf runs. I don't know what version of pimd you are using, or what patches pfSense may have applied to integrate it into FreeBSD. I don't use FreeBSD myself, so for the pimd project that's like a second or third tier platform.

There's lots of error handling in pimd, the code is here on GitHub for you to inspect and learn. Check out config.c, vif.c and kern.c and you can see the error handling and log levels used for each such message.

MrPeteH commented 2 years ago

I understand. Time for me to put on my coder/diagnosis hat and dig in. ;)