maddingue / POE-Component-NetSNMP-agent

AgentX clients with NetSNMP::agent and POE
http://search.cpan.org/dist/POE-Component-NetSNMP-agent/

100% CPU when snmpd is restarted #1

Open zrusilla opened 12 years ago

zrusilla commented 12 years ago

This bug concerns two distinct problems in our usage of this module.

ev_agent_check registers the AgentX sockets with POE. If snmpd is restarted, the socket goes away, the select() call in POE::Loop::Select returns -1 with the error 'Bad file descriptor', and the loop begins to spin furiously, occupying 100% of the CPU. By the time ev_agent_check is called, it is already too late. I have not found a way to intervene and recover from the error.
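
The select() behaviour described above can be reproduced outside POE. The following is a minimal sketch, not POE's own code: it uses a pipe as a hypothetical stand-in for the AgentX socket, closes it after the bit vector has been built, and then calls select() the way a select-based loop would.

```perl
use strict;
use warnings;
use Errno qw(EBADF);

# Hypothetical stand-in for the AgentX socket: any descriptor will do.
pipe(my $reader, my $writer) or die "pipe: $!";
my $fd = fileno $reader;

# Build the read bit vector as a select()-based loop would.
my $rin = '';
vec($rin, $fd, 1) = 1;

# Simulate snmpd going away: the descriptor is closed while the
# bit vector still refers to it.
close $reader;

# select() now fails immediately instead of waiting for the timeout.
my $nfound   = select(my $rout = $rin, undef, undef, 0.1);
my $is_ebadf = ($nfound == -1 && $! == EBADF) ? 1 : 0;

print "select returned $nfound", ($is_ebadf ? " (EBADF)" : ""), "\n";
```

Because the call returns at once rather than blocking, a loop that simply retries with the same bit vector would be expected to consume a full CPU core.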

If I substitute POE::Loop::EV, the spinning problem does not occur but POE still has the old, closed socket, not the new one connected by NetSNMP::agent.

I have presently worked around the problem by overriding ev_agent_check in a subclass:

sub ev_agent_check {
    my ($kernel, $heap) = @_[ KERNEL, HEAP, ARG0 ];
    $heap->{agent}->main_loop;
}

and by ensuring this function is the handler for agent_check in POE::Component::NetSNMP::agent:

--- a/lib/POE/Component/NetSNMP/agent.pm
+++ b/lib/POE/Component/NetSNMP/agent.pm
@@ -48,6 +48,11 @@ sub spawn {
     # check arguments
     carp "warning: errback '$args{Errback}' doesn't look like a POE event"
         if $args{Errback} and $args{Errback} !~ /^\w+$/;
+        
+        
+        
+    # Bug workaround.
+    my $ev_agent_check = join '::', $class, 'ev_agent_check';

     # create the POE session
     my $session = $class->create(
@@ -62,7 +67,7 @@ sub spawn {
             _stop       => \&ev_stop,
             init        => \&ev_init,
             register    => \&ev_register,
-            agent_check => \&ev_agent_check,
+            agent_check => \&$ev_agent_check,

             tree_handler    => \&ev_tree_handler,
             add_oid_entry   => \&ev_add_oid_entry,

There does not appear to be an elegant way to know that NetSNMP::agent has reconnected, get those FDs, and register them with POE.
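
One way to frame the re-registration step is as a comparison between the descriptors the event loop currently watches and the descriptors the SNMP library reports after a check. The sketch below covers only that comparison. The helper name diff_fds and the descriptor numbers are hypothetical, and how the current descriptors are obtained from NetSNMP::agent is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper: compare the descriptors registered with the event
# loop ($old) against those the SNMP library now reports ($new).
# Returns two array refs: descriptors to unregister and to register.
sub diff_fds {
    my ($old, $new) = @_;
    my %old = map { $_ => 1 } @$old;
    my %new = map { $_ => 1 } @$new;

    my @stale = sort { $a <=> $b } grep { !$new{$_} } keys %old;
    my @fresh = sort { $a <=> $b } grep { !$old{$_} } keys %new;

    return (\@stale, \@fresh);
}

# Example: the loop was watching fd 5, the library now reports fd 7.
my ($stale, $fresh) = diff_fds([5], [7]);
print "unregister: @$stale; register: @$fresh\n";
```

A reconnect that reuses the same descriptor number would show no difference here, so this comparison alone may not be sufficient.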

Please advise.

zrusilla commented 12 years ago

Okay, when I said

$heap->{agent}->main_loop;

I meant

$kernel->delay(agent_check => $heap->{ping_delay});
$heap->{agent}->agent_check_and_process(0);

because if you call main_loop then control never goes back to POE, of course. D'oh. Carry on.

maddingue commented 11 years ago

Hello Elizabeth,

I've been working on this module for the past few days, needing it to write a program for $job.

I confirm that the FD handler in most POE loops begins to spin furiously when snmpd is stopped. But the situation gets back to normal as soon as snmpd is restarted. However, I've also found that the AgentX support seems very broken in recent versions of Net-SNMP. I'll run more tests and post my findings here.

Sébastien Aperghis-Tramoni

Close the world, txEn eht nepO.

zrusilla commented 11 years ago

Hi Sebastien,

Thanks for following up. Yes, restarting snmpd is the workaround I suggested at Hebex too, but some people don't like that suggestion.

NetSNMP::agent is very frustrating in that it gives the user no indication that a socket has disconnected and reconnected. The assumption is that the user does not need this information, which is not the case here. POE::Loop::Select is very frustrating in that it doesn't allow you to intervene if select() returns -1. Between the two of them, it's a mess.

Cheers,

Liz

maddingue commented 11 years ago

I had an idea last night that I'll try today: maybe one of the problems is that I give POE's kernel the file descriptor of the socket itself, instead of a copy (dup) of the file descriptor.

I reread Marc Lehmann's rant about all the other event frameworks in the AnyEvent documentation, and that's something he mentions. And indeed, the only POE loops that do not go crazy when the AgentX socket is closed are EV and AnyEvent, so somehow, Marc did something right.

Sébastien Aperghis-Tramoni

zrusilla commented 11 years ago

Hello Maddingue,

I also tried loading EV and I noticed the same thing: while it didn't solve the problem, it didn't go insane.

I was appalled when I read POE::Loop::Select::loop_do_timeslice. There is no way to specify a handler for a select() error? Really?? I couldn't believe it.

Part of the problem, too, is that by the time ev_agent_check is invoked, it's already too late: you're off to the races, spinning furiously.
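
For illustration, here is what a recovery step could look like if a select-based loop did expose a hook for the error. This is a sketch under that assumption, not something POE::Loop::Select provides. The function prune_dead_fds is hypothetical, and pipes stand in for the AgentX sockets. Each watched descriptor is probed on its own with a zero-timeout select(), and any descriptor that yields EBADF is dropped.

```perl
use strict;
use warnings;
use Errno qw(EBADF);

# Hypothetical recovery step: return the subset of @fds that is still
# open, probing each descriptor alone with a zero-timeout select().
sub prune_dead_fds {
    my @fds = @_;
    my @alive;
    for my $fd (@fds) {
        my $rin = '';
        vec($rin, $fd, 1) = 1;
        my $n = select(my $rout = $rin, undef, undef, 0);
        next if $n == -1 && $! == EBADF;    # descriptor no longer valid
        push @alive, $fd;
    }
    return @alive;
}

# Two pipes stand in for watched sockets.
pipe(my $r1, my $w1) or die "pipe: $!";
pipe(my $r2, my $w2) or die "pipe: $!";
my @watched = (fileno $r1, fileno $r2);

close $r2;    # simulate the AgentX socket disappearing

my @alive = prune_dead_fds(@watched);
print "still watching: @alive\n";
```

Pruning stops the spin but does not by itself pick up the replacement socket; that still requires learning the new descriptor from the SNMP library.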

I'm a fan of AnyEvent now. I wrote a project using it and Coro and it works like a charm without too much extra code clutter. Eric recently ported a program from POE to AE and is pleased with the results, too.

Keep me posted (pun intended),

Liz

maddingue commented 11 years ago

Zrusilla wrote:

Hello Maddingue,

Hello Elizabeth,

I also tried loading EV and I noticed the same thing: while it didn't solve the problem, it didn't go insane.

Hmm, what do you mean by "it didn't solve the problem"? In my tests, once you restart snmpd, the subagent always reconnects to the socket.

Note that, along with POE::Loop::AnyEvent and POE::Loop::EV, POE::XS::Loop::EPoll prevents this spinning problem. But it does not seem compatible with all versions of NetSNMP::agent.

I was appalled when I read POE::Loop::Select::loop_do_timeslice. There is no way to specify a handler for a select() error? Really?? I couldn't believe it.

Part of the problem, too, is that by the time ev_agent_check is invoked, it's already too late: you're off to the races, spinning furiously.

I know. At this level, the only thing we can do is to reduce the delay (default 10 sec) before calling agent_check so it can reconnect.

Also, I just tested, and it appears that dup-ing the file descriptor (i.e., changing line 183 from C< open my $fh, "+<&=", $fd; > to C< open my $fh, "+<&", $fd; >) makes things worse: even once the subagent has reconnected to the socket, the POE kernel spins because of the faulty file descriptor.
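
For reference, the two open modes differ as follows: "&=" wraps the existing descriptor (the equivalent of fdopen), while "&" creates a duplicate with a new descriptor number. A minimal sketch, using a pipe as a hypothetical stand-in for the AgentX socket and read-only modes rather than "+<":

```perl
use strict;
use warnings;

# Hypothetical stand-in for the AgentX socket.
pipe(my $reader, my $writer) or die "pipe: $!";
my $fd = fileno $reader;

# "&=" is the fdopen form: the new handle shares descriptor $fd.
open(my $alias, '<&=', $fd) or die "fdopen: $!";

# "&" is the dup form: the new handle gets its own descriptor.
open(my $copy, '<&', $fd) or die "dup: $!";

my $alias_fd = fileno $alias;
my $copy_fd  = fileno $copy;

print "original: $fd, alias: $alias_fd, copy: $copy_fd\n";
```

With the dup form, closing the original descriptor leaves the copy open, so an event loop holding the copy would keep watching a descriptor the SNMP library no longer uses.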

I'm a fan of AnyEvent now. I wrote a project using it and Coro and it works like a charm without too much extra code clutter. Eric recently ported a program from POE to AE and is pleased with the results, too.

The problem I have with AnyEvent is that it looks made for writing programs, but not modules.

Sébastien Aperghis-Tramoni

zrusilla commented 11 years ago

By "didn't solve the problem" I meant that while it didn't spin furiously, it did not solve the problem of supplying the correct FDs to the program, which is a separate problem. Perhaps the solution will be a combination of the two.

I am not sure what you mean by AnyEvent looks made for writing programs, not modules. Please elaborate.

Please keep me posted on what you find.

maddingue commented 11 years ago

Zrusilla wrote:

By "didn't solve the problem" I meant that while it didn't spin furiously, it did not solve the problem of supplying the correct FDs to the program, which is a separate problem. Perhaps the solution will be a combination of the two.

Do you mean that once snmpd has restarted, and the sub-agent has reconnected, the requests aren't passed over to the sub-agent? In my tests, once reconnected, everything works fine.

I am not sure what you mean by AnyEvent looks made for writing programs, not modules. Please elaborate.

In the sense that I don't see classes, modules and objects, but instead many of these condvars, which are not easy to understand and don't look obvious to modularize.

Sébastien Aperghis-Tramoni

maddingue commented 11 years ago

I just pushed a first shot at porting this POE component to AnyEvent to a new repository: https://github.com/maddingue/AnyEvent-NetSNMP-agent

Absolutely not tested as I don't even have NetSNMP::agent installed here.

Sébastien Aperghis-Tramoni

maddingue commented 11 years ago

Ironically, trying to make the code work with AnyEvent brings new kinds of bugs: https://github.com/maddingue/AnyEvent-NetSNMP-agent/issues/1