traceroute: crash with ncclient timed out while waiting for an rpc reply

sincerywaing commented 7 years ago

Description of Issue/Question

When doing a traceroute, the program crashes with "ncclient timed out while waiting for an rpc reply". Also if no-resolve can be added to the command, the efficiency could be greatly improved.

Did you follow the steps from https://github.com/napalm-automation/napalm#faq

[ *] Yes [ ] No

Setup

napalm-junos version

(Paste verbatim output from pip freeze | grep napalm-junos between quotes below)

0.11.0

JunOS version

(Paste verbatim output from show version and haiku between quotes below)

JUNOS Software Release [12.3X48-D40.5]

Steps to Reproduce the Issue

run a traceroute -

dev.open()
dev.traceroute(destination = '5.6.7.8', source = '1.2.3.4', vrf='test', timeout = '1')

Error Traceback

(Paste the complete traceback of the exception between quotes below)

---------------------------------------------------------------------------
TimeoutExpiredError                       Traceback (most recent call last)
<ipython-input-8-134650314069> in <module>()
----> 1 c = dev.traceroute(destination = '5.6.7.8', source = '1.2.3.4', vrf='test', timeout = '1')

/usr/local/lib/python2.7/site-packages/napalm_junos/junos.pyc in traceroute(self, destination, source, ttl, timeout, vrf)
   1453 
   1454         traceroute_rpc = E('command', traceroute_command)
-> 1455         rpc_reply = self.device._conn.rpc(traceroute_rpc)._NCElement__doc
   1456         # make direct RPC call via NETCONF
   1457         traceroute_results = rpc_reply.find('.//traceroute-results')

/usr/local/lib/python2.7/site-packages/ncclient/manager.pyc in wrapper(self, *args, **kwds)
    170         def make_wrapper(op_cls):
    171             def wrapper(self, *args, **kwds):
--> 172                 return self.execute(op_cls, *args, **kwds)
    173             wrapper.__doc__ = op_cls.request.__doc__
    174             return wrapper

/usr/local/lib/python2.7/site-packages/ncclient/manager.pyc in execute(self, cls, *args, **kwds)
    230                    async=self._async_mode,
    231                    timeout=self._timeout,
--> 232                    raise_mode=self._raise_mode).request(*args, **kwds)
    233 
    234     def locked(self, target):

/usr/local/lib/python2.7/site-packages/ncclient/operations/third_party/juniper/rpc.pyc in request(self, rpc)
     42         if isinstance(rpc, str):
     43             rpc = to_ele(rpc)
---> 44         return self._request(rpc)
     45 
     46 class Command(RPC):

/usr/local/lib/python2.7/site-packages/ncclient/operations/rpc.pyc in _request(self, op)
    341                     return self._reply
    342             else:
--> 343                 raise TimeoutExpiredError('ncclient timed out while waiting for an rpc reply.')
    344 
    345     def request(self):

TimeoutExpiredError: ncclient timed out while waiting for an rpc reply.

mirceaulinic commented 7 years ago

Hi @sincerywaing - this looks very familiar to me. Please let me know: are you running this on a MX80? Is 12.3X48-D40.5 the real version you're working with?

sincerywaing commented 7 years ago

@mirceaulinic yes, the version is correct and I'm using srx240.

mirceaulinic commented 7 years ago

> Also if no-resolve can be added to the command, the efficiency could be greatly improved.

Are you saying that if you do no-resolve you are not seeing the same problems?

I suggest you increase the timeout optional arg to, say 120, and see what happens.
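While experimenting with larger timeouts, a small wrapper can keep the calling script from crashing when the RPC reply does not arrive in time. This is only a sketch: `traceroute_or_error` and `FakeDevice` are hypothetical names, and the exception class is passed in by the caller (with a real device it would presumably be ncclient's `TimeoutExpiredError`, the class shown in the traceback).

```python
def traceroute_or_error(dev, timeout_exc, **traceroute_kwargs):
    """Call dev.traceroute() and turn an RPC timeout into an error dict.

    dev         -- any object exposing a NAPALM-style traceroute() method
    timeout_exc -- exception class (or tuple of classes) raised on timeout
    """
    try:
        return dev.traceroute(**traceroute_kwargs)
    except timeout_exc as err:
        # Return a dict with an 'error' key instead of propagating,
        # so the caller can inspect the failure without a crash.
        return {'error': 'traceroute RPC timed out: {}'.format(err)}


class FakeDevice(object):
    """Stand-in for a real driver; it always times out (assumption for demo)."""

    def traceroute(self, **kwargs):
        raise RuntimeError('ncclient timed out while waiting for an rpc reply.')


result = traceroute_or_error(
    FakeDevice(), RuntimeError, destination='5.6.7.8', timeout=120
)
print(result)
```

With a real driver, `dev` would be the opened device and `timeout_exc` the ncclient exception; the fake device is there only so the sketch runs without a router.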

sincerywaing commented 7 years ago

@mirceaulinic adding no-resolve is just a guess, 'cause this would make things much quicker. Regardless of this issue, I'd suggest we add this as an option. I'll try timeout and get back to you.

mirceaulinic commented 7 years ago

> adding no-resolve is just a guess,

Okay, there was a bug on MX80 routers (apparently, they don't use the same traceroute binary as MX240, MX480 or MX960) -- not sure it's the same problem here, but the symptom sounds pretty similar:

On MX80 series platform, if executing traceroutes by NETCONF which destination is unresponsive, the processes related will run forever and need to be killed from the shell, otherwise the CPU consumption might go to 100%.

This has been solved in: 14.2R8, 15.1X53-D60, 15.1R5, 15.1F7, 16.1R2, 16.2R1.

Would you have other physical platforms to compare and check if that's the case you're facing, or it just takes very long to respond?

sincerywaing commented 7 years ago

@mirceaulinic I'll try srx550. meanwhile can you help evaluate the possibility to add no-resolve as kw?

sincerywaing commented 7 years ago

Also, to add to your comment: I'm tracing a destination that is responsive... that's why it's weird.

mirceaulinic commented 7 years ago

> meanwhile can you help evaluate the possibility to add no-resolve as kw?

At the moment it's going to be very painful to add this kwarg. But feel free to define it as an additional optional_arg.

sincerywaing commented 7 years ago

@mirceaulinic confirmed it works fine on srx550. I'll submit a PR to add no-resolve as an optional_arg. Meanwhile, I tried timeout=120 and it seems to work fine. Thanks @mirceaulinic !

dbarrosop commented 7 years ago

As commented in the PR, I don't think we should fix it with an optional_arg. We should either hardcode it to no-resolve or have an argument on the function itself.

From the looks of it, it's not going to solve this issue anyway.
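To make the "argument on the function itself" option concrete, here is a minimal sketch of how a command string could be assembled with an explicit flag. It is not the driver's actual code: `build_traceroute_command` and its `no_resolve` keyword are hypothetical, and the option keywords (`source`, `ttl`, `wait`, `routing-instance`, `no-resolve`) reflect my understanding of the Junos CLI rather than anything shown in this thread.

```python
def build_traceroute_command(destination, source='', ttl=None, timeout=None,
                             vrf='', no_resolve=False):
    """Assemble a Junos-style traceroute CLI command string.

    no_resolve is a hypothetical keyword; it is not part of the
    traceroute() signature visible in the traceback above.
    """
    parts = ['traceroute', destination]
    if source:
        parts += ['source', source]
    if ttl:
        parts += ['ttl', str(ttl)]
    if timeout:
        parts += ['wait', str(timeout)]
    if vrf:
        parts += ['routing-instance', vrf]
    if no_resolve:
        # Only appended on request, so the default behaviour is unchanged.
        parts.append('no-resolve')
    return ' '.join(parts)


cmd = build_traceroute_command(
    '5.6.7.8', source='1.2.3.4', vrf='test', timeout=1, no_resolve=True
)
print(cmd)
```

Keeping the flag off by default means existing callers see the same command as before, which is the main difference from hardcoding it.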

mirceaulinic commented 7 years ago

> We should either hardcode it to no-resolve

Please check the output first:

In case of success, the keys of the dictionary represent the hop ID, while values are dictionaries containing the probes results:

rtt (float), ip_address (str), host_name (str)

If we hardcode for everyone with no-resolve, ip_address and host_name will be the same for everyone, by default. Bad idea.
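A rough illustration of the concern, assuming the result is nested under `success` and `probes` keys as I understand the documented format. All addresses, names and RTT values below are made up, and `names_add_information` is a hypothetical helper.

```python
# Illustrative data only: what a result might look like with name resolution.
resolved = {
    'success': {
        1: {'probes': {
            1: {'rtt': 1.2, 'ip_address': '192.0.2.1',
                'host_name': 'gw.example.net'},
        }},
    }
}

# Same hop, assuming no-resolve makes host_name fall back to the IP address.
unresolved = {
    'success': {
        1: {'probes': {
            1: {'rtt': 1.2, 'ip_address': '192.0.2.1',
                'host_name': '192.0.2.1'},
        }},
    }
}


def names_add_information(result):
    """Return True if any probe's host_name differs from its ip_address."""
    for hop in result.get('success', {}).values():
        for probe in hop.get('probes', {}).values():
            if probe['host_name'] != probe['ip_address']:
                return True
    return False


print(names_add_information(resolved))
print(names_add_information(unresolved))
```

If no-resolve were hardcoded, every caller would get the second shape, and the host_name field would carry nothing beyond ip_address.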

sincerywaing commented 7 years ago

closing this out per discussion in pr mentioned.
