scrapli / scrapli_netconf

Fast and flexible Python 3.7+ netconf client specifically for network devices
https://scrapli.github.io/scrapli_netconf/
MIT License
93 stars 6 forks source link

JunOS VMX router NETCONF Traceback #10

Closed gwoodwa1 closed 4 years ago

gwoodwa1 commented 4 years ago

Hi,

I just tried to use scrapli netconf to connect to a JunOS router in my lab. I can exchange server capabilities but I get an XML error when doing any operations such as get_config()

lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

from scrapli_netconf.driver import NetconfScrape
import logging

logging.basicConfig(filename="scrapli.log", level=logging.INFO)
logger = logging.getLogger("scrapli")

my_device = {
    "host": "172.21.0.3",
    "auth_username": "mike",
    "auth_password": "root123",
    "auth_strict_key": False,
    "port": 830,
    "transport": "system"
}

conn = NetconfScrape(**my_device)
conn.open()
cap=conn.server_capabilities
print(cap)
result = conn.get_config()
print(cap)
print(result)
gwoodwa1 commented 4 years ago

Full traceback message below:

Traceback (most recent call last):
  File "scrapcli_test2.py", line 20, in <module>
    result = conn.get_config()
  File "/home/gary/genie/lib/python3.6/site-packages/scrapli_netconf/driver/driver.py", line 111, in get_config
    response._record_response(raw_response)  # pylint: disable=W0212
  File "/home/gary/genie/lib/python3.6/site-packages/scrapli_netconf/response.py", line 87, in _record_response
    self._record_response_netconf_1_0()
  File "/home/gary/genie/lib/python3.6/site-packages/scrapli_netconf/response.py", line 114, in _record_response_netconf_1_0
    b'<?xml version="1.0" encoding="UTF-8"?>', b""
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1784, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
carlmontanari commented 4 years ago

Hey @gwoodwa1 !

Thanks for opening this. Would you be able to snag the log as well? If you don't want to paste it here you can slack me on ntc slack or email me. I just tested against an MX running 20.1 code and was able to snag the config w/out issue, so I'm not quite sure where to start figuring this one out, but we'll figure it out and get ya sorted!

Carl

gwoodwa1 commented 4 years ago

scrapli.log

gwoodwa1 commented 4 years ago

Thank you Carl, really like the look of using this. It was my first time experimenting with it. I've attached the log file. Let me know if I can get any more info to help understand this. Also, I tested the same with the Juniper's PyEZ to make sure it would work and it does with that one. PyEZ equivalent is below for comparision.


from jnpr.junos import Device
from lxml import etree
import jxmlease

username='mike'
password='root123'

dev = Device(host='172.21.0.3', user=username, password=password, normalize=True)

dev.open()
rpc = dev.rpc.get_config()
rpc_xml = etree.tostring(rpc, pretty_print=True, encoding='unicode')
print(rpc_xml)
carlmontanari commented 4 years ago

Thanks @gwoodwa1 will take a peak hopefully this afternoon and keep ya posted!

carlmontanari commented 4 years ago

Hey @gwoodwa1

Any chance you could crank that logging up to DEBUG and post it back?

Whats happening (I think) is that we are maybe sending an extra return somewhere or the device is not waiting for returns and just shipping its response back to us(?) -- we normally read the inputs we send off the channel, and THEN send a return. This means we can always just keep reading until the command is done returning data to us and we know all of the output we read is the response to our input... in this case it looks like we actually send the get config rpc but when we read off the channel we are already getting the response from the device:

INFO:scrapli.channel-172.21.0.3:Read: b'\n<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="101"><get-config><source><running/></source></get-config></rpc>\n]]>]]><rpc-reply xmlns:junos="http://xml.juniper.net/junos/20.1R0/junos" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="101">

I tested manually against an mx on 20.x and it does not return data until the "enter" is sent after the rpc. I also test this regularly against a vsrx and haven't seen this behavior. So I'm hoping the debug logs will help point us in a direction to keep digging!

edit Oh - also could you confirm the versions of scrapli and scrapli netconf you have installed please? Also it may be worth a try w/ develop branch of scrapli netconf... made a few little changes in there recently -- dont think it will have an impact but cant hurt :)

Carl

gwoodwa1 commented 4 years ago

Thanks Carl, I've put the logging up to DEBUG and attached the file. Seems to be more in there. scrapli.log

I installed it today directly from github. Is there a way to get the version number? I've just installed the latest commit to be sure of the most up to date version. This has not changed the issue. Thanks

carlmontanari commented 4 years ago

Sorry I missed the question there! Yeah you can check the version number w/ pip list | grep scrapli (or python3 -m pip list | grep scrapli where "python3" is whatever you use to run stuff -- ex: python3.8 or python3.6) -- should let us know what scrapli and scrapli-netconf versions are installed. You can try to install develop branch like: pip install -e git+https://github.com/scrapli/scrapli_netconf.git@develop#egg=scrapli_netconf though I dont believe it'll make it work.

What version of junos is on that vmx, and any chance you have another Junos (or other) device handy to try it out with to see if anything new/different happens? I've found another MX w/ 18.4 and tried against that too for kicks with user auth and w/ key auth and it works as expected there too. I definitely see the problem from the logs but am just struggling to reproduce it so I can figure out what's going on :(

gwoodwa1 commented 4 years ago

Hi Carl, Thanks for checking this. I am using Junos: 20.1R1.11 VMX. I have access to other versions so I will try this with a newer version and an older one too. I also have some friends with access to physical Juniper equipment and VMXs so I'll try it there too and let you know the results.

Versions (below) scrapli (2020.9.26) scrapli-asyncssh (2020.7.4) scrapli-netconf (2020.9.23)

Thanks for your help on this. I think we'll soon get to the bottom of it once I've tried different versions of VMX. Might take me a day or two to come back with the results :)

carlmontanari commented 4 years ago

Awesome you rock sir!! Yeah, we'll figure it out one way or the other!

You could try it w/ asyncssh to see if that works any differently... In theory this should give us basically the same result... if that is the case then I'm kinda wondering if that vmx is behaving badly, if it works, then I think I have a difficult to track down bug somewhere in the system transport stuff for netconf (difficult just since I cant reproduce yet). Here is a little example that should work for ya:

import asyncio
import logging

import uvloop

from scrapli_netconf.driver import AsyncNetconfScrape

logging.basicConfig(filename="scrapli.log", level=logging.INFO)
logger = logging.getLogger("scrapli")

my_device = {
    "host": "172.21.0.3",
    "auth_username": "mike",
    "auth_password": "root123",
    "auth_strict_key": False,
    "port": 830,
    "transport": "asyncssh"
}

async def main():
    conn = AsyncNetconfScrape(**my_device)
    await conn.open()
    print(conn.server_capabilities)
    config_result = await conn.get_config()
    print(config_result.result)

uvloop.install()
asyncio.run(main())

Thanks again for sticking with me on this!

gwoodwa1 commented 4 years ago

Hey Carl, Thanks buddy..!! This has really grabbed my attention now. I've gone to 18.3R3.8 and used your asyncio script and it works beautifully so long as I change to TCP Port 22 instead of TCP Port 830. Now the interesting thing is I have TCP 830 configured for NETCONF in the JunOS config and there are no firewall filters. I'll do some more testing with the different versions using my old script and the new one you provided. Hopefully it will pinpoint a pattern, it might just be my setup at the end of the day. My friend has tried in on a physical SRX today and all works and he was also going to try it on his VMX later. In summary, I'll do some more testing against old script/new script and different JunOS VMX versions. At the very least it is working using asyncio and TCP Port 22. Thanks again, Gary

carlmontanari commented 4 years ago

Progress! :D

Would be interesting to see/hear if port 22 w/ system transport works ok... The fact that the asyncssh transport plugin works seems to me to indicate that I have a bug somewhere with the system transport, but the fact that it works with my test devices and your friends SRX as well maybe means that VMX is just behaving... strangely? (I hate "strange"... I def wanna know what's happening!!)

In any case, glad we are looking up -- thanks for all the troubleshooting!

Carl

gwoodwa1 commented 4 years ago

The first proper round of testing on Junos: 18.3R3.8 VMX using TCP Port 22 (TCP 830 non-working but that could be something in the device configuration) original script (i.e. non asyncio) - Capabilties works but get-config operation produces the same traceback error asyncio script - All working for both Capabilities and get-config

gwoodwa1 commented 4 years ago

On Junos: 20.3R1.8 VMX using TCP Port 22 (TCP 830 non-working but that could be something in the device configuration) original script (i.e. non asyncio) - Capabilties works but get-config operation produces the same traceback error asyncio script - All working for both Capabilities and get-config

gwoodwa1 commented 4 years ago

The behavior seems to be consistent across different versions of JunOS VMX. I just need to ask a friend if he's tried on a VMX too and what he sees happen. I am using a docker container for the Junos but again I can't see if that would be a factor here.

carlmontanari commented 4 years ago

Thanks for the details Gary -- trying to get my hands on VMX download so I can reproduce and troubleshoot... will keep you posted!

gwoodwa1 commented 4 years ago

Thanks Carl, I can also try and see if there is a way I can debug properly on the Juniper.

carlmontanari commented 4 years ago

Wow... this was a fun one, but I think I've tracked down what's going on... It is my understanding that an rpc gets sent to the device, and that nothing "happens" until a return character is sent (after sending the rpc). In my testing with Cisco IOSXE and IOSXR, and with Junos on vSRX (and real qfx/mx devices), and in some other folks testing with Nokia devices this has proven to either 1) be true, or 2) has never been a problem due to luck/cosmic rays/whatever.... but on the vmx (I did find an eval copy not behind juniper customer portal login thing) this is not the case!

You can probably test this out yourself pretty easily too:

SSH to the device as scrapli does:

ssh 172.18.0.11 -p 22 -l vrnetlab --s netconf

(with your user of course) -- you'll end up getting greeted with the server capabilities, then it is our turn to send capabilities:

<?xml version="1.0" encoding="utf-8"?>
    <hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
        <capabilities>
            <capability>urn:ietf:params:netconf:base:1.0</capability>
        </capabilities>
</hello>]]>]]>

Next we can request the get config rpc:

<?xml version='1.0' encoding='utf-8'?>
<rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="101"><get-config><source><running/></source></get-config></rpc>
]]>]]>

^ is how scrapli sends the rpc... however it seems that the MX thinks those line breaks means GOGOOGOGO and it starts sending the config straight away... this breaks scrapli basically because we expect to be able to read the input we sent to the device off the channel before we hit return (this also "drains" the inputs we write off the channel so we dont have to deal w/ stripping it out/parsing it/whatever).

If you send the above you should basically see the config returned before you can blink (at least w/ my vmx!)...

Try again, but this time without any return chars in the rpc:

<?xml version='1.0' encoding='utf-8'?><rpc xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="101"><get-config><source><running/></source></get-config></rpc>]]>]]>

Until you hit return nothing should happen... as it does on all the other boxes ive tested with! Hit return and bam, of course things work as expected!

So.... to confirm this I wonder if you could test out a very simple little tweak to the base_channel... to find that file to edit you can do:

python -c "from scrapli_netconf.channel import base_channel; print(base_channel.__file__)"
/SOMEPATH/scrapli_netconf/channel/base_channel.py

Line 47 should look like this right now:

return channel_input

Can you try replacing that line with:

return channel_input.replace("\n", "")

Then we can see if that sorts things out.... why this did not seem to cause any issues with asyncssh at this point I have zero idea... perhaps it is faster, or the backend channel reading/writing is different (its been too long for me to recall off the top of my head hah).

Let us know if that gets things looking happier!

Carl

gwoodwa1 commented 4 years ago

Wow, you've cracked it..!!! Totally makes sense in terms of the logic. I tried the SSH/NETCONF manual method and sure enough it will sit there until you hit the carriage return (as you explained it.) I then made the small tweak to _basechannel.py which has stopped the traceback. However, instead of the get_config result, it came back with Response I then made a another small tweak to use _print(conn.getconfig().result) .Now I can see it print the get_config output as per the asyncio script. All working now, thanks very much for your help. I was worried it was something weird with my VMX setup.

carlmontanari commented 4 years ago

🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 Awwwwwwww yeah!!!! Phew... that is great!! Thought I was going crazy there for a minute :)

I will need to test this out a bit more before getting it pushed up to develop and further... I'll leave this open until then so I can let ya know easily when things are happening. Hopefully that little patch up will hold ya over till then.

Oh, PS - the reason the async bits work is down to how the underlying asyncssh connection is handled where there is basically a stdin and stdout channel, so the output from the device getting sent "early" was never mixing with our command input and thus not causing us any issues!

Thanks a ton for helping work through this one Gary, this was a fun one :D

Carl

gwoodwa1 commented 4 years ago

Just to confirm the fully working script for non-asyncio:

from scrapli_netconf.driver import NetconfScrape
import logging

logging.basicConfig(filename="scrapli.log", level=logging.INFO)
logger = logging.getLogger("scrapli")

my_device = {
    "host": "172.18.0.2",
    "auth_username": "mike",
    "auth_password": "root123",
    "auth_strict_key": False,
    "port": 22,
    "transport": "system"
}

conn = NetconfScrape(**my_device)
conn.open()
cap=conn.server_capabilities
print(cap)
result = conn.get_config().result
print(result)
gwoodwa1 commented 4 years ago

I am so happy you nailed it Carl, I agree it was great fun getting to the bottom of it. I would like to thank you for being so helpful and supportive. Normally, I am fixing networks issues and this is not nearly as much joy as doing this.

carlmontanari commented 4 years ago

Perfect! Yep -- normally you'll probably want to get the result like:

result = conn.get_config()
print(result.result)

As there are other "goodies" in that result object that you may want to investigate -- stuff like start/end/elapsed time, channel input, xml output (as an lxml Element) -- there is also a handy failed attribute (result.failed) and a raise_for_status method that can raise an exception if the rpc failed!

No problemo Gary, was a good adventure :)!

carlmontanari commented 4 years ago

@gwoodwa1 if you get a chance, would you mind testing the current develop branch to see if it fixes this issue. You can install like so: pip install -e git+https://github.com/scrapli/scrapli_netconf.git@develop#egg=scrapli_netconf

hopefully we should be good to go with that little tune up -- testing against IOSXE/IOXSR/Junos seems to be all good with this!

gwoodwa1 commented 4 years ago

Hi Carl, Just tried it quickly and all working with the new branch on JunOS.

carlmontanari commented 4 years ago

Boom! Thanks Gary! I'm going to go ahead and close this one out now... I'm hoping to check a few more things before merging to master and making another release, so that may be another week or two, but hopefully develop will hold ya over for now!

Thanks again for the help with this!

Carl