hellt / vrnetlab

Make VM-based Network OSes run in Containerlab
https://containerlab.dev
MIT License

Replace telnetlib with scrapli #279

Open · kaelemc opened this 2 weeks ago

kaelemc commented 2 weeks ago

This PR replaces the usage of telnetlib in vrnetlab with scrapli.

Changes

Functions

read_until()

telnetlib had a read_until() function which was used by wait_write(). This has been re-implemented under the VR class.

Args: the string to match on, and an optional timeout (mirroring telnetlib's read_until()).

Returns: all console output that was read up until the match, or until the function timed out.
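For illustration, a rough sketch of how such a re-implementation could look, assuming the scrapli connection is stored on self.tn the way the old telnetlib handle was (names and details are illustrative, not necessarily the PR's exact code):

```python
import time

# Sketch of a method on the VR class.
def read_until(self, match: str, timeout: float = None) -> str:
    """Read from the scrapli channel until `match` appears or `timeout` expires."""
    output = ""
    deadline = time.time() + timeout if timeout else None
    while match not in output:
        if deadline and time.time() > deadline:
            break
        # channel.read() returns the raw bytes available on the console
        output += self.tn.channel.read().decode(errors="ignore")
    return output
```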

wait_write()

Adapted to scrapli; functionality should be analogous to the telnetlib version.
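A sketch of the adapted version, under the same assumptions as above (illustrative, not necessarily the PR's exact code):

```python
# Sketch of a method on the VR class.
def wait_write(self, cmd: str, wait: str = "#") -> None:
    """Wait for `wait` to show up on the console, then write `cmd`."""
    if wait:
        self.read_until(wait)
    # channel.write() sends the string over the underlying telnet channel
    self.tn.channel.write(cmd + "\r")
```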

expect()

telnetlib had an expect() function which was used in the launch.py entrypoint. This new adapted version in the VR class should behave the same as the telnetlib one.

Args: a list of regular expressions to match against, and an optional timeout (mirroring telnetlib's expect()).

Returns: a tuple of: the index of the matched pattern, the match object, and the output read so far (index -1 and no match object on timeout, as in telnetlib).
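Since it mirrors telnetlib's expect(), the shape is roughly the following (a sketch, not necessarily the PR's exact code):

```python
import re
import time

# Sketch of a method on the VR class.
def expect(self, patterns: list, timeout: float = None):
    """Wait until one of `patterns` matches the console output.

    Returns (index, match_object, output), or (-1, None, output) on
    timeout, mirroring telnetlib.Telnet.expect().
    """
    regexes = [re.compile(p) for p in patterns]
    output = ""
    deadline = time.time() + timeout if timeout else None
    while True:
        output += self.tn.channel.read().decode(errors="ignore")
        for i, rx in enumerate(regexes):
            m = rx.search(output)
            if m:
                return i, m, output
        if deadline and time.time() > deadline:
            return -1, None, output
```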

print()

The VR class now has a print() function which simply writes to stdout and flushes immediately. It's used so that the telnet console output renders nicely in docker logs.

If the console output were printed via the logger, the formatting would make it difficult to interpret.

Args: the text to write to stdout.
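i.e., essentially (sketch):

```python
import sys

# Sketch of a method on the VR class.
def print(self, text: str) -> None:
    """Write console output straight to stdout and flush immediately,
    so `docker logs` shows it without logger prefixes or buffering."""
    sys.stdout.write(text)
    sys.stdout.flush()
```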

Usage

Relevant changes to make a node work:
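At its core, this boils down to swapping the telnetlib.Telnet connection for a scrapli Driver running over telnet; a minimal sketch (port and option values are illustrative):

```python
from scrapli.driver import Driver

# Connect to the qemu serial console exposed on a local telnet port.
# auth_bypass skips scrapli's own login handling, since vrnetlab
# drives the login prompts itself via wait_write()/expect().
tn = Driver(
    host="127.0.0.1",
    port=5000,           # illustrative console port
    transport="telnet",
    auth_bypass=True,
)
tn.open()
```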

Other

I don't expect the other nodes to be 'plug and play' and work perfectly from the get-go; I'm sure I have made errors or overlooked something, so some work is required on the other nodes to ensure functionality.

Collaboration from others is required so we can confirm all nodes work reliably (as well as to fix other possible issues in the way I've implemented the changes). I'm open to feedback 😊.

Confirmed working platforms

Subdirectory names are in brackets. The list follows the alphabetical/subfolder order of the directory tree.

kaelemc commented 2 weeks ago

It might be worth making another scrapli branch and then I can change this PR to merge into that branch (instead of master).

This way other contributors can submit PRs for any changes to other nodes, and once all/enough nodes work, that branch can be merged into master?

hellt commented 2 weeks ago

Sound idea @kaelemc, I have created the scrapli-dev branch

kaelemc commented 2 weeks ago

For the IOS-XE nodes (CSR1000V, Cat8kv and Cat9kv), the scrapli IOS-XE driver (IOSXEDriver) is now being used. I had some issues getting it to work with IOS-XE 16.x, but it was an error on my end; got the configs (both bootstrap and startup) to work nicely.
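For reference, pointing scrapli's platform driver at the console looks roughly like this (a sketch, not the PR's exact code):

```python
from scrapli.driver.core import IOSXEDriver

conn = IOSXEDriver(
    host="127.0.0.1",
    port=5000,            # illustrative console port
    transport="telnet",
    auth_bypass=True,     # vrnetlab handles the login prompts itself
    timeout_ops=600,      # generous ceiling while the VM boots
)
conn.open()
conn.send_config("hostname csr-1")  # bootstrap/startup config lines
```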

kaelemc commented 2 weeks ago

Maybe this belongs as its own PR, but I've added an xrv_qemu directory. It is the same as xrv but with some modifications to make XRv work when the user provides a qcow2 image.

The existing xrv directory requires vmdk images, but most users who import images from CML or elsewhere on the web will most likely get a qcow2; this makes things easier for users, so they don't have to mess around with qemu-img. (Even the vmdk I have was converted from a qcow2 via qemu-img.)

I tried one more time to use the scrapli IOSXRDriver with XRv, but it still seems unreliable, and I get some weird behaviour from the XRv (I don't think this is scrapli's fault).

jbemmel commented 2 weeks ago

Not a trivial change, but I would suggest creating one base Dockerfile for all of vrnetlab, and then deriving all platform images from that.

It doesn't make sense to do 100x "install scrapli" in all those separate files; that becomes unmanageable.

hellt commented 2 weeks ago

yeah we can definitely create the base image hosted on ghcr.io with the base packages that every vr system relies on

kaelemc commented 2 weeks ago

@jbemmel Excellent idea, certainly is/was unmanageable for me

kaelemc commented 1 week ago

So this is a difficult one to balance: in labs of different sizes the nodes take varying times to boot (depending on system specs etc.). I'm wondering if we should bump all the scrapli timeouts to something very large to make connection time-outs a non-issue?

Currently I have the XE devices on 10 minutes; I have a feeling people with lower-spec systems might hit time-outs, meaning they could never boot the device. Should we increase it to something very large just to be safe?

The base scrapli Driver (the underlying telnet connection used by the wait_write, expect and read_until functions) uses a timeout of 1 hour.

hellt commented 1 week ago

I agree, let's have a long enough timeout so that you don't have to guess the time down to minutes.

We should leave the timeout configurable via an env var setting that can then be put in the topology file if needed to tune it.
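Something along these lines, with a hypothetical env var name (a sketch):

```python
import os

from scrapli.driver import Driver

# Hypothetical env var name; defaults to one hour so slower hosts
# never hit the ceiling while the VM boots.
timeout = int(os.getenv("SCRAPLI_TIMEOUT", "3600"))

tn = Driver(
    host="127.0.0.1",
    port=5000,               # illustrative console port
    transport="telnet",
    auth_bypass=True,
    timeout_ops=timeout,
    timeout_transport=timeout,
)
```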

kaelemc commented 1 week ago

Ok 👍, having timeouts as an env var is a good idea

kaelemc commented 1 week ago

I had to remove the NETCONF configuration on the CSR, as different versions of IOS-XE (16.x vs 17.x) behave differently when this command is entered in the config.

In 16.x a prompt appears asking for confirmation about the NETCONF configuration. I figured it's best to just remove it instead of adding extra complexity to handle the prompt for that single command.

Thoughts?

kaelemc commented 1 week ago

SROS seems to be working fine with minimal changes.

I just had to comment out the defaulting of clean_buffer=True in wait_write() (since this is no longer supported). If I run into issues I'll implement the functionality back.

Also removed the env var printing from the SROS side; this is done as part of vrnetlab now.


kaelemc commented 4 days ago

One other thing I noticed and haven't really been able to crack (granted, I haven't invested too much time into it).

With telnetlib it seemed like we got the raw bytes out of the console: when booting images you'd see the menus and console output as if you were directly consoled into the node, and on some nodes the console output was even coloured.

This is not the case with scrapli. The boot menu is not displayed like it was with telnetlib, and on nodes like XRv9k we've lost the coloured output.

I don't think this is a case of ANSI codes not being recognised by the logger or my terminal; I'm not sure exactly what is going on there.

This seems to be a purely cosmetic thing. In my view it is pretty much irrelevant, but I thought I'd put the information out there.

hellt commented 4 days ago

could be that scrapli strips ANSI chars by default. Maybe @carlmontanari knows/remembers if there is a way to have a clear channel when doing telnet-like bits

kaelemc commented 4 days ago

@hellt If you have any spare time, could you kindly get a 'master' image up and running on ghcr? I think it'd be good to continue further development using the image.

I'll handle the work of updating all the Dockerfiles :).

I think you can pretty much copy what the SROS Dockerfile installs, and add genisoimage since it looks like the Cisco and Juniper images are using that.

Some images have OVS installed as well, although it doesn't seem to be used.

kaelemc commented 4 days ago

Of course, install pip and scrapli too! (Installing via the git source is preferred for now, just like the current Dockerfiles.) :)

hellt commented 4 days ago

@kaelemc please try ghcr.io/srl-labs/vrnetlab-base:0.0.1. I think it has all the deps required for all the images. The Dockerfile for this one was added in 13b3cc6

carlmontanari commented 3 days ago

there is no way to not strip ANSI using "normal" scrapli stuff, but you could easily replace the channel.read method with your own that does not strip, if you were so inclined. But you will break all the "normal" scrapli behavior if there is ANSI stuff (because it would cause failed matches on prompts/inputs and such). I don't think anyone has ever not wanted to strip it or asked about it, so this is a first :P

kaelemc commented 3 days ago

@carlmontanari Thanks for replying. Yeah, I guess the usage here in vrnetlab is a little unorthodox 😄 but that's fine. Personally I don't think it's too much of an issue anyway; certainly not worth it if we lose behaviour from scrapli.

Would you be able to cut a new release of scrapli? https://github.com/carlmontanari/scrapli/pull/355 fixes an issue I was running into on a lot of platforms, but the last release was from July. Ideally, of course, we'd want to pin the containers to a specific release :).

@hellt Thanks, the base image is working great. If you could also install the genisoimage pkg, that'd be great; a few of the Cisco and Juniper platforms rely on it.

hellt commented 3 days ago

@kaelemc added genisoimage and reuploaded 0.0.1 image

kaelemc commented 2 days ago

@hellt Thanks. One more thing: could you change scrapli to install from the latest changes via:

pip install git+https://github.com/carlmontanari/scrapli --break-system-packages

Otherwise everything is broken and won't work (there seems to be a random EOF sent at some point from either qemu or the node itself, and scrapli will think the connection hasn't been opened correctly).

If/when a new release comes out with the fixes, we can pin the version to that release :)

carlmontanari commented 1 day ago

@kaelemc 2024.7.30.post1 just pushed to pypi 🫡

hellt commented 1 day ago

thanks Carl. @kaelemc I have reuploaded the base image with the new version

kaelemc commented 1 day ago

@carlmontanari @hellt Thanks!

kaelemc commented 1 day ago

@hellt The new image (& scrapli release) are working great; I've moved the Cisco nodes over so far and tested all of them. Install time is much better without having to re-install all the packages (even better when everything is cached locally, of course 😄).

I have the Dockerfile changes ready for all the other nodes but I want to test as much as I can first to see if any key packages are missing, any incompatibilities etc.

In terms of Dockerfiles, since the MAINTAINER instruction is now deprecated, should I migrate those over to labels or remove them entirely? It seems some of these were copied over from the original vrnetlab, so those users may no longer be active or contributing to hellt/vrnetlab. I'm also seeing inconsistencies in how the maintainer labels are handled (LABEL maintainer vs LABEL org.opencontainers.image.authors), and only a few Dockerfiles have these labels anyway.

And in the recent commits I've just pushed, you'll see that on XRv9k I've switched over to pulling vcpu and ram from the env vars directly instead of via the flags/argparser. I figured if we want vrnetlab to be the 'source of truth' for those things (discussed in #2285), there's no need for the clab end to contain logic to pass vcpu/ram via the flags. Do you agree with this approach, or am I clouding this PR with too much other stuff? What are your thoughts?

I've also implemented it this way for the cat8kv (for some reason there were no vcpu/ram knobs implemented).

hellt commented 1 day ago

you can remove the maintainer label/field altogether. It is not up to date anyways.

Yes, correct, the defaults for the cpu/mem should stay within the vrnetlab node definition, and containerlab can override them via the QEMU_* env vars

kaelemc commented 1 day ago

> you can remove the maintainer label/field altogether. It is not up to date anyways.

👍

> Yes, correct, the defaults for the cpu/mem should stay within the vrnetlab node definition, and containerlab can override them via the QEMU_* env vars

Alright, currently I have it set to use env vars called VCPU and RAM, as that's what I've generally always used (it's what containerlab is using; it's also in the xrv9k docs in the 'resource requirements' admonition).

I didn't know about QEMU_SMP/QEMU_MEMORY, but I can easily change over if you think that's the preferred naming?

hellt commented 1 day ago

it is not only the preferred naming, but it is also generically loaded in the vrnetlab common code

The reason there are VCPU/MEM env vars dangling in the code base is pure legacy. We should converge on the global QEMU_* env vars, as they are read once by every node, without duplication
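i.e., roughly (a sketch; the defaults shown are illustrative):

```python
import os

# Per-node defaults live in the vrnetlab node definition; containerlab
# can override them by setting QEMU_SMP / QEMU_MEMORY in the topology file.
smp = os.getenv("QEMU_SMP", "1")            # vCPU count or full -smp string
memory = os.getenv("QEMU_MEMORY", "4096")   # MB
```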

kaelemc commented 1 day ago

Oh perfect, got it. Will make the changes 👍

kaelemc commented 9 hours ago

I will convert as many of the nodes as possible to use the scrapli community drivers.

I've added a checklist in the original PR comment of nodes I think are 'working' for sure.

Every node is now using the new base vrnetlab image; I'll have to iron out any missing packages or incompatibilities there.

All Cisco nodes have default SMP/memory, and I have tested that these are in fact overridden correctly via the respective env vars, as pointed out. I had to make a minor change for XRv9k, as without an explicit single-socket config it did nothing but kernel panic.

This means that, like on other nodes, XRv9k users can just set QEMU_SMP to the number of vCPUs they want instead of QEMU_SMP: cores=x,threads=1,sockets=1. But if they want to supply it in that form, they still can.
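Something like the following sketch (defaults illustrative):

```python
import os

smp = os.getenv("QEMU_SMP", "2")
# XRv9k kernel-panics without an explicit single-socket topology, so a
# bare vCPU count is expanded; a full "cores=x,threads=1,sockets=1"
# style string is passed through untouched.
if "=" not in smp:
    smp = f"cores={smp},threads=1,sockets=1"
```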