sysml / clickos

The Click modular router: fast modular packet processing and analysis
http://www.read.cs.ucla.edu/click/

Unable to start the Mazu-NAT configuration in ClickOS #16

Closed stevepp closed 8 years ago

stevepp commented 8 years ago

Hi, I am testing the Mazu-NAT configuration in ClickOS. My Domain-0 is Ubuntu 14.04. The ClickOS VM is configured with two VIFs: eth0 and eth1. I have created two bridges xenbr0 and xenbr1 by editing the /etc/network/interfaces file in Domain-0. Basically, eth0 in ClickOS is connected to xenbr0, which further connects to the outside; eth1 is connected to xenbr1 which connects to a back-end VM on the same host.

After creating the ClickOS VM, I used cosmos to pass Mazu-NAT.click to it and set it running. However, the VM seems to stay in the blocked state. The console shows no Click router output, so I guess the router is not started at all.

Does anyone have experience getting Mazu-NAT working in ClickOS? Thanks.

fmanco commented 8 years ago

Hi

Can you post the output of xl console clickos?

stevepp commented 8 years ago

Hi, here is the output. There is no router-related output, so I think the router is not actually started. Have you successfully tested Mazu-NAT or another, simpler NAT in ClickOS?

root@Xen-Dom0:/etc/xen# xl console clickos
Xen Minimal OS!
start_info: 0x2ed000(VA)
nr_pages: 0x800
shared_inf: 0x83d42000(MA)
pt_base: 0x2f0000(VA)
nr_pt_frames: 0x5
mfn_list: 0x2e9000(VA)
mod_start: 0x0(VA)
mod_len: 0
flags: 0x0
cmd_line:
stack: 0x241f40-0x261f40
MM: Init
_text: 0x0(VA)
_etext: 0xecd96(VA)
_erodata: 0x16a000(VA)
_edata: 0x16bb98(VA)
stack start: 0x241f40(VA)
_end: 0x2e8010(VA)
start_pfn: 2f8
max_pfn: 800
Mapping memory range 0x400000 - 0x800000
setting 0x0-0x16a000 readonly
skipped 0x1000
MM: Initialise page allocator for 2fa000(2fa000)-800000(800000)
MM: done
Demand map pfns at 801000-2000801000.
Heap resides at 2000802000-4000802000.
Initialising timer interface
Initialising console ... done.
gnttab_table mapped at 0x801000.
Initialising scheduler
Thread "Idle": pointer: 0x2000802050, stack: 0x310000
Thread "xenstore": pointer: 0x2000802800, stack: 0x320000
xenbus initialised on irq 1 mfn 0x41c5af
Thread "shutdown": pointer: 0x2000802fb0, stack: 0x330000
Dummy main: start_info=0x261f40
Thread "main": pointer: 0x2000803760, stack: 0x340000
sparsing 0MB at 17e000
"main"

fmanco commented 8 years ago

Indeed the configuration is not being started. Please tell me the sequence of commands you're using and the output of xenstore-ls after starting the configuration.

Thanks

stevepp commented 8 years ago

I have started ponger.click successfully following the instructions from the ClickOS website. For Mazu-nat, I have done the following steps.

First, I created the ClickOS VM: xl create mazu-nat.cfg. Here is the mazu-nat.cfg file:

name   = 'clickos'
kernel = '~/clickos/minios/build/clickos_x86_64'
vcpus  = '1'
# pinning your VCPU helps performance
#cpus   = '3'
memory = '8'
vif    = ['mac=00:15:17:15:5d:74,bridge=xenbr0', 'mac=00:15:17:15:5d:80,bridge=xenbr1']
on_poweroff = 'destroy'
on_reboot   = 'restart'
on_crash    = 'preserve'
click       = 'mazu-nat.click'

Then I ran the following commands:

DOMID=$(xl list | grep click0 | awk -F' ' '{ print $2 }')
xenstore-write /local/domain/$DOMID/clickos/0/config/0 "$(cat mazu-nat.click)"
xenstore-write /local/domain/$DOMID/clickos/0/status "Running"
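One thing worth ruling out with a lookup like the one above: grep matches any line containing the pattern, so DOMID can end up empty or hold more than one ID if the pattern does not line up with the domain name. A minimal sketch of an exact-match lookup follows. It assumes the usual `xl list` layout (a header line, then name in column 1 and ID in column 2), and the function name `domid_of` is hypothetical.

```shell
#!/bin/sh
# domid_of NAME: read `xl list` output on stdin and print the ID of the
# domain whose name is exactly NAME. Prints nothing if no domain matches.
# Assumes the usual `xl list` layout: header line, name in column 1,
# ID in column 2.
domid_of() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $2 }'
}

# Intended use on a Xen host (not run here):
#   DOMID=$(xl list | domid_of clickos)
#   [ -n "$DOMID" ] || { echo "domain not found" >&2; exit 1; }
```

Checking that DOMID is non-empty before calling xenstore-write avoids writing to a malformed path. `xl domid clickos` should give the same answer without any parsing.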

Here is the output of xenstore-ls.

vm = "/vm/415da112-373b-48bc-b387-8bee048b3964"
name = "clickosproxy"
cpu = ""
 0 = ""
  availability = "online"
memory = ""
 static-max = "8192"
 target = "8193"
 videoram = "-1"
device = ""
 suspend = ""
  event-channel = ""
 vif = ""
  0 = ""
   backend = "/local/domain/0/backend/vif/4/0"
   backend-id = "0"
   state = "1"
   handle = "0"
   mac = "00:15:17:15:5d:74"
  1 = ""
   backend = "/local/domain/0/backend/vif/4/1"
   backend-id = "0"
   state = "1"
   handle = "1"
   mac = "00:15:17:15:5d:80"
control = ""
 shutdown = ""
 platform-feature-multiprocessor-suspend = "1"
 platform-feature-xs_reset_watches = "1"
data = ""
domid = "4"
store = ""
 port = "1"
 ring-ref = "4302255"
console = ""
 backend = "/local/domain/0/backend/console/4/0"
 backend-id = "0"
 limit = "1048576"
 type = "xenconsoled"
 output = "pty"
 tty = "/dev/pts/11"
 port = "2"
 ring-ref = "4302254"
clickos = ""
 0 = ""
  config = ""
   0 = "//mazu-nat.click\n\n// ADDRESS INFORMATION\n\nAddressInfo(\n intern \t10.0.0.2\t10.0.0.0/8\t00:15:17:15..."
  status = "Running"

fmanco commented 8 years ago

Is mazu-nat.click bigger than 4KB? This xenstore-write method of starting a configuration only works for small configurations. If the configuration is too big, it needs to be split into multiple entries (config/0, config/1, etc.) because xenstore limits the size of each entry. Cosmos (https://github.com/cnplab/cosmos) can do that for you.
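To get a feel for how a configuration would be spread over config/0, config/1, and so on, here is a dry-run sketch that only prints the xenstore paths a file of a given size would need. It writes nothing to xenstore and is not how Cosmos does it. The function name `plan_chunks` and the 2048-byte default chunk size are assumptions, picked only to stay below the limit mentioned above.

```shell
#!/bin/sh
# plan_chunks FILE DOMID [CHUNK_BYTES]
# Print one xenstore path per chunk needed to hold FILE.
# Dry run only: nothing is written to xenstore.
# CHUNK_BYTES defaults to 2048, an assumed value chosen to stay below the
# entry size limit mentioned in the thread.
plan_chunks() {
    file=$1
    domid=$2
    chunk=${3:-2048}
    # Some wc implementations pad the count with spaces, so strip them.
    size=$(wc -c < "$file" | tr -d ' ')
    # Round up so a partial last chunk still gets its own entry.
    count=$(( (size + chunk - 1) / chunk ))
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "/local/domain/$domid/clickos/0/config/$i"
        i=$((i + 1))
    done
}
```

Actually writing the chunks from a shell needs care: command substitution such as "$(...)" strips trailing newlines, so a chunk that ends on a line break would be altered. That is one reason to prefer a dedicated tool for large configurations.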

stevepp commented 8 years ago

I have deleted all the comments and the file size is 3.3K which is smaller than the size limit of xenstore-write. I will try to use cosmos and see what happens. Thanks.

stevepp commented 8 years ago

I still hit the same problem when using cosmos. Does ClickOS support multiple NICs when doing NAT? I noticed a similar thread about deploying a firewall configuration with two NICs, but the solution proposed there does not work either.

fmanco commented 8 years ago

I don't think the problem has anything to do with the actual configuration, otherwise you would be seeing some errors when click tries to instantiate the router. I'm not seeing any output at all, which tells me the process of instantiating a configuration inside ClickOS is not even starting.

Would you please point me to the actual configuration, so that I could make some tests?

To clarify, using the same image you can start a ponger configuration but not mazu nat, correct?

stevepp commented 8 years ago

Yes, I use the same kernel image that works when starting ponger, and I use mazu-nat directly without any changes. Please let me know the results after you give it a try. Thanks for all your comments.

fmanco commented 8 years ago

Does starting the configuration kill your xenstore? On some of my test systems xenstore gets stuck waiting for an evtchn and doesn't trigger the watch, so ClickOS never parses the configuration. On others it looks fine.

Can't explain it yet, and don't have a lot of free time atm. We'll take a look as soon as possible.

stevepp commented 8 years ago

Okay. I will take a look at the implementation details. The main logic is located in the click.cc file.

fmanco commented 8 years ago

Yes, main logic in minios/click.cc and to/fromdevice in elements/minios/.

Thanks for looking into this.

stevepp commented 8 years ago

One more thing, is there any configuration consisting of multiple NICs that you have tested successfully?

fmanco commented 8 years ago

Yes, we've successfully tested multiple configurations involving 2 or more VIFs.

stevepp commented 8 years ago

Interesting. Do you mind letting me know what configurations you tested or whether mazu-nat can be started?

fmanco commented 8 years ago

If you want to test 2 VIFs you can try something as simple as FromDevice(0) -> ToDevice(1), this should work without a problem. We tested more complex stuff with more than 2 VIFs, but I can't give the actual configurations.
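For anyone who wants to try the two-VIF test, here is a small sketch that writes that one-line configuration to a file so it can be passed to ClickOS in the same way as ponger.click. The function name `write_two_vif_config` and the file name in the usage comment are assumptions; the Click line itself is the one quoted above.

```shell
#!/bin/sh
# write_two_vif_config OUT: write a minimal two-VIF forwarding
# configuration to the file OUT.
write_two_vif_config() {
    cat > "$1" <<'EOF'
// Forward everything arriving on VIF 0 out of VIF 1.
FromDevice(0) -> ToDevice(1);
EOF
}

# Example (file name is an assumption):
#   write_two_vif_config two-vif.click
```

If this starts but mazu-nat does not, the VIF setup itself is probably fine and the difference lies in the configuration or in how it is loaded.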

As for mazu-nat, I got to the point where ClickOS complains about errors in the configuration, such as wrong device names (in ClickOS devices don't have names, so something like eth0 is invalid). I didn't actually fix these errors, but once you get to this point they should be easy to fix.

fmanco commented 8 years ago

Please paste here your (xen) configuration file. Also the output of xenstore-ls after starting the configuration.

stevepp commented 8 years ago

I made a mistake in the Xen domain configuration file; that is resolved now. I still have a problem starting the mazu-nat configuration, though: the router thread is not even created. The strange thing is that the trivial configuration FromDevice(0) -> ToDevice(1) with two interfaces works fine.

stevepp commented 8 years ago

Hi Filipe,

How did you create and configure the two bridges, xenbr0 and xenbr1, in Domain 0? Since I am on a Wi-Fi connection, I just created two host-only networks by editing the /etc/network/interfaces file in Ubuntu. Here is the configuration:

# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback

auto xenbr0
iface xenbr0 inet manual
    pre-up brctl addbr xenbr0
    up ip link set xenbr0 up
    post-down brctl delbr xenbr0
    down ip link set xenbr0 down

auto xenbr1
iface xenbr1 inet manual
    pre-up brctl addbr xenbr1
    up ip link set xenbr1 up
    post-down brctl delbr xenbr1
    down ip link set xenbr1 down

When the host boots, xenbr0 and xenbr1 are assigned different MAC addresses. However, when I start the ClickOS configuration, both MAC addresses change to "fe:ff:ff:ff:ff:ff". I am not sure whether this is what causes the problem.

fmanco commented 8 years ago

Not sure what your problem is; the bridges seem fine. As for the MAC address, ClickOS (or Click for that matter) doesn't have a MAC address. You're handling raw packets, on which you can set whatever MAC address suits your use case. The ClickOS instance might not be addressable at all.

stevepp commented 8 years ago

Just an update on the debugging. When I tried to write the mazu-nat.click file into the xenstore entry as follows:

xenstore-write /local/domain/$DOMID/clickos/0/config/0 "$(cat mazu-nat.click)"

I can see from the ClickOS console output that the xenbus_read function in the main loop of click.cc gets stuck waiting. It is really strange that xenbus_read gets stuck for mazu-nat but works correctly for ponger.click.

fmanco commented 8 years ago

Indeed it's weird. I see one of two problems here:

I would test with a different configuration with similar size, or with increasingly big configurations. You don't need to use valid configurations (of course click will then fail but the purpose here is to get to that stage). Let me know...
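A quick way to produce such test inputs is to generate filler files of increasing size and load them one at a time. This is only a sketch: the sizes, the output directory, and the function name `make_filler` are arbitrary assumptions, and the files are deliberately not valid Click configurations.

```shell
#!/bin/sh
# make_filler SIZE OUT: write exactly SIZE bytes of 'x' characters to OUT.
# The result is deliberately not a valid Click configuration; it only
# exercises the path that delivers a configuration of a given size.
make_filler() {
    dd if=/dev/zero bs=1 count="$1" 2>/dev/null | tr '\000' 'x' > "$2"
}

# Generate a series of files of increasing size.
# The sizes and the output directory are arbitrary choices.
outdir=${OUTDIR:-./size-test}
mkdir -p "$outdir"
for size in 1024 2048 3072 4096 5120; do
    make_filler "$size" "$outdir/filler-$size.click"
done
```

If the console shows a Click parse error for the small files but nothing at all beyond some size, that points at how the configuration is delivered rather than at its content.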

fmanco commented 8 years ago

FYI, we just released ClickOS Control (https://github.com/cnplab/clickos-ctl) to manage the routers in a ClickOS instance. It's still not complete, but it should be able to create, start, stop and destroy routers. Please give it a try; it might solve your issues.

stevepp commented 8 years ago

Thanks. I'll give it a try and post the results later.

stevepp commented 8 years ago

Finally, I got mazu-nat.click running without errors. However, the configuration does not work properly. In this configuration, all connection requests from the external world (ssh, http, ftp, etc.) should be redirected to the internal server. I tested ssh and http, and the connections could not be established. Running tcpdump on the internal server, I can see a TCP SYN packet arriving for port 22 (ssh), but sshd does not reply at all, so the three-way TCP handshake never completes.

I am not sure whether ClickOS modified the packet in some way that made the internal server refuse the connection request. It would be great if you could give me some advice on debugging this.

fmanco commented 8 years ago

Hi

Great you got it working. I'm curious what was the problem (and the solution), can you explain?

To debug this I would start by doing two captures, with and without the nat in between and analyse the differences between the SYN packets in each of the captures. Also just double check there isn't any dumb thing like wrong port numbers, IPs, etc. on the SYN packet.
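The two captures could be taken along these lines. This is a sketch under assumptions: the interface name eth0, the capture file names, and the helper name `syn_filter` are all made up for illustration, and tcpdump needs root.

```shell
#!/bin/sh
# syn_filter PORT: print a pcap filter expression that matches TCP packets
# to or from PORT with the SYN flag set.
syn_filter() {
    echo "tcp port $1 and tcp[tcpflags] & tcp-syn != 0"
}

# Intended use on the internal server (needs root; not run here).
# The interface name eth0 and the file names are assumptions.
#   tcpdump -i eth0 -nn -c 5 -w with-nat.pcap "$(syn_filter 22)"
#   tcpdump -i eth0 -nn -c 5 -w without-nat.pcap "$(syn_filter 22)"
# Then decode both captures verbosely, with a hex dump, and compare them:
#   tcpdump -nn -vv -X -r with-nat.pcap
#   tcpdump -nn -vv -X -r without-nat.pcap
```

Comparing the verbose decodes side by side makes it easier to spot differences beyond the translated addresses and ports, for example in TCP options or header fields.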

Let me know how it goes.

stevepp commented 8 years ago

Actually, I just recompiled cosmos and used it instead of xenstore-write to pass the configuration into ClickOS. Then it worked.

I didn't find any difference between the TCP SYN packets except for the source and destination IP addresses and ports, which change because of the NAT. I also printed the packet contents from the ClickOS console, and I can see that the IP addresses and ports are translated correctly. Still, the internal server sees the translated packet but does not respond at all.

stevepp commented 8 years ago

Just a quick question. Have you ever tested running real-world servers (e.g. OpenSSH, Apache httpd, an FTP server) behind a NAT box (like mazu-nat)?

fmanco commented 8 years ago

No we didn't. However this doesn't sound like a ClickOS problem. If the packet is being properly translated and you can't find differences (besides the expected ones) between the two cases, it needs to be something on the server.