unias / docklet

OS for Virtual Private Cloud
https://unias.github.io/docklet/
BSD 3-Clause "New" or "Revised" License

VLAN & VxLAN #69

Closed leebaok closed 8 years ago

leebaok commented 8 years ago

https://www.arista.com/assets/data/pdf/Whitepapers/Arista_Design_Guide_DCI_with_VXLAN.pdf

A good intro to vxlan in the datacenter.

jklj077 commented 8 years ago

report for duty, REN Xuancheng

jklj077 commented 8 years ago

The following is about reusing vlan ids. Currently, 128 vlan ids are used as shared vlan ids, and the code suggests these ids are reserved for "system use". Is it necessary to reuse these vlan ids?

It seems the system doesn't record the mapping from vlan id to user(s), whether the vlan id is shared or not. For vlan ids that are not shared, this is not a problem at all. However, if we would like to reuse the shared vlan ids, we would at least need to record how many times each of those ids has been acquired.

jklj077 commented 8 years ago

The following is about vxlan. This post may be helpful: http://www.cnblogs.com/sammyliu/p/4627230.html

And just a thought: is it possible to use gre tunnels to "isolate" subnets, since gre tunnels do have tunnel ids?

leebaok commented 8 years ago

@jklj077

In our current source code, whether a user's vlan id is shared depends on the user's group (see line 68 of src/vclustermgr.py).

For shared vlan ids, we do not record how many times they are used, so we do not reuse them for now. (Maybe we will reuse them later...)

So, you just need to reuse the vlan ids that are not shared.

It is simple, right? Go ahead!

leebaok commented 8 years ago

Using gre tunnels with ids to isolate subnets:

Does "gre tunnels with ids" mean vxlan?

So you mean that every user has their own switch on every physical node? Yeah, that is one way!

I don't know which is better: one switch per host for all users, or one switch per user on each host.

You can try it and then we can test the performance.

jklj077 commented 8 years ago

If so, reusing vlan ids is rather simple. I have coded it, and it needs more testing.

I think that in the current scenario, gre tunnels with tunnel ids may be equivalent to vxlans with vnis. I will investigate more.

ps: the @ doesn't seem to work.

leebaok commented 8 years ago

@jklj077

Yes, reusing unshared vlan ids is simple. Waiting for your work!

ps: the @ in my browser works well.

jklj077 commented 8 years ago

I have tried one switch per user with 2 hosts, each hosting 2 containers. It did work!

However, the number of network interfaces increases. One vs (virtual switch) per user:

Bridge user1-vs
    Interface user1-vs
    Interface user1-gre or vxlan-user1
    Interface user1-container1

Bridge user2-vs
    Interface user2-vs
    Interface user2-gre or vxlan-user2
    Interface user2-container1

compared to one vs per host:

Bridge docklet-br
    Interface docklet-br
    Interface worker1-gre
    Interface user1-container1
    Interface user2-container1

The former isolates user networks completely, which may allow users to define their own networks. I don't have a clear view of how the performance compares. How can we test it?
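For a rough comparison, something like the sketch below might work: run an iperf3 server in one container and a client in another, once under each bridge layout, and compare the reported throughput. This is only a sketch; it assumes iperf3 is installed inside the containers, and the container names and server ip are placeholders, not docklet defaults.

# Hypothetical throughput check between two containers via lxc-attach.
# Container names, the server ip, and the presence of iperf3 are assumptions.
import subprocess

SERVER_CONTAINER = "user1-container1"   # container that runs the iperf3 server
CLIENT_CONTAINER = "user1-container2"   # container that runs the iperf3 client
SERVER_IP = "192.168.1.2"               # ip of the server container

def attach(container, *cmd):
    """Run a command inside an LXC container and return its stdout."""
    result = subprocess.run(["lxc-attach", "-n", container, "--", *cmd],
                            capture_output=True, text=True, check=True)
    return result.stdout

attach(SERVER_CONTAINER, "iperf3", "-s", "-D")                  # start server as daemon
print(attach(CLIENT_CONTAINER, "iperf3", "-c", SERVER_IP, "-t", "10"))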

For "the @ doesn't seem to work", I meant I didn't get notified.

leebaok commented 8 years ago

Great work!

I think I limited myself to a narrow assumption before, namely one switch per host. But it is not necessary.

One switch per user, combined with vxlan, leads to a much simpler solution for network isolation.

[figure: vxlan]

( picture from http://www.cnblogs.com/sammyliu/p/4627230.html, the blog you mentioned above )

Now that I have opened my mind, I think mapping a vlan tag to a vxlan id is easy. Vlan works at layer 2, while vxlan encapsulates layer-2 frames in layer-3 (UDP) packets, so they operate on different parts of the stack. We just need to set the vlan tag on the container port and on the vxlan port, and the vlan tag is then mapped to the vxlan id automatically.

[figure: vxlan-2]

( picture from http://www.ainoniwa.net/pelican/2014/0509a.html)

In the picture above, we set vlan-tag=10 on eth1 and on vxlan0 on the left host.

So, if we set the vlan tags correctly, the mapping is automatic.
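For the record, here is a minimal sketch of that kind of setup, driving ovs-vsctl from python. The bridge/port names, the remote ip, and the vni below are placeholders, not the docklet configuration, and the tunnel key (vni) is set explicitly here rather than derived from the tag.

# Hypothetical ovs setup mirroring the picture: the container-facing port and
# the vxlan port carry the same vlan tag, and the tunnel key (vni) is set on
# the vxlan interface. All names, addresses, and ids are placeholders.
import subprocess

def ovs(*args):
    subprocess.run(["ovs-vsctl", *args], check=True)

BR, TAG, VNI, REMOTE_IP = "br0", "10", "10", "192.168.0.2"

ovs("--may-exist", "add-br", BR)
ovs("add-port", BR, "eth1", "tag=" + TAG)            # access port for vlan 10
ovs("add-port", BR, "vxlan0", "tag=" + TAG,          # vxlan port, same tag
    "--", "set", "interface", "vxlan0", "type=vxlan",
    "options:remote_ip=" + REMOTE_IP, "options:key=" + VNI)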

(The notification email sometimes works and sometimes doesn't. Who knows!)

jklj077 commented 8 years ago

Nice and Neat solution!

The network setup from http://www.ainoniwa.net/pelican/2014/0509a.html is enlightening.

There are some more details to be discussed, though.

Maybe we can discuss them in person. However, May 3rd, which is our next scheduled discussion, is a school holiday.

leebaok commented 8 years ago

My last comment was about my thinking on and understanding of vlan, gre, and vxlan and the relationships among them. It is not the solution for the docklet network stack.

So we need to talk further about the docklet network, and I agree that discussing it in person is a good way.

Did you mean May 4th, our next scheduled discussion? I am OK with that time.

jklj077 commented 8 years ago

Sorry for the misunderstanding and for the wrong date. Yes, I meant May 4th. If the time is okay, I think we'll meet as usual then.

leebaok commented 8 years ago

@jklj077

Hi, my laboratory is in Room 1616, No. 1 Science Building. Could you please come here for our discussion?

Thank you.

jklj077 commented 8 years ago

No problem.

jklj077 commented 8 years ago

With the aforesaid design, several major changes are needed:

userManager

ovscontrol

netcontrol

UserPool

NetworkManager

Worker

VClusterMgr

leebaok commented 8 years ago

@jklj077

Wow, that's great. Go ahead!

liwyNo commented 8 years ago

report for duty, Wangyang Li

jklj077 commented 8 years ago

Instead of making radical changes to the current network design, a mild extension may be implemented.

The major problem here is that the number of vlan networks is limited to 4094, which is inadequate for the docklet system; we need a relatively large number of networks. Vxlan is one choice; however, the number of vxlan networks is very large, and efficient management of their dynamic allocation is a hard task in itself.

Since vlan is an L2 technique, we can only isolate vlan networks with switches. That is, we can set up two switches, s1 and s2, each with its own subdomain, so that the vlan tagged 1 on s1 will not be on the same network as the vlan tagged 1 on s2, as long as s1 and s2 are not connected by any link.

A simple illustration may look like this:

br1
    interface br1
        type=internal
    interface c1
        tag=1
br2
    interface br2
        type=internal
    interface c2
        tag=1

In that configuration, we cannot ping from c1 to c2 or from c2 to c1, even though they are, of course, in the same ip subnet.

(The aforesaid configuration is only meant to show the isolation. In practice, there will be gateways so that every device can ping every other device, e.g.:

br1
    interface br1
        type=internal
    interface g1
        type=internal tag=1
    interface c1
        tag=1
br2
    interface br2
        type=internal
    interface g2
        type=internal tag=1
    interface c2
        tag=1

c1 can ping c2: c1 -> g1 -> g2 -> c2)

In a more general form with 2 hosts: host1 (where all the gateways are):

br1
    interface br1
        type=internal
    interface g1
        type=internal tag=1
    interface c1
        tag=1
    interface gre1
        type=gre key=1 remote_ip=host2ip
br2
    interface br2
        type=internal
    interface g2
        type=internal tag=1
    interface c2
        tag=1
    interface gre2
        type=gre key=2 remote_ip=host2ip

host2

br1
    interface br1
        type=internal
    interface c1
        tag=1
    interface gre1
        type=gre key=1 remote_ip=host1ip
br2
    interface br2
        type=internal
    interface c2
        tag=1
    interface gre2
        type=gre key=2 remote_ip=host1ip

Let's consider the following connections:

ping from host1:c1 to host1:c2: host1:c1 -> host1:g1 -> host1:g2 -> host1:c2
ping from host1:c1 to host2:c1: host1:c1 -> host1:gre1 -> host2:gre1 -> host2:c1
ping from host1:c1 to host2:c2: host1:c1 -> host1:g1 -> host1:g2 -> host1:gre2 -> host2:gre2 -> host2:c2

(In fact, gre1 and gre2 are fake gre tunnels created by ovs; at the system level, only one gre tunnel exists.)
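For reference, a minimal sketch of the host1 side of this layout, driving ovs-vsctl from python. The bridge, gateway, and gre port names plus the remote ip follow the illustration above; the container ports c1/c2 would be attached separately when the containers start. This is an illustration, not docklet code.

# Hypothetical recreation of the host1 configuration above via ovs-vsctl.
# Names and the remote ip follow the illustration; adjust for a real setup.
import subprocess

def ovs(*args):
    subprocess.run(["ovs-vsctl", *args], check=True)

HOST2_IP = "10.0.0.2"   # placeholder for host2ip

for n in (1, 2):
    br, gw, gre = "br%d" % n, "g%d" % n, "gre%d" % n
    ovs("--may-exist", "add-br", br)
    # per-bridge gateway port, tagged 1 like the container ports
    ovs("add-port", br, gw, "tag=1",
        "--", "set", "interface", gw, "type=internal")
    # gre port keyed per bridge, so br1 and br2 traffic stays separated
    ovs("add-port", br, gre,
        "--", "set", "interface", gre,
        "type=gre", "options:key=%d" % n, "options:remote_ip=" + HOST2_IP)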

With the above inspection, we propose an extension based on the original network design.

leebaok commented 8 years ago

Yeah, this is a smoother way to improve the current docklet.

Thanks !

jklj077 commented 8 years ago

First revised version following the extension: commit b5118a1. (I don't know how to link to my repo...)

jklj077 commented 8 years ago

Work to be done:

extract the netid functions into a new class; maybe rewrite how netids are stored in etcd to allow automatic enlargement if netids are used up

isolate the system bridge from user bridges, store bridge information in etcd, and set up user bridges on demand

let the administrator decide how many virtual networks can link to a virtual switch

jklj077 commented 8 years ago

Done:

net id management in a separate class, with the max id increased on demand: commit 80b3a16de8d76a9b0d2351dcb9385abf65f08ad4

virtual switch management in a separate class, set up on demand: commit 448562e8053710eb6949a88145315bfdff64b961

None of the above work has been tested yet.
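Not the code from those commits, just a rough sketch of the "max id increase on demand" idea with an in-memory pool; the real class keeps its state in etcd.

# Hypothetical sketch of a net id pool whose upper bound grows on demand.
# The actual implementation stores its state in etcd; this in-memory version
# only illustrates the allocation policy.
class NetIdPool:
    def __init__(self, initial_max=256, step=256):
        self.max_id = initial_max
        self.step = step
        self.free = list(range(1, initial_max + 1))
        self.used = set()

    def acquire(self):
        if not self.free:                      # pool exhausted: enlarge it
            new_max = self.max_id + self.step
            self.free = list(range(self.max_id + 1, new_max + 1))
            self.max_id = new_max
        netid = self.free.pop(0)
        self.used.add(netid)
        return netid

    def release(self, netid):
        if netid in self.used:                 # recycle only ids we handed out
            self.used.remove(netid)
            self.free.append(netid)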

liwyNo commented 8 years ago

@leebaok Could you confirm whether the current master enters the workspace normally? When I tested it myself, I could create a virtual cluster, but entering it produced the following error. I did not modify the master and still hit this problem. [screenshot: 4352838804629556]

  1. The code changes are finished, but because of the bug above they are not fully tested. The current results: viewed from the server side, all the network ports, bridges, and gateways are normal, and I can manually lxc-attach into a container and it tests fine (but because of that bug, I could not test successfully from the web side).
  2. Changes:
     i. Changed the vlanid allocation scheme so that every newly added user gets a unique vlanid; it increases monotonically and is never recycled. (It is still named vlanid, but it is really just a number; its actual range is far larger than vlan ids, even larger than the number of ips. Since the name is used heavily throughout the code and a direct rename might break things, I kept the name as is.)
     ii. I used the same idea as @jklj077: one bridge holds multiple users, each user has their own tag, and the corresponding bridges are linked by vxlan tunnels. But I have not created a large number of users to test with, so I am not sure the bridges behave correctly at that scale. (However, the earlier one-bridge-per-user version I implemented did create bridges correctly in my tests, and the current code is based on it, so it should be fine.)
     iii. In theory, the maximum number of users is VNET_COUNT * VXLAN_KEY_COUNT. The first parameter is a manually set limit on how many users one bridge can hold; I saw @jklj077 use a similar parameter with a default of 4094, so I chose the same value. The second parameter is the number of keys vxlan allows, theoretically 2^24, so the number of users we can hold is far greater than the number of available ips... (see the sketch after this list)
     iv. A worker's tunnel is checked and created in the container startup script, so a worker is guaranteed to have a tunnel pointing to the master. The master's tunnels are handled in add_user: it checks whether the bridge for this user has already been created; if not, it creates the bridge and opens tunnels to all currently running workers. (This may cause bugs: first, if a new worker joins, the old bridges have no tunnel to the new worker; second, if a user is created before a worker starts, the situation just described can occur.)
     v. Ip allocation is unchanged. The total number of ips seems rather small now, so allocation problems may appear when there are too many users??? (Because, as said above, far more users are allowed than there are ips.)
     vi. Due to limited resources, I have not done distributed testing.
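A rough sketch of how such a monotonically increasing user number could map onto a (vxlan key, vlan tag) pair under those two parameters; this is only an illustration, not necessarily the exact formula in the branch.

# Hypothetical mapping from a monotonically increasing user number to a
# (bridge / vxlan key, vlan tag) pair; not necessarily the branch's exact code.
VNET_COUNT = 4094          # users per virtual switch (vlan tags 2..4095)
VXLAN_KEY_COUNT = 2 ** 24  # theoretical number of vxlan keys (24-bit vni)

def place_user(user_number):
    """Return (vxlan_key, vlan_tag) for the user's bridge and port tag."""
    if user_number >= VNET_COUNT * VXLAN_KEY_COUNT:
        raise ValueError("user number exceeds theoretical capacity")
    vxlan_key = user_number // VNET_COUNT + 1   # which bridge / tunnel key
    vlan_tag = user_number % VNET_COUNT + 2     # tag within that bridge
    return vxlan_key, vlan_tag

# e.g. user 0 -> (key 1, tag 2); user 4094 -> (key 2, tag 2)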

Branch: https://github.com/HawkALee/docklet/tree/Vxlan_simple

Should I just open a pull request, or do something else? There may still be latent bugs, because there is no way for me to test it fully.

jklj077 commented 8 years ago

Commit e57be754947a99ab012eedc5078858f1a18052d6: this commit fixed code naming issues in netidmgr and modified the api.

It has been tested on a single host and works fine. The functionality of netidmgr has been fully tested.

It has also been tested on two hosts, with one holding the master and a worker and the other holding a worker (with the global directory check disabled). As far as I can tell, it also works fine.

The main issue is that when a user creates a vcluster for the first time, or creates a container on a worker for the first time, there is a non-negligible latency. If put into production, an additional loading animation may be needed.

Commit eeef9c1b429d07278402b591000107d5723925e0:

I have made some modifications to the code, removing vlanid sharing completely.

@HawkALee Congrats! From what I have seen, the bridge setup is handled gracefully in your code by the lxc start scripts. That's an excellent choice! It seems we have reached the same goal via a similar approach, out of different considerations.

@leebaok This should conclude this semester's work. Thanks for your help!

jklj077 commented 8 years ago

@HawkALee Your code works fine for me... maybe something is wrong with your docklet.conf. From the screenshot you showed, the uri was using port 8888; shouldn't it be 8000 by default?

And there may be a small problem with the configuration where a single host runs both a master and a worker. When adding a user, the master tries to set up a vxlan tunnel to the worker, that is, to itself. When starting a container, the worker tries to set up a vxlan tunnel to the master, that is, to itself (because of --may-exist, this has no effect). A tunnel to itself seems weird.
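One possible guard is to skip tunnel creation when the remote endpoint resolves to the local host. A small sketch follows; the helper names are hypothetical, not existing docklet functions, and the local-address detection is best-effort.

# Hypothetical guard against creating a vxlan tunnel from a host to itself,
# e.g. when the master and a worker share one machine. Helper names are
# placeholders, not existing docklet functions.
import socket
import subprocess

def local_addresses():
    """Best-effort set of addresses that refer to this host."""
    return {"127.0.0.1", socket.gethostbyname(socket.gethostname())}

def add_vxlan_port(bridge, port, remote_ip, key):
    if remote_ip in local_addresses():
        return  # remote endpoint is this host itself: no tunnel needed
    subprocess.run(
        ["ovs-vsctl", "--may-exist", "add-port", bridge, port,
         "--", "set", "interface", port, "type=vxlan",
         "options:remote_ip=" + remote_ip, "options:key=" + str(key)],
        check=True)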

Also, the vxlan vni starts from 0 but should start from 1; the vlan id in a virtual switch starts from 2, which is not a problem.

leebaok commented 8 years ago

So sorry for noticing the updates to this issue so late. @jklj077 @HawkALee (Why no notification emails from github?)

@HawkALee 8888 is the master port, but we have a proxy server on port 8000 in front, so you should use 8000 for testing.

Thank you for your work, your ideas, your discussion, and everything else. @jklj077 @HawkALee