
[Feature]: Allow user to specify listening ip address for milvus components #26538

Status: Open · LoveEachDay opened this issue 1 year ago

LoveEachDay commented 1 year ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

By default, milvus components use funcutil.GetLocalIP() to initialize the grpc listening ip address, which may not work when the pod/container has multiple network interfaces, such as under docker swarm; see also: https://github.com/milvus-io/milvus/issues/25032

func (p *grpcConfig) init(domain string, base *BaseTable) {
    p.Domain = domain
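    // the listening ip is auto-detected and cannot be overridden from config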
    p.IP = funcutil.GetLocalIP()

    p.Port = ParamItem{
        Key:          p.Domain + ".port",
        Version:      "2.0.0",
        DefaultValue: strconv.FormatInt(ProxyExternalPort, 10),
        Export:       true,
    }
    p.Port.Init(base.mgr)
       ...
}
// GetLocalIP returns the local ip address
func GetLocalIP() string {
    addrs, err := net.InterfaceAddrs()
    if err == nil {
        for _, addr := range addrs {
            ipaddr, ok := addr.(*net.IPNet)
            if ok && ipaddr.IP.IsGlobalUnicast() && ipaddr.IP.To4() != nil {
                return ipaddr.IP.String()
            }
        }
    }
    return "127.0.0.1"
}
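
To see why this is fragile on a multi-homed host, here is a minimal standalone sketch (my own illustration, not Milvus code) that prints every candidate GetLocalIP would consider, in the order the OS reports them; whichever comes first is the address Milvus ends up advertising:

package main

import (
    "fmt"
    "net"
)

// List every address that would pass GetLocalIP's checks, in OS order.
// The first line of output is the address GetLocalIP would return.
func main() {
    addrs, err := net.InterfaceAddrs()
    if err != nil {
        panic(err)
    }
    for _, addr := range addrs {
        if ipnet, ok := addr.(*net.IPNet); ok &&
            ipnet.IP.IsGlobalUnicast() && ipnet.IP.To4() != nil {
            fmt.Println("candidate:", ipnet.IP.String())
        }
    }
}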

Describe the solution you'd like.

In milvus.yaml we can add an address config for each component, which allows the user to specify a routable address like this:

# configs/milvus.yaml
queryCoord:
   address: <ip address>
   port: 19531

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

LoveEachDay commented 1 year ago

/assign @jiaoew1991

xiaofan-luan commented 1 year ago

/assign @smellthemoon can you help on it?

congqixia commented 1 year ago

@LoveEachDay will "0.0.0.0" do for the docker swarm case?

LoveEachDay commented 1 year ago

@congqixia 0.0.0.0 may not work. Each Milvus component stores its serving ip address in etcd so that other components can reach it. If a component listens on all network interfaces, which ip address should we store in the session?

We have two problems here:

  1. which ip address to listen on,
  2. the ip address must be routable.

Binding to 0.0.0.0 may solve (1), but does not solve (2).

smellthemoon commented 1 year ago

working on it

ieugen commented 1 year ago

@LoveEachDay : I believe in other distributed systems there is an --advertise-address option where you can set what address to advertise to clients.

--advertise-address string   | The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used.
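
As a hedged illustration of that fallback chain (the names here are mine, not a real Milvus or Kubernetes API; GetLocalIP is the thread's existing helper): the advertise address wins, then a concrete bind address, then the host's default interface.

func advertiseAddr(advertise, bind string) string {
    if advertise != "" {
        return advertise
    }
    if bind != "" && bind != "0.0.0.0" {
        // a concrete bind address is routable, unlike the wildcard
        return bind
    }
    // fall back to the host's default interface
    return GetLocalIP()
}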

ieugen commented 1 year ago

In the case of docker and kubernetes, IP addresses are dynamically allocated.

It might make sense to be able to specify a CIDR network range to filter the IP, or a network interface name as well?!
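
The interface-name variant could look roughly like this (a sketch under my own naming; net.InterfaceByName and Interface.Addrs are standard library calls):

// ipByInterfaceName resolves an IPv4 address from a user-supplied
// interface name such as "eth0" (hypothetical helper).
func ipByInterfaceName(name string) (string, error) {
    iface, err := net.InterfaceByName(name)
    if err != nil {
        return "", err
    }
    addrs, err := iface.Addrs()
    if err != nil {
        return "", err
    }
    for _, addr := range addrs {
        if ipnet, ok := addr.(*net.IPNet); ok && ipnet.IP.To4() != nil {
            return ipnet.IP.String(), nil
        }
    }
    return "", fmt.Errorf("no IPv4 address on interface %s", name)
}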

LoveEachDay commented 1 year ago

@ieugen We use that approach to set the advertise address. In Kubernetes, only one pod ip address is assigned, so you can leave the address field empty, which uses the default ip address.

Passing a cidr may not be feasible. It's the user's responsibility to make sure the ip address is routable and can be statically assigned.

ieugen commented 1 year ago

@LoveEachDay

I think some filtering could be implemented so that the code would work like this; see the original code below my pseudocode attempt (I don't know Go).

func GetIP(ip string, options *IPOptions) string {
    if len(ip) == 0 {
        return GetLocalIP(options)
    }
    return ip
}

// IPOptions carries the user-supplied filtering criteria,
// e.g. a CIDR such as "10.0.1.0/24".
type IPOptions struct {
    CIDR string
}

// GetLocalIP returns the first local ip address that matches the options
func GetLocalIP(options *IPOptions) string {
    addrs, err := net.InterfaceAddrs()
    if err == nil {
        for _, addr := range addrs {
            ipaddr, ok := addr.(*net.IPNet)
            if ok && matchIPOptions(ipaddr, options) {
                return ipaddr.IP.String()
            }
        }
    }
    return "127.0.0.1"
}

// matchIPOptions keeps the original checks (IsGlobalUnicast, IPv4) and
// additionally requires the address to fall inside the CIDR, if one is set.
// It uses https://pkg.go.dev/net#ParseCIDR to build the network and
// https://pkg.go.dev/net#IPNet.Contains to test membership.
func matchIPOptions(ipaddr *net.IPNet, options *IPOptions) bool {
    if !ipaddr.IP.IsGlobalUnicast() || ipaddr.IP.To4() == nil {
        return false
    }
    if options == nil || options.CIDR == "" {
        return true
    }
    _, subnet, err := net.ParseCIDR(options.CIDR)
    if err != nil {
        return false
    }
    return subnet.Contains(ipaddr.IP)
}
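
With that shape, a caller could pin selection to a specific subnet with something like GetIP("", &IPOptions{CIDR: "10.0.1.0/24"}); the CIDR value here is purely illustrative.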

Original code here https://github.com/milvus-io/milvus/blob/eca79d149cc2d5522f43a1f2c5f2a5bfcc71c8ca/pkg/util/funcutil/func.go#L62C13-L62C13

func GetIP(ip string) string {
    if len(ip) == 0 {
        return GetLocalIP()
    }
    return ip
}

// GetLocalIP return the local ip address
func GetLocalIP() string {
    addrs, err := net.InterfaceAddrs()
    if err == nil {
        for _, addr := range addrs {
            ipaddr, ok := addr.(*net.IPNet)
            if ok && ipaddr.IP.IsGlobalUnicast() && ipaddr.IP.To4() != nil {
                return ipaddr.IP.String()
            }
        }
    }
    return "127.0.0.1"
}

ieugen commented 1 year ago

@LoveEachDay : I have added more information related to this issue in https://github.com/milvus-io/milvus/issues/25032#issuecomment-1780720924.

In docker swarm, adding healthchecks to milvus makes the service run eventually: it restarts until it binds to the proper interface / IP address. I guess this is due to how the IP is selected.

Perhaps different components each go through the IP selection process and reach different results, leading to failures?!

xiaofan-luan commented 1 year ago

this should be supported in milvus 2.3.2

ieugen commented 1 year ago

thank you @xiaofan-luan . I am curious which parts. For docker swarm, since we can't specify the container IP, we would need a CIDR-based filter (we specify the CIDR the app should use to filter interfaces; when it finds an IP in that CIDR range, it should bind to it).

There is an old issue with swarm and IP addresses for containers: https://github.com/moby/moby/issues/24170.

To give an idea, this is what ip addr shows inside a container. Container IP addresses are randomly assigned: one is the ip inside the docker_gwbridge network and the other is the container IP for its associated network.

root@dev_db-1-dev1:/# ip addr 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
678: eth0@if679: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UP group default 
    link/ether REDACTED brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.1.186/24 brd 10.0.1.255 scope global eth0
       valid_lft forever preferred_lft forever
680: eth1@if681: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether REDACTED brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
       valid_lft forever preferred_lft forever

ieugen commented 8 months ago

On a server with an IPv6 address, the service restarts continuously, so my workaround for the IP binding bug is not working. I have to set up a special IPv4-only server for milvus. If it were up to me I would ban milvus, but I only manage the apps, so I have to put up with it.

xiaofan-luan commented 8 months ago

How would you deploy any database if you couldn't specify which port to use? For example, pg uses 5432.

Regarding the 0.0.0.0 discussion above: the internal design of milvus works fine. It stores both the endpoint and the port.

I think the problem is how the SDK connects to milvus.

xiaofan-luan commented 8 months ago

None of us are docker swarm experts, and milvus is primarily designed for K8s. Feel free to open a PR describing how to deploy milvus on docker swarm; this could definitely help more users.

ieugen commented 8 months ago

hi @xiaofan-luan ,

Thanks for replying. It's not docker swarm specific. The issue is how milvus selects the interface to bind to and to connect to other services. Right now it's completely non-deterministic and non-configurable, so on any system with more than one network interface it is likely to fail.

It's more prevalent in Docker Swarm because swarm has one network interface per network (two, plus localhost, at minimum) and I can't do anything about that, unfortunately (a Swarm limitation, not milvus). This is unfortunate for me because I have to deal with it.

Yesterday I was very helpless when I had to "fix" this after a 12h weekend upgrade of our servers. I was tired and everything was working ok except milvus. It managed to stir up a lot of frustration.

The logic for selecting the IP is shown above. As you can see, it's a for loop that takes the first matching network interface it finds. If the OS list is random, it might work or it might not. If the OS list is sorted and the first valid interface is not the right one, it will always fail.

https://github.com/milvus-io/milvus/issues/26538#issuecomment-1752440264

I considered fixing this myself, but I don't know Go and I am not working with milvus directly. I just deploy and maintain it, and have to deal with the outages, etc.

For some reason it started working after a few hours and a lot of config tweaking. I am afraid to restart it and migrate it, but I do have to do server maintenance, so there is no avoiding that. The only solution IMO would be a fix where users can select the network interface either by name or, better IMO, by matching the interface IP to a CIDR subnet supplied by the user.

The user supplies a 10.11.0.0/24 CIDR (or an IPv6 one?!) and milvus checks whether a network interface has an IP in that range. If no such interface is found, an error is thrown (since the user requested a specific network interface).

The code IMO should look something like this:

package main

import (
    "fmt"
    "net"
)

func main() {
    network := "192.168.5.0/24"
    clientips := []string{
        "192.168.5.1",
        "192.168.6.0",
    }
    // parse the user-supplied CIDR into a *net.IPNet
    _, subnet, _ := net.ParseCIDR(network)
    for _, clientip := range clientips {
        ip := net.ParseIP(clientip)
        // keep only addresses inside the subnet; prints "IP in subnet 192.168.5.1"
        if subnet.Contains(ip) {
            fmt.Println("IP in subnet", clientip)
        }
    }
}
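
Tying that together, a hedged sketch of the selection rule described above (ipInCIDR is my own hypothetical helper, not a Milvus API): return the first local address inside the user-supplied CIDR, and fail loudly when nothing matches, since the user asked for a specific network.

// ipInCIDR returns the first local interface address contained in the
// given CIDR, or an error when no interface matches (hypothetical helper).
func ipInCIDR(cidr string) (string, error) {
    _, subnet, err := net.ParseCIDR(cidr)
    if err != nil {
        return "", err
    }
    addrs, err := net.InterfaceAddrs()
    if err != nil {
        return "", err
    }
    for _, addr := range addrs {
        if ipnet, ok := addr.(*net.IPNet); ok && subnet.Contains(ipnet.IP) {
            return ipnet.IP.String(), nil
        }
    }
    return "", fmt.Errorf("no interface address inside %s", cidr)
}
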
xiaofan-luan commented 8 months ago

This could be an option: add a useCIDR config and let the admin manage this. It seems extremely difficult for a distributed database to work with docker swarm. Sorry about the trouble. Maybe using standalone can solve the problem somehow.

@LoveEachDay what do you think about adding this cidr check while also checking IsGlobalUnicast?

LoveEachDay commented 8 months ago

@xiaofan-luan @ieugen Adding a cidr check would be a good option to cover the scenario where milvus components have multiple ips. I would suggest we add a config option in milvus.yaml like this:

# configs/milvus.yaml
rootCoord:
  ip: ""
  cidr: ""
  port: 53100

And in pkg/util/paramtable/grpc_prams.go, we hook into the config initialization logic here:

func (p *grpcConfig) init(domain string, base *BaseTable) {
    p.Domain = domain
    ipItem := ParamItem{
        Key:          p.Domain + ".ip",
        Version:      "2.3.3",
        DefaultValue: "",
        Export:       true,
    }
    ipItem.Init(base.mgr)
    // here add the cidr initialization, mirroring the ip ParamItem above
    p.IP = funcutil.GetIP(ipItem.GetValue()) // override the GetIP util helper to take the cidr into account
...
}
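
For completeness, a minimal sketch of what that extended helper might look like, assuming a hypothetical ipInCIDR helper like the one sketched earlier in the thread: an explicit ip wins, then the cidr filter, then the legacy first-match auto-detection.

// GetIP resolves the advertised address (sketch, not the actual patch):
// explicit config first, then the cidr filter, then auto-detection.
func GetIP(ip, cidr string) string {
    if ip != "" {
        return ip
    }
    if cidr != "" {
        if matched, err := ipInCIDR(cidr); err == nil {
            return matched
        }
    }
    return GetLocalIP()
}
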
ieugen commented 8 months ago

Any plans to get this feature into a release?