loxilb-io / loxilb

eBPF based cloud-native load-balancer for Kubernetes|Edge|Telco|IoT|XaaS.
https://www.loxilb.io
Apache License 2.0
LoxiLB with multi-AZ HA support in AWS not working #813

Closed agixio closed 3 weeks ago

agixio commented 2 months ago

Describe the bug
I followed all the instructions in the docs to deploy loxilb in multi-AZ HA on AWS, but my two loxilb instances time out:

2024-09-27 15:18:29 XSync netRPC Connect - 192.168.228.58:22222 :Fail(dial tcp 192.168.228.58:22222: i/o timeout)
2024-09-27 15:18:24 XSync netRPC Connect - 192.168.218.57:22222 :Fail(dial tcp 192.168.218.57:22222: i/o timeout)

(More errors are in the Google Drive link below.)

To Reproduce
Follow the documentation here: https://github.com/loxilb-io/loxilbdocs/blob/main/docs/aws-multi-az.md

Screenshots
All screenshots are at this link: https://docs.google.com/document/d/1DFvKcP8WCYQhhud0h9FWDaQoYm2iX417t7Uit1ObRcg/edit?usp=sharing

TrekkieCoder commented 2 months ago

I haven't yet looked at the detailed logs, but it seems the loxilb instances can't communicate over the grpc channel. This is usually due to various reasons:

agixio commented 2 months ago

Indeed, I haven't done all of that. I'll test it and get back to you afterwards. Thanks!

UltraInstinct14 commented 2 months ago

The doc did not mention setting up inbound security group rules for the loxilb instances. It has been updated here. You can check whether things are working by trying:

## From loxilb1
$ nc <loxilb2-instance-IP> 11111 -v
Connection to <loxilb2-instance-IP> 11111 port [tcp/*] succeeded!

$ nc <loxilb2-instance-IP> 22222 -v
Connection to <loxilb2-instance-IP> 22222 port [tcp/*] succeeded!

## From loxilb2
$ nc <loxilb1-instance-IP> 11111 -v
Connection to <loxilb1-instance-IP> 11111 port [tcp/*] succeeded!

$ nc <loxilb1-instance-IP> 22222 -v
Connection to <loxilb1-instance-IP> 22222 port [tcp/*] succeeded!
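The same reachability check can be scripted when `nc` is not available on the instance. The following is a minimal sketch, not part of loxilb: the helper names `port_open` and `check_peer` are hypothetical, and the only facts taken from this thread are the two ports (11111 for the REST API and 22222 for XSync netRPC, as seen in the logs).

```python
import socket

# Ports taken from this thread: 11111 (loxilb REST API) and 22222 (XSync netRPC).
PORTS = (11111, 22222)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (as in the "i/o timeout" log lines), refused
        # connections, and unreachable hosts.
        return False


def check_peer(host, ports=PORTS, timeout=3.0):
    """Return a {port: reachable} mapping for the peer loxilb instance."""
    return {port: port_open(host, port, timeout) for port in ports}
```

Calling `check_peer("<loxilb2-instance-IP>")` from loxilb1 (and the reverse from loxilb2) should return `True` for both ports once the inbound security group rules are in place. A `False` entry points to the same kind of blocked traffic that produced the `i/o timeout` errors.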

agixio commented 1 month ago

@UltraInstinct14 @TrekkieCoder Thank you very much for your help. Everything works better with all the rules in place, and now they can communicate properly. The HA status is also fine. However, after completing all the steps, the Elastic IP is still not associated with any of my EC2 instances, which is strange. I followed the exact same network architecture and used the same CIDR as in the tutorial.

We have an error talking to the kernel
2024/09/29 23:00:32 Serving loxilb rest API at http://[::]:11111
2024-09-29 23:00:33 [API] HA POST API called. url : /netlox/v1/config/cistate
2024-09-29 23:00:33 [API] Instance default New HA State : BACKUP, VIP: 0.0.0.0
2024-09-29 23:00:33 [CLUSTER] Instance default Current State NOT_DEFINED Updated State: BACKUP VIP : 0.0.0.0
2024-09-29 23:00:33 failed to get ENI intf name ()
2024-09-29 23:00:33 [API] Load balancer POST API called. url : /netlox/v1/config/loadbalancer
2024-09-29 23:00:33 [API] lbRules : {{35.180.153.68 192.168.248.254 55002 tcp 0 0 false false 2 1800 true  0   0 0 default_nginx-lb1 0} [] [{192.168.36.180 31630 1  }]}
2024-09-29 23:00:33 ep-host added 192.168.36.180_tcp_31630:0
2024-09-29 23:00:33 fullnat:suitable source for 192.168.36.180: 192.168.248.254
2024-09-29 23:00:33 nat lb-rule added - 1:dst-35.180.153.68/32,proto-6,dport-55002,-do-fullnat:eip-192.168.36.180,ep-31630,w-1,alive|
2024-09-29 23:00:33 Added cluster-peer 192.168.228.173
2024-09-29 23:00:33 [DP] LB rule 192.168.248.254 add[OK]
2024-09-29 23:00:35 inactive ep - 192.168.36.180_tcp_31630:tcp:31630(next try after 60s)
2024-09-29 23:00:35 [NLP] Link msgs subscribed

...

2024-09-29 23:00:42 [API] HA POST API called. url : /netlox/v1/config/cistate
2024-09-29 23:00:42 [API] Instance default New HA State : MASTER, VIP: 0.0.0.0
2024-09-29 23:00:42 [CLUSTER] Instance default Current State BACKUP Updated State: MASTER VIP : 0.0.0.0
2024-09-29 23:00:42 no loxiType intf found
2024-09-29 23:00:42 Get xsync()
2024-09-29 23:00:42 XSync netRPC - 192.168.228.173:22222 :Connected
2024-09-29 23:00:42 RPC - CT Xsync Remote-1
2024-09-29 23:00:42 cidrBlock (192.168.248.0/24) associate failed in VPC vpc-09d80cc6acffaad98:operation error EC2: AssociateVpcCidrBlock, https response error StatusCode: 400, RequestID: 6d3a8c7f-ccd7-4cdc-82d3-750360f3c8e9, api error CidrConflict: CIDR range conflicts with 192.168.0.0/16 with association ID vpc-cidr-assoc-0acaa95895001e808
2024-09-29 23:00:44 XSync netRPC - 192.168.228.173:22222 :Reset
2024-09-29 23:00:44 XSync netRPC - 192.168.228.173:22222 :Connected
2024-09-29 23:00:44 RPC -  CT Get 1
2024-09-29 23:00:44 CT Bcast
2024-09-29 23:00:44 [CT]  CTBcast Complete
23:01:02 TRACE loxilb_libdp.c:2592: ct: #192.168.218.226:35631 -> 192.168.0.2:53 (17)# rid:0 est:0 nat:0 (Aged:202194512

TrekkieCoder commented 1 month ago

Sorry for the inconvenience. After trying the scenario from the tutorial doc again, it seems the document has not been updated for the latest versions. I will try to summarize what needs to be done differently.

  1. Run loxilb with a cloudcidrblock subnet that is not currently used in the VPC. For example, use the 124.124.124.0/24 CIDR:

    sudo docker run -u root --cap-add SYS_ADMIN \
    --restart unless-stopped \
    --net=host \
    --privileged \
    -dit \
    -v /dev/log:/dev/log -e AWS_REGION=ap-northeast-2 \
    --name loxilb \
    ghcr.io/loxilb-io/loxilb:latest \
    --cloud=aws --cloudcidrblock=124.124.124.0/24

    Please note that the image to use is ghcr.io/loxilb-io/loxilb:latest. The image labeled "aws-support" has been discontinued.

  2. Change kube-loxilb.yaml so that the private CIDR VIP falls within the above CIDR subnet:

    - --externalCIDR=13.208.X.X/32
    - --privateCIDR=124.124.124.250/32
    - --setRoles=0.0.0.0

  3. Additionally, please edit the security groups of the EKS nodes to allow traffic from this CIDR subnet, because loxilb will use this subnet to send traffic to the EKS nodes in "fullnat" mode.
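The CIDR constraints in the steps above can be sanity-checked offline before restarting loxilb. This is only a sketch using Python's standard `ipaddress` module. The helper names are hypothetical, the VPC CIDR is the one reported in the `CidrConflict` log line, and the check is purely local: it does not query AWS or cover any other rule AWS applies when associating a CIDR block.

```python
import ipaddress


def overlaps_any(candidate, existing):
    """Return the CIDRs in `existing` that overlap with `candidate`."""
    cand = ipaddress.ip_network(candidate)
    return [c for c in existing if cand.overlaps(ipaddress.ip_network(c))]


def vip_in_block(vip, block):
    """Return True if the VIP (given as a /32) lies inside the cloud CIDR block."""
    return ipaddress.ip_network(vip).subnet_of(ipaddress.ip_network(block))


# VPC CIDR taken from the CidrConflict error in the log above.
vpc_cidrs = ["192.168.0.0/16"]

# The block from the failing run overlaps the VPC's own range.
print(overlaps_any("192.168.248.0/24", vpc_cidrs))   # ['192.168.0.0/16']

# The block suggested in step 1 does not overlap.
print(overlaps_any("124.124.124.0/24", vpc_cidrs))   # []

# The privateCIDR VIP from step 2 falls inside the step 1 block.
print(vip_in_block("124.124.124.250/32", "124.124.124.0/24"))  # True
```

The first result mirrors the `AssociateVpcCidrBlock` failure: 192.168.248.0/24 is already covered by the VPC's 192.168.0.0/16, so it cannot be added as a separate block.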

agixio commented 1 month ago

@TrekkieCoder Thank you. No problem at all regarding the documentation being slightly out of date; I completely understand. I'm happy to help update it so that it is accessible to everyone. I followed your configuration exactly as provided, but it didn't work. I then tried various combinations of your arguments and those from the documentation, but I am still unable to attach the Elastic IP. I don't know where I went wrong.

[Three screenshots attached: 2024-09-30 at 23:32:20, 23:34:08, and 23:38:46]

Log trace link: https://docs.google.com/document/d/1ZFliJHdDZ3ruM30V6TxiICTCBxl1HsECS6B3hcVlZCg/edit?usp=sharing

TrekkieCoder commented 1 month ago

Hi @agixio ,

Your log trace file can't be opened because it asks for permission.

https://docs.google.com/document/d/1ZFliJHdDZ3ruM30V6TxiICTCBxl1HsECS6B3hcVlZCg/edit?usp=sharing

agixio commented 1 month ago

My bad: https://docs.google.com/document/d/18xf9R6WpDWyzmxnzM_sOny9gP2S-hwOpIkPgPahYkN4/edit?usp=sharing

TrekkieCoder commented 1 month ago

I double-checked the logs, but there don't seem to be any logs related to service creation?

agixio commented 1 month ago

@TrekkieCoder: My service doesn't work, but sorry, I sent you the old log from the deprecated Docker image. I have updated everything, including all the pictures:

https://docs.google.com/document/d/1MEkQzmVEvkiOgLSS6OnqyAVaZLnNgrl2-I8q6KDTipg/edit?usp=sharing

TrekkieCoder commented 3 weeks ago

I am closing this since the issue has been resolved via discussion in the loxilb Slack channel.