gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io
Other
9.44k stars 550 forks source link

[Bug]: opening https://dashboard.nm.[IP].nip.io/enrollment-keys takes 30 sec #3026

Closed cryply closed 1 month ago

cryply commented 2 months ago

Contact Details

No response

What happened?

Recently I have found that EnrollmentKeys generation page taking very long to load. About 30sec. It is bearable but annoying. Click on createkey also takes a minute. Same 30 sec during login. Also OK. Worst is when u join host with netclient. There timeout exceeded and I simply cannot join client to netmaker. Has to use wireguard client.

image

image

OS: Ubuntu 22.04

Version

v0.23.0

What OS are you using?

Linux

Relevant log output

172.20.0.1 - - [03/Aug/2024:09:01:23 +0000] "GET /enrollment-keys HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:09:01:23 +0000] "GET /loader.svg HTTP/1.1" 200 593 "https://dashboard.nm.MYIP.nip.io/enrollment-keys" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:09:06:19 +0000] "GET /enrollment-keys HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:09:06:20 +0000] "GET /loader.svg HTTP/1.1" 200 593 "https://dashboard.nm.MYIP.nip.io/enrollment-keys" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:09:06:20 +0000] "GET /nmui-config.js HTTP/1.1" 200 156 "https://dashboard.nm.MYIP.nip.io/enrollment-keys" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:09:12:16 +0000] "GET /enrollment-keys HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:12:23:35 +0000] "GET /enrollment-keys HTTP/1.1" 304 0 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:12:23:35 +0000] "GET /loader.svg HTTP/1.1" 200 593 "https://dashboard.nm.MYIP.nip.io/enrollment-keys" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:12:23:35 +0000] "GET /nmui-config.js HTTP/1.1" 200 156 "https://dashboard.nm.MYIP.nip.io/enrollment-keys" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1"
172.20.0.1 - - [03/Aug/2024:12:23:35 +0000] "GET /login HTTP/1.1" 200 1233 "https://dashboard.nm.MYIP.nip.io/login" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "172.20.0.1

in netmaker docker logs:

{"time":"2024-08-03T12:38:52.641384513Z","level":"ERROR","source":"handlers.go 95}","msg":"failed to decrypt message for host","id":"XXXXXXX-XXXX-XXXX-XXXXXXXXX","error":"could not decrypt message, [

Contributing guidelines

yabinma commented 2 months ago

The decrypt message issue was reported in https://github.com/gravitl/netmaker/issues/2686. There was changes in v0.24.3. The issue should not be in server side in latest version.

Regarding the join slow issue, what's the server size in hardware and how many hosts totally in the network?

cryply commented 2 months ago

Thanks Max Ma

  1. I will try to upgrade server to v0.24.3. Is it 100% necessary to upgrade netclients?
  2. Server is 12 Core , 64GB RAM physical server. 100% idle.
  3. We have about 80 hosts.

Problem started AFAIK after I have tinkered with docker-compose.yml

and changed : image: coredns/coredns -> image: coredns/coredns:1.10.1 It did actually started CoreDNS but I am still not sure how clients can access it. also I commented all turn related stuff. and restart.

cryply commented 2 months ago

Regarding join

docker compose logs -f

netmaker       | [netmaker] 2024-08-04 14:25:11 received registration attempt with token eXXXXXXXXXXX9 
netmaker       | [netmaker] 2024-08-04 14:25:51 user1 xxxxxxxx-cccccc-zzzzz-yyyyy-xxxxxxxx registered with Netmaker 
netmaker       | [netmaker] 2024-08-04 14:25:51 added new node 11111-22222-33333-44444-555555555 to host user1 

but on client side I get:

sudo netclient join -t eXXXXXXXXXXXXXXXX9
setting host fields
[netclient] 2024-08-04 16:25:41 registration failed Post "https://api.nm.MUIP.nip.io/api/v1/host/register/
eXXXXXXXXXXXXXXXXXXXXXXX9": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
cryply commented 2 months ago

lol. I have stopped CoreDNS and managed to join client! I also monitored process on join in htop. CoreDNS skyrocket for 100% CPU for few sec.

cryply commented 2 months ago

Unable to join again

[netclient] 2024-08-05 11:12:21 registration failed Post "https://api.nm.PYIP.nip.io/api/v1/host/register/ETOKEYXXX": context deadline exceeded (Client.Timeout exceeded while awaiting headers) - OMG this error is back. No matter if CoreDNS up/down.

yabinma commented 2 months ago

Unable to join again

[netclient] 2024-08-05 11:12:21 registration failed Post "https://api.nm.PYIP.nip.io/api/v1/host/register/ETOKEYXXX": context deadline exceeded (Client.Timeout exceeded while awaiting headers) - OMG this error is back. No matter if CoreDNS up/down.

Is there any error from server side? Even without any error, what's the log when the server received the registration request?

cryply commented 2 months ago

On server side it is working fine(just taking long). client added but just 15 seconds after popup error on client side.

yabinma commented 2 months ago

On server side it is working fine(just taking long). client added but just 15 seconds after popup error on client side.

How long does it take in server side? From the registration is received to node added. What's the network range when adding a new network?

cryply commented 2 months ago

netmaker | [netmaker] 2024-08-05 12:44:55 received registration attempt with token eToken netmaker | [netmaker] 2024-08-05 12:45:35 hostX xxxxxxx-rrrrr-gggg-ffff-zzzzzzz registered with Netmaker
netmaker | [netmaker] 2024-08-05 12:45:35 added new node yyyyyyy-wwwww-qqqq-hhhhh-zzzzzzz to host hostX

network range /16

yabinma commented 2 months ago

netmaker | [netmaker] 2024-08-05 12:44:55 received registration attempt with token eToken netmaker | [netmaker] 2024-08-05 12:45:35 hostX xxxxxxx-rrrrr-gggg-ffff-zzzzzzz registered with Netmaker netmaker | [netmaker] 2024-08-05 12:45:35 added new node yyyyyyy-wwwww-qqqq-hhhhh-zzzzzzz to host hostX

network range /16

There is no timeout in default settings. The client side should be waiting for the response from sever side even it took 40 sec as shown in the log. Is there any timeout setting in network?

cryply commented 2 months ago

time sudo netclient join -t eXXXXXX setting host fields [netclient] 2024-08-05 13:38:20 registration failed Post "https://api.nm.IP.nip.io/api/v1/host/register/eXXXXXXXXXXX" : context deadline exceeded (Client.Timeout exceeded while awaiting headers)

real 0m30.110s user 0m0.009s sys 0m0.000s

cryply commented 2 months ago

It was working for 1 year without issues.

cryply commented 2 months ago

30 seconds in client

image

cryply commented 2 months ago

It seems it comes from devilcoves httpclient

https://github.com/devilcove/httpclient/blob/master/httpclient.go

func init() { Client = http.Client{ Timeout: 30 * time.Second, } }

yabinma commented 2 months ago

It seems it comes from devilcoves httpclient

https://github.com/devilcove/httpclient/blob/master/httpclient.go

func init() { Client = http.Client{ Timeout: 30 * time.Second, } }

This should be the right timeout setting for restful api.

Revisited the log and found abnormal, netmaker | [netmaker] 2024-08-05 12:44:55 received registration attempt with token eToken netmaker | [netmaker] 2024-08-05 12:45:35 hostX xxxxxxx-rrrrr-gggg-ffff-zzzzzzz registered with Netmaker

The host registration took more than 30s already(before the htttp response). I did not meet the issue before.

Can you please change the server log verbosity to 4 and re-produce the issue and let's see if there are more clues found?

cryply commented 2 months ago

Can you please change the server log verbosity to 4 and re-produce the issue and let's see if there are more clues found? How?

yabinma commented 2 months ago

Can you please change the server log verbosity to 4 and re-produce the issue and let's see if there are more clues found? How?

Edit netmaker.env in install directory and change to VERBOSITY=4, and then restart the containers.

cryply commented 2 months ago

Now , after restarting , its registering client in 20 sec. Will monitor further

cryply commented 2 months ago

Creating Enrollment Token faster now. Takes only 15 sec not 1 min as before. In netmaker logs I can see :

netmaker       | [netmaker] 2024-08-06 09:17:47 [hosts.go-73] getHosts(): netadmin fetched all hosts                                                                                                                                                                                                                           
netmaker       | [netmaker] 2024-08-06 09:17:47 [node.go-324] getAllNodes(): netadmin fetched all nodes they have access to                                                                                                                                                                                                    
netmaker       | {"time":"2024-08-06T09:17:47.273107578Z","level":"INFO","source":"handlers.go 103}","msg":"recieved host update","name":"myuser","id":"XXXX"}                                                                                                                     
netmaker       | {"time":"2024-08-06T09:17:47.292303329Z","level":"INFO","source":"handlers.go 304}","msg":"check-in processed for host","name":"myuser","id":"xxxxxxx"}                                                                                                              
....... repeat of above for all nm hosts......
netmaker       | [netmaker] 2024-08-06 09:17:57 [hosts.go-73] getHosts(): netadmin fetched all hosts                                                                                                                                                                                                                           
netmaker       | [netmaker] 2024-08-06 09:17:57 [node.go-324] getAllNodes(): netadmin fetched all nodes they have access to  
netmaker       | [netmaker] 2024-08-06 09:18:02 [enrollmentkeys.go-178] createEnrollmentKey(): netadmin created enrollment key                                                                                                                                                                                                 
netmaker       | [netmaker] 2024-08-06 09:18:02 [publishers.go-237] sendPeers(): sending scheduled peer update (5 min)  

so u can see timing... token generated after getAllNodes() executed.

well its happened 2 times...

testingNetqa commented 2 months ago

Unfortunately, I have to say that the problem described occurs very frequently in our company. This means that 80% of the netclient pull executions end in this error. It would be very helpful if either the timeout was set to a higher value by default or if it could be configured.

cryply commented 1 month ago

Ok you can close it. Issue was DNS request wasnt processed ion netmaker hosts and as results : and access to ip.server.gravitl.com was blocked. after enabling DNS all start working

yabinma commented 1 month ago

closing as request. Thanks for the update.