gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: I have set up an RQLite cluster but Netmaker is unable to connect to the db. #1173

Closed JamesD4 closed 2 years ago

JamesD4 commented 2 years ago

Contact Details

james@duffy.app

What happened?

Documentation: Netmaker will be started on each node with default settings, except with DATABASE=rqlite (or DATABASE=postgres) and SQL_CONN set appropriately to reach the local rqlite instance. Rqlite will maintain consistency with each Netmaker backend.

Issue: I have set up an RQLite cluster on two hosts on a private subnet but Netmaker is unable to connect to the db.

The machines have joined an RQLite cluster successfully.

The documentation states to set SQL_CONN appropriately - how should this be formatted for Netmaker/RQLite? Could you provide an example, please?

Thanks and best wishes, James

Version

v0.14.2

What OS are you using?

Linux

Relevant log output

NETMAKER
[netmaker] 2022-06-03 00:48:47 connecting to rqlite 
[netmaker] 2022-06-03 00:48:47 unable to connect to db, retrying . . . 
[netmaker] 2022-06-03 00:48:49 unable to connect to db, retrying . . . 
[netmaker] 2022-06-03 00:48:51 unable to connect to db, retrying . . . 
[netmaker] 2022-06-03 00:48:53 unable to connect to db, retrying . . . 
[netmaker] 2022-06-03 00:48:55 unable to connect to db, retrying . . . 
[netmaker] 2022-06-03 00:48:57 unable to connect to db, retrying . . . 
[netmaker] Fatal: Error connecting to database 

RQLITE
root@dev:~# docker logs -f 99668bdc255a

            _ _ _
           | (_) |
  _ __ __ _| |_| |_ ___
 | '__/ _  | | | __/ _ \   The lightweight, distributed
 | | | (_| | | | ||  __/   relational database.
 |_|  \__, |_|_|\__\___|
         | |               www.rqlite.io
         |_|

[rqlited] 2022/06/02 23:56:10 rqlited starting, version v7.5.0, commit 3fa6c506726962bff3db4a9956f2bc662b77a12e, branch master, compiler gc
[rqlited] 2022/06/02 23:56:10 go1.17, target architecture is amd64, operating system target is linux
[rqlited] 2022/06/02 23:56:10 launch command: /bin/rqlited -http-adv-addr 99668bdc255a:4001 -node-id 1 -http-addr 0.0.0.0:4001 -raft-addr 0.0.0.0:4002 -http-adv-addr 10.1.0.3:4001 -raft-adv-addr 10.1.0.3:4002 -auth /host/auth.json /rqlite/file/data
[rqlited] 2022/06/02 23:56:10 Raft TCP mux Listener registered with 1
[rqlited] 2022/06/02 23:56:10 no preexisting node state detected in /rqlite/file/data, node may be bootstrapping
[store] 2022/06/02 23:56:10 opening store with node ID 1
[store] 2022/06/02 23:56:10 configured for an in-memory database
[store] 2022/06/02 23:56:10 ensuring directory for Raft exists at /rqlite/file/data
[store] 2022/06/02 23:56:10 0 preexisting snapshots present
[mux] 2022/06/02 23:56:10 mux serving on [::]:4002, advertising 10.1.0.3:4002
[store] 2022/06/02 23:56:10 first log index: 0, last log index: 0, last command log index: 0:
[store] 2022/06/02 23:56:10 created in-memory database at open
2022-06-02T23:56:10.489Z [INFO]  raft: initial configuration: index=0 servers=[]
[cluster] 2022/06/02 23:56:10 service listening on 10.1.0.3:4002
[rqlited] 2022/06/02 23:56:10 cluster TCP mux Listener registered with 2
[http] 2022/06/02 23:56:10 execute queue processing started with capacity 128, batch size 16, timeout 50ms
[http] 2022/06/02 23:56:10 service listening on [::]:4001
[rqlited] 2022/06/02 23:56:10 bootstraping single new node
2022-06-02T23:56:10.490Z [INFO]  raft: entering follower state: follower="Node at 10.1.0.3:4002 [Follower]" leader-address= leader-id=
[rqlited] 2022/06/02 23:56:10 node HTTP API available at http://10.1.0.3:4001
[rqlited] 2022/06/02 23:56:10 connect using the command-line tool via 'rqlite -H 10.1.0.3 -p 4001'
2022-06-02T23:56:12.361Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2022-06-02T23:56:12.361Z [INFO]  raft: entering candidate state: node="Node at 10.1.0.3:4002 [Candidate]" term=2
2022-06-02T23:56:12.365Z [INFO]  raft: election won: tally=1
2022-06-02T23:56:12.365Z [INFO]  raft: entering leader state: leader="Node at 10.1.0.3:4002 [Leader]"
[store] 2022/06/03 00:01:58 received request from node with ID 2, at 10.1.0.4:4002, to join this node
2022-06-03T00:01:58.673Z [INFO]  raft: updating configuration: command=AddVoter server-id=2 server-addr=10.1.0.4:4002 servers="[{Suffrage:Voter ID:1 Address:10.1.0.3:4002} {Suffrage:Voter ID:2 Address:10.1.0.4:4002}]"
2022-06-03T00:01:58.676Z [INFO]  raft: added peer, starting replication: peer=2
[store] 2022/06/03 00:01:58 node with ID 2, at 10.1.0.4:4002, joined successfully as voter
2022-06-03T00:01:58.687Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter 2 10.1.0.4:4002}" next=1
2022-06-03T00:01:58.693Z [INFO]  raft: pipelining replication: peer="{Voter 2 10.1.0.4:4002}"
2022-06-03T00:48:45.516Z [INFO]  raft: aborting pipeline replication: peer="{Voter 2 10.1.0.4:4002}"
2022-06-03T00:48:45.570Z [ERROR] raft: failed to heartbeat to: peer=10.1.0.4:4002 error=EOF
2022-06-03T00:48:45.616Z [ERROR] raft: failed to appendEntries to: peer="{Voter 2 10.1.0.4:4002}" error="dial tcp 10.1.0.4:4002: connect: connection refused"
2022-06-03T00:48:45.683Z [ERROR] raft: failed to appendEntries to: peer="{Voter 2 10.1.0.4:4002}" error="dial tcp 10.1.0.4:4002: connect: connection refused"
2022-06-03T00:48:46.017Z [WARN]  raft: failed to contact: server-id=2 time=500.894941ms
2022-06-03T00:48:46.018Z [WARN]  raft: failed to contact quorum of nodes, stepping down
2022-06-03T00:48:46.018Z [INFO]  raft: entering follower state: follower="Node at 10.1.0.3:4002 [Follower]" leader-address= leader-id=
[store] 2022/06/03 00:48:46 received request from node with ID 2, at 10.1.0.4:4002, to join this node
2022-06-03T00:48:47.027Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2022-06-03T00:48:47.027Z [INFO]  raft: entering candidate state: node="Node at 10.1.0.3:4002 [Candidate]" term=3
2022-06-03T00:48:47.040Z [INFO]  raft: election won: tally=2
2022-06-03T00:48:47.040Z [INFO]  raft: entering leader state: leader="Node at 10.1.0.3:4002 [Leader]"
2022-06-03T00:48:47.040Z [INFO]  raft: added peer, starting replication: peer=2
2022-06-03T00:48:47.044Z [INFO]  raft: pipelining replication: peer="{Voter 2 10.1.0.4:4002}"
[store] 2022/06/03 00:48:49 received request from node with ID 2, at 10.1.0.4:4002, to join this node
[store] 2022/06/03 00:48:49 node 2 at 10.1.0.4:4002 already member of cluster, ignoring join request

JamesD4 commented 2 years ago

For clarity, I am attempting to set up Netmaker in a highly available configuration on bare metal / VMs. I have tried setting SQL_CONN to http://IP:4001 and DATABASE to rqlite. Thank you!

MinDBreaK commented 2 years ago

In case anyone needs it, the correct format is the following:

SQL_CONN=<http|https>://[<user>:<password>@]<hostname>[:<port>]/
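
To illustrate the format above, here is a minimal Python sketch that assembles an SQL_CONN value from its parts. The helper name `build_sql_conn` and the sample credentials are hypothetical. The address 10.1.0.3:4001 is the rqlite HTTP address from the log earlier in this issue. Percent-encoding the credentials is an assumption about what is safe inside a URL, not something confirmed for Netmaker.

```python
from urllib.parse import quote


def build_sql_conn(host, port=None, user=None, password=None, scheme="http"):
    """Assemble <http|https>://[<user>:<password>@]<hostname>[:<port>]/ ."""
    if scheme not in ("http", "https"):
        raise ValueError("scheme must be http or https")
    auth = ""
    if user is not None and password is not None:
        # Percent-encode so characters such as '@' or ':' cannot break the URL.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    netloc = host if port is None else f"{host}:{port}"
    return f"{scheme}://{auth}{netloc}/"


if __name__ == "__main__":
    # Hypothetical credentials; replace them with the ones from your rqlite auth file.
    print("SQL_CONN=" + build_sql_conn("10.1.0.3", 4001, "netmaker", "s3cret"))
```

Without credentials the same helper yields the plain `http://<hostname>:<port>/` form, including the trailing slash.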

Also, one of the issues preventing the connection was that the Raft URL had changed between two configurations, which made it fail as well. Netmaker could provide more debug information here; what helped me was running tcpdump on the HTTP requests made to the server.
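
As a lighter-weight alternative to tcpdump, a sketch like the following can check whether the rqlite HTTP API answers at the address given in SQL_CONN before Netmaker is started. It assumes rqlite's `/status` endpoint is exposed and that the credentials in the URL are allowed to query it. The helper name `status_request` is hypothetical, and this is not how Netmaker itself connects.

```python
import base64
import os
import urllib.request
from urllib.parse import urlsplit, unquote


def status_request(sql_conn):
    """Build a GET request for rqlite's /status endpoint from an SQL_CONN URL."""
    parts = urlsplit(sql_conn)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {sql_conn!r}")
    host = parts.hostname
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    req = urllib.request.Request(f"{parts.scheme}://{host}/status")
    if parts.username is not None:
        # urllib does not send URL-embedded credentials itself, so add Basic auth.
        cred = f"{unquote(parts.username)}:{unquote(parts.password or '')}"
        token = base64.b64encode(cred.encode()).decode()
        req.add_header("Authorization", f"Basic {token}")
    return req


if __name__ == "__main__":
    # Reads the same SQL_CONN environment variable that Netmaker is given.
    sql_conn = os.environ.get("SQL_CONN")
    if sql_conn:
        try:
            with urllib.request.urlopen(status_request(sql_conn), timeout=5) as resp:
                print("reachable, HTTP", resp.status)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print("not reachable:", exc)
```

A refused connection or a 401 response here points at the address or the credentials rather than at Netmaker.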

JamesD4 commented 2 years ago

That's great, thanks @MinDBreaK. I'll give this a go!

afeiszli commented 2 years ago

Closing. We had issues with previous versions of HA that should be resolved in the most recent version.