netdata / msi-installer

Netdata installer for Windows using WSL2
GNU General Public License v3.0

Installation issue w/ release 2022-11-09133750 #26

Open hugovalente-pm opened 1 year ago

hugovalente-pm commented 1 year ago

Was trying to install netdata and I guess this is the release used https://github.com/netdata/msi-installer/releases/tag/2022-11-09133750 (based on date/time)

Steps:

  1. I had removed my previous netdata.msi installation but I had the wmi_exporter already installed
  2. Downloaded the netdata.msi mentioned above
  3. Ran it in an Admin PowerShell with msiexec.exe /i C:\Users\hugoj\netdata\netdata.msi TOKEN=<space-token> URL=https://app.netdata.cloud
  4. Saw the PC reboot
  5. Installation tried to resume and was stuck at REGISTERING NETDATA DISTRO WITH WSL2
hugovalente-pm commented 1 year ago

On the logfile netdata.log I see image

dfpr commented 1 year ago

Possibly https://github.com/microsoft/WSL/issues/8714 Please attach WSL logs using these instructions https://github.com/Microsoft/WSL/blob/master/CONTRIBUTING.md#8-detailed-logs

hugovalente-pm commented 1 year ago

could you share with me your e-mail? not sure these logs are safe to be shared public my email: hugo@netdata.cloud

dfpr commented 1 year ago

Sent you an email.

hugovalente-pm commented 1 year ago

Some updates here: the main issue identified seems to have been caused by another image using port 19999. Not sure whether this is something that can be surfaced to the user.
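As an illustration of how such a conflict could be detected before installing, here is a minimal sketch that tries to bind the port locally. The function name `port_is_free` is hypothetical and is not part of the installer; 19999 is simply the port named above.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if host:port can be bound, i.e. nothing else is listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
        return True


# Example (19999 is the port mentioned in this thread):
# port_is_free(19999)
```

A False result only says that something already holds the port; it does not say which image or process that is.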

After stopping that other image the installation went ahead, but the node wasn't successfully claimed to Netdata Cloud because it couldn't reach api.netdata.cloud. The solution was to restart the PC, enter the Netdata image, and run the claiming script: netdata-claim.sh -token=<space-token>

The node was claimed, as can be seen in the image below, but I'm not able to get the node connected to Cloud; I get errors on ACLK image

image

hugovalente-pm commented 1 year ago

This seems to be related to the default DNS. Checking the content of /etc/resolv.conf, the nameserver is my IPv4 address image

I looked at another image, which has the following image

This was taken from https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution @dfpr is this something we need to consider while installing/setting up the image?
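To compare what two images actually use as resolvers, a small sketch that lists the nameserver entries from resolv.conf-style text may help. The function name `nameservers` is hypothetical, and the addresses in the example comment are made up for illustration, not taken from the screenshots above.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Lines starting with '#' or ';' are comments in resolv.conf.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            servers.append(fields[1])
    return servers


# Example with made-up content:
# nameservers("nameserver 172.22.0.1\nnameserver 8.8.8.8\n")
```

Inside the image this could be fed with the contents of /etc/resolv.conf read from disk.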

dfpr commented 1 year ago

@hugovalente-pm can you confirm it is just the DNS by pinging an IP address? Also, the wsl import now switches to wsl1 if it takes more than 2 minutes. And with the MSI argument WSL=1 that will be used. Both dns and import issues appear to be related directly to WSL and not the installer.

hugovalente-pm commented 1 year ago

@dfpr it was the DNS. I did some troubleshooting with some guys on Slack to help identify this; I'll add a summary here.

And with the MSI argument WSL=1 that will be used. Both dns and import issues appear to be related directly to WSL and not the installer.

Not sure if I follow here, if we install the Netdata image and there are some issues on the /etc/resolv.conf we can't solve them from this installation process. If that is the case can't we at least provide them with a tip to the article shared?

then I remembered I had seen and done this on the Ubuntu https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution

dfpr commented 1 year ago

I can't reproduce your issue. The askubuntu article mentions issues when connecting to a VPN, so I can't pinpoint the exact solution proposed there. I don't know why your Ubuntu points to Google DNS servers. If I put

[network]
generateResolvConf = false

in /etc/resolv.conf, DNS queries fail. Manually putting in the Google DNS server fixes it, but at startup the resolv.conf file is deleted. WSL can be affected by a lot of issues, and putting them all in the readme seems impractical.

Ferroin commented 1 year ago

@dfpr That needs to go in /etc/wsl.conf, not /etc/resolv.conf.
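For reference, the setting under discussion is the `generateResolvConf` key in the `[network]` section of /etc/wsl.conf, which WSL treats as enabled when it is absent. A minimal sketch for checking whether a given wsl.conf text disables it follows; the function name is hypothetical and this is not something the installer is known to do.

```python
import configparser


def resolv_conf_generation_enabled(wsl_conf_text: str) -> bool:
    """Return False only if [network] generateResolvConf is explicitly disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(wsl_conf_text)
    # configparser lower-cases option names, so the key lookup is case-insensitive.
    # A missing section or key falls back to True, matching the WSL default.
    return parser.getboolean("network", "generateResolvConf", fallback=True)


# Example:
# resolv_conf_generation_enabled("[network]\ngenerateResolvConf = false\n")
```

Putting the same two lines into /etc/resolv.conf instead has no such effect, since that file is not parsed as WSL configuration.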

dfpr commented 1 year ago

Sorry, typo. I did put the lines in the right place, but WSL deletes the file. Setting an immutable flag created a lot of issues for the docker host when building; I'll try after importing. Again, this is a WSL issue, not one coming from the installer.

dfpr commented 1 year ago

Latest commit should fix dns issue

cakrit commented 1 year ago

Now we can no longer access $(hostname).local @hugovalente-pm / @Ferroin The result is that the whole purpose of the installation is broken, as we get no wmi metrics (host unreachable).

Whatever the network issues were, resolv.conf was NOT the cause. I deleted /etc/wsl.conf and /etc/resolv.conf, restarted wsl and can happily access api.netdata.cloud, app.netdata.cloud AND $(hostname).local. @dfpr please revert this change.

dfpr commented 1 year ago

I have reverted the change.

cakrit commented 1 year ago

@hugovalente-pm try the latest version and let's try and figure out why claiming doesn't work in your case, without changing DNS again. The installer should be left as is IMO.

hugovalente-pm commented 1 year ago

sure, will try a fresh install tomorrow

hugovalente-pm commented 1 year ago

@cakrit I tried a fresh install and got this error, which I thought meant the node wasn't claimed to Cloud (I tried it twice to make sure I hadn't miscopied the token). The command I ran:

msiexec.exe /i netdata.msi TOKEN=<claim-token> URL=https://app.netdata.cloud

Looking at the log file at c:\netdata.log I saw that it was claimed, so I restarted the agent and now see the node as Unseen

Connection attempt 1 successful
uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.
The claim was successful but the agent could not be notified (0)- it requires a restart to connect to the cloud.
STARTING AGENT
ADDING NETDATA TO STARTUP

Looking at the error.log on the agent I get

image

Pinging api.netdata.cloud from the Linux image works OK; pinging app.netdata.cloud doesn't, but from the logs ACLK is trying to connect to api.netdata.cloud

image

dfpr commented 1 year ago

I can't ping either app.netdata.cloud or api.netdata.cloud from my host, not inside wsl. I also tried an online ping webpage and it couldn't ping them either.

thiagoftsm commented 1 year ago

Hello @dfpr ,

Like you I cannot ping:

bash-5.2$ nslookup app.netdata.cloud
Server:         192.168.1.1
Address:        192.168.1.1#53

Non-authoritative answer:
Name:   app.netdata.cloud
Address: 54.198.178.11
Name:   app.netdata.cloud
Address: 44.196.50.41
Name:   app.netdata.cloud
Address: 44.207.131.212
app.netdata.cloud       canonical name = main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.

bash-5.2$ ping -c 1 app.netdata.cloud
PING app.netdata.cloud (44.207.131.212) 56(84) bytes of data.

--- app.netdata.cloud ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

bash-5.2$ nslookup api.netdata.cloud
Server:         192.168.1.1
Address:        192.168.1.1#53

Non-authoritative answer:
Name:   api.netdata.cloud
Address: 54.198.178.11
Name:   api.netdata.cloud
Address: 44.196.50.41
Name:   api.netdata.cloud
Address: 44.207.131.212
api.netdata.cloud       canonical name = main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.

bash-5.2$ ping -c 1 api.netdata.cloud
PING api.netdata.cloud (44.207.131.212) 56(84) bytes of data.

--- api.netdata.cloud ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

, but I can access host https://app.netdata.cloud. Can you at least access it?

cakrit commented 1 year ago

Don't try ping, try wget or curl. ICMP isn't necessarily enabled for servers nowadays.
@hugovalente-pm get Timo to help with the debugging. Get on a call with him and he'll figure out what's happening for sure.
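Along the same lines, a TCP connect to port 443 is closer to what an HTTPS client does than an ICMP ping. A minimal sketch, with a hypothetical function name and an arbitrary 5-second timeout:

```python
import socket


def tcp_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


# Example: tcp_reachable("api.netdata.cloud")
```

This only shows that the port accepts connections; it does not complete a TLS handshake or an HTTP request, so wget or curl remain the fuller test.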

dfpr commented 1 year ago

HTTPS works for me.

underhood commented 1 year ago

Where it seems to fail is in netdata's connect_to_this_ip46. The question, of course, is why.

hugovalente-pm commented 1 year ago

With @underhood we were able to rule out DNS as the issue, since we got an IP resolution (whether it's the right one or not we aren't sure). But the issue was reproduced by emulating the HTTP call like this: https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE] and we weren't able to get a response from inside WSL, but we got one from my local computer.

@underhood will try to get his setup installed with WSL 2 (for some reason it got WSL 1) to further investigate this network issue, which could be a config problem or a bug in WSL 2

underhood commented 1 year ago

Basically, to summarize: the agent fails in connect_to_this_ip46 while attempting an HTTPS GET call as follows (replace things in [] with your data): https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE]

We tried to do wget https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE] on the affected machine (as this is the exact thing the agent tries to do when that error appears) and it could not connect either. The same command works on other machines (gets a response from cloud).
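For anyone repeating this test, a small sketch that fills in the placeholders of that URL follows. The function name and the values in the example comment are hypothetical; only the URL shape comes from the description above.

```python
from urllib.parse import urlencode


def build_env_url(agent_version: str, claim_id: str) -> str:
    """Build the /api/v1/env URL described above from an agent version and claim ID."""
    # The version is sent without its leading "v".
    if agent_version.startswith("v"):
        agent_version = agent_version[1:]
    query = urlencode(
        {"v": agent_version, "cap": "proto,ctx", "claim_id": claim_id},
        safe=",",  # keep the comma in "proto,ctx" unescaped, as in the thread
    )
    return "https://api.netdata.cloud/api/v1/env?" + query


# Example with made-up values:
# build_env_url("v1.37.0", "abc-123")
```

When passing the resulting URL to wget or curl from a shell, wrap it in quotes; an unquoted & ends the command and sends it to the background, so the remaining parameters would never reach the server.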

Therefore, as wget seems to have the same issue, I consider this to be a network configuration issue or a bug in WSL2, not something specific to netdata.

I also tried the msi installer in Win 11 in VirtualBox with WSL1 and cloud connection was working OK. I will try to figure out why WSL2 version was not used and try to see if WSL2 version will have the aforementioned issue.

cakrit commented 1 year ago

This is all very strange. Were you able to duplicate elsewhere @underhood? I have a different windows machine (laptop) I can try too

cakrit commented 1 year ago

Never mind: on Windows 10 it says it couldn't install with WSL 2 and it's reverting to WSL 1. So I can't do the test.

underhood commented 1 year ago

For some reason I can't get WSL2 to work in a VM despite trying 100 things :/

cakrit commented 1 year ago

WSL2 in general doesn't work in VMs; the installer should default to WSL1. We have a closed issue on this.

I may have replicated the network issue on my WSL2 on a laptop I have with me. The PC at home had Win 11 and worked great, this one just doesn't want to work with app.netdata.cloud for some reason. I'll see if anything from above will help.

hugovalente-pm commented 1 year ago

@cakrit mine is WSL 2 and I bumped into this domain resolution fix https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution

cakrit commented 1 year ago

Yes, that works, but there should be a way to properly resolve app.netdata.cloud without losing the capability to reach the windows host via $(hostname).local

From https://superuser.com/questions/1714002/wsl2-connect-to-host-without-disabling-the-windows-firewall I got the idea to exclude the interface from the windows firewall (see screenshot below) and that at least got rid of the message ** server can't find app.netdata.cloud: REFUSED

I now get the following:

DESKTOP-KQ81AL4:/mnt/c/Windows/system32# nslookup app.netdata.cloud
;; connection timed out; no servers could be reached

image

cakrit commented 1 year ago

Never mind, I tried to get the rest of it working and followed some instructions in https://gist.github.com/sivinnguyen/8bc0125b274250683a97e149cf270040 to run some powershell commands in admin mode and reboot. After the reboot I saw a new IP in resolv.conf, but the firewall is again blocking the connection, and doing the same thing (unchecking WSL from the protected network connections) makes no difference.

I give up. This is clearly a shoddy implementation that only works occasionally. I have no idea how to get both the name resolution to work AND to get a URL that will let us access the windows_exporter metrics from inside WSL. The moment we change resolv.conf, we lose access to the /metrics endpoint and I have no idea how we can get to it. I found somewhere that if you type ip route, then the via that appears is the IP you can use to reach the windows host, but it didn't work.
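For anyone who wants to retry the ip route idea, here is a minimal sketch that pulls the via address out of that command's output. The function name is hypothetical and the sample line in the comment is made up; as noted above, the address obtained this way did not work in this case, so the sketch only automates the lookup.

```python
from typing import Optional


def default_gateway(ip_route_output: str) -> Optional[str]:
    """Return the 'via' address of the default route, or None if there is none."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "via" in fields:
            idx = fields.index("via")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


# Example with a made-up line:
# default_gateway("default via 172.22.0.1 dev eth0")
```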

If you can find a solution gents, let me know, but it needs to both allow claiming and show the metrics, not just one or the other. At this point, I'm even considering hard-coding an IP in /etc/hosts as a workaround, which is basically the same as accepting defeat.