Open hugovalente-pm opened 1 year ago
On the logfile netdata.log I see
Possibly https://github.com/microsoft/WSL/issues/8714 Please attach WSL logs using these instructions https://github.com/Microsoft/WSL/blob/master/CONTRIBUTING.md#8-detailed-logs
Could you share your e-mail with me? I'm not sure these logs are safe to share publicly. My email: hugo@netdata.cloud
Sent you an email.
Some updates here: the main issue identified seemed to be caused by another image already using port 19999. I'm not sure this is an issue that can be surfaced to the user.
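A rough way to surface the port conflict before installing is to try a TCP connect to the port. This is only a sketch: the helper name `port_in_use` is made up for illustration, and it assumes the other image is reachable on 127.0.0.1.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


# 19999 is the port mentioned in this thread
if port_in_use(19999):
    print("port 19999 is already taken by another process or image")
else:
    print("nothing is listening on port 19999")
```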
After stopping that other image the installation went ahead, but the node wasn't successfully claimed to Netdata Cloud because it couldn't reach api.netdata.cloud.
The solution was to restart the PC, enter the Netdata image, and run the claiming script netdata-claim.sh -token=<space-token>
The node was claimed, as can be seen in the image below, but I'm not able to get the node connected to Cloud; I get errors on ACLK.
This seems to be related to the default DNS. Checking the content of /etc/resolv.conf,
where nameserver is my IPv4 address,
I looked at another image that has the following
This was taken from https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution @dfpr is this something we need to consider while installing/setting up the image?
@hugovalente-pm can you confirm it is just the DNS by pinging an IP address? Also, the wsl import now switches to wsl1 if it takes more than 2 minutes. And with the MSI argument WSL=1 that will be used. Both dns and import issues appear to be related directly to WSL and not the installer.
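One way to separate "DNS is broken" from "the network path is broken" is to run the name lookup on its own and look at the result before trying any connection. A minimal sketch, assuming Python is available inside the image; `resolve_ipv4` and `diagnose` are hypothetical helper names, not part of Netdata.

```python
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def diagnose(hostname: str, resolver: Callable[[str], List[str]] = resolve_ipv4) -> str:
    """Report whether the name lookup itself fails or succeeds."""
    try:
        addresses = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"DNS lookup failed for {hostname}: {exc}"
    return f"{hostname} resolves to {', '.join(addresses)}; DNS works, check connectivity next"
```

Calling `diagnose("api.netdata.cloud")` inside the affected image would show which side of the problem you are on; the `resolver` argument only exists so the function can be exercised without network access.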
@dfpr it was the DNS. I did some troubleshooting with some guys on Slack to help identify this; I'll add a summary here.
"And with the MSI argument WSL=1 that will be used. Both dns and import issues appear to be related directly to WSL and not the installer."
I'm not sure I follow here. If we install the Netdata image and there are issues with /etc/resolv.conf,
we can't solve them from this installation process? If that is the case, can't we at least give users a tip pointing to the article shared?
In the Netdata image everything seems OK in terms of name resolution, but as soon as I try to curl an endpoint it gets stuck.
In another image (Ubuntu) I'm able to have a node connected to Cloud, and the curls return a response.
For one of the guys it resolves to a different IP, 44.207.131.212.
execution of curl
with -vvv
result of dig +trace api.netdata.cloud
hugo-pc:/mnt/c/Users/hugoj# dig +trace api.netdata.cloud
; <<>> DiG 9.16.33 <<>> +trace api.netdata.cloud
;; global options: +cmd
;; connection timed out; no servers could be reached
result of dig +trace @8.8.8.8 api.netdata.cloud
content of /etc/resolv.conf
on Netdata image
content of /etc/resolv.conf
on Ubuntu image
Then I remembered I had seen and applied this on the Ubuntu image: https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution
I can't reproduce your issue. The askubuntu article mentions issues when connecting a VPN, so I can't pinpoint the exact solution proposed there. I don't know why your Ubuntu points to Google DNS servers. If I put
[network]
generateResolvConf = false
in /etc/resolv.conf
DNS queries fail. Manually adding the Google DNS server fixes it, but at startup the resolv.conf file is deleted. WSL can be affected by a lot of issues, and putting them all in the readme seems impractical.
@dfpr That needs to go in /etc/wsl.conf, not /etc/resolv.conf.
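To make the split between the two files explicit: the `[network]` section belongs in /etc/wsl.conf, while /etc/resolv.conf holds the `nameserver` lines. The sketch below only renders the two file bodies as strings and does not write anything under /etc. The function names are invented for illustration, and 8.8.8.8 is simply the Google resolver already used earlier in this thread. Later comments report that this workaround breaks access to $(hostname).local, so treat it as a description of the workaround, not a recommendation.

```python
import ipaddress
from typing import Iterable


def render_wsl_conf() -> str:
    """Body for /etc/wsl.conf that stops WSL from regenerating resolv.conf."""
    return "[network]\ngenerateResolvConf = false\n"


def render_resolv_conf(nameservers: Iterable[str]) -> str:
    """Body for /etc/resolv.conf with one nameserver line per address."""
    lines = []
    for server in nameservers:
        ipaddress.ip_address(server)  # raises ValueError on a malformed address
        lines.append(f"nameserver {server}\n")
    return "".join(lines)


print(render_wsl_conf())
print(render_resolv_conf(["8.8.8.8"]))
```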
Sorry, typo. I did put the lines in the right place, but WSL deletes the file. Setting an immutable flag created a lot of issues for the docker host when building; I'll try after importing. Again, this is a WSL issue, not one coming from the installer.
Latest commit should fix dns issue
Now we can no longer access $(hostname).local
@hugovalente-pm / @Ferroin
The result is that the whole purpose of the installation is broken, as we get no wmi metrics (host unreachable).
Whatever the network issues were, resolv.conf was NOT the cause. I deleted /etc/wsl.conf and /etc/resolv.conf, restarted wsl and can happily access api.netdata.cloud, app.netdata.cloud AND $(hostname).local. @dfpr please revert this change.
I have reverted the change.
@hugovalente-pm try the latest version and let's try and figure out why claiming doesn't work in your case, without changing DNS again. The installer should be left as is IMO.
sure, will try a fresh install tomorrow
@cakrit I was trying a fresh install and got this error, which I thought meant the node wasn't claimed to Cloud (I tried it twice to make sure I hadn't miscopied the token). The command I ran:
msiexec.exe /i netdata.msi TOKEN=<claim-token> URL=https://app.netdata.cloud
Looking at the log file at c:\netdata.log
I saw that it was claimed, so I restarted the agent and now see the node as Unseen
Connection attempt 1 successful
uv_pipe_connect(): no such file or directory
Make sure the netdata service is running.
The claim was successful but the agent could not be notified (0)- it requires a restart to connect to the cloud.
STARTING AGENT
ADDING NETDATA TO STARTUP
Looking at the error.log on the agent I get
Pinging api.netdata.cloud from the Linux image works OK; pinging app.netdata.cloud doesn't, but from the logs ACLK is trying to connect to api.netdata.cloud
I can't ping either app.netdata.cloud or api.netdata.cloud from my host, not inside wsl. I also tried an online ping webpage and it couldn't ping them either.
Hello @dfpr ,
Like you I cannot ping:
bash-5.2$ nslookup app.netdata.cloud
Server: 192.168.1.1
Address: 192.168.1.1#53
Non-authoritative answer:
Name: app.netdata.cloud
Address: 54.198.178.11
Name: app.netdata.cloud
Address: 44.196.50.41
Name: app.netdata.cloud
Address: 44.207.131.212
app.netdata.cloud canonical name = main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.
bash-5.2$ ping -c 1 app.netdata.cloud
PING app.netdata.cloud (44.207.131.212) 56(84) bytes of data.
--- app.netdata.cloud ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
bash-5.2$ nslookup api.netdata.cloud
Server: 192.168.1.1
Address: 192.168.1.1#53
Non-authoritative answer:
Name: api.netdata.cloud
Address: 54.198.178.11
Name: api.netdata.cloud
Address: 44.196.50.41
Name: api.netdata.cloud
Address: 44.207.131.212
api.netdata.cloud canonical name = main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com.
bash-5.2$ ping -c 1 api.netdata.cloud
PING api.netdata.cloud (44.207.131.212) 56(84) bytes of data.
--- api.netdata.cloud ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
but I can access the host https://app.netdata.cloud. Can you at least access it?
Don't try ping, try wget or curl. ICMP isn't necessarily enabled for servers nowadays.
@hugovalente-pm get Timo to help with the debugging. Get on a call with him and he'll figure out what's happening for sure.
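If neither wget nor curl is installed in the image, the same "does HTTP(S) answer at all" check can be sketched with Python's standard library. `http_status` is a hypothetical helper name; it returns the HTTP status code when the server answers (even with an error status) and None when no connection could be made.

```python
import urllib.error
import urllib.request
from typing import Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a 2xx/3xx status
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS problem, ...
        return None
```

For example, `http_status("https://app.netdata.cloud")` returning a number means the host is reachable over HTTPS even though ping gets no reply.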
HTTPS works for me.
where it seems to fail is at netdata connect_to_this_ip46
question of course is why
With @underhood we were able to rule out DNS as the issue, since we got an IP resolution (whether it is the right one or not, we aren't sure), but the issue was reproduced by emulating the HTTP call, like this:
https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE]
and we weren't able to get a response from inside WSL, but we did get one from my local computer.
@underhood will try to get this setup installed with WSL 2 (for some reason it got WSL 1) to further investigate this network issue, which could be a config problem or a bug in WSL 2.
Basically, to summarize: the agent fails in connect_to_this_ip46 while attempting the following HTTPS GET call (replace the things in [] with your data):
https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE]
We tried wget https://api.netdata.cloud/api/v1/env?v=[NETDATAVERSIONWITHOUTvINBEGGINING]&cap=proto,ctx&claim_id=[CLAIMIDHERE]
on the affected machine (as this is the exact thing the agent tries to do when that error appears) and it could not connect either; the same command works on other machines (it gets a response from cloud).
Therefore, as wget seems to have the same issue, I consider this to be a network configuration issue or a bug in WSL2, not something specific to netdata.
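For anyone who wants to repeat the test, here is a small sketch that builds the URL from the template above. The endpoint and query parameters are the ones quoted in this thread; the function name `build_env_url` and the sample version and claim id in the usage line are invented for illustration.

```python
from urllib.parse import urlencode

ENV_ENDPOINT = "https://api.netdata.cloud/api/v1/env"


def build_env_url(version: str, claim_id: str) -> str:
    """Build the env URL from the thread's template, dropping a leading 'v' from version."""
    if version.startswith("v"):
        version = version[1:]
    # safe="," keeps the comma in cap=proto,ctx instead of encoding it as %2C
    query = urlencode(
        {"v": version, "cap": "proto,ctx", "claim_id": claim_id}, safe=","
    )
    return f"{ENV_ENDPOINT}?{query}"


print(build_env_url("v1.0.0", "example-claim-id"))
```

When pasting the resulting URL into a shell for wget or curl, wrap it in single quotes; an unquoted & ends the command and runs it in the background, so the remaining parameters are never sent.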
I also tried the msi installer in Win 11 in VirtualBox with WSL1 and cloud connection was working OK. I will try to figure out why WSL2 version was not used and try to see if WSL2 version will have the aforementioned issue.
This is all very strange. Were you able to duplicate elsewhere @underhood? I have a different windows machine (laptop) I can try too
Never mind, on Windows 10 it says it couldn't install with WSL 2 and it's reverting to WSL 1. So I can't do the test.
For some reason I can't make WSL2 work in a VM despite trying 100 things :/
WSL2 in general doesn't work in VMs; the installer should default to WSL1. We have a closed issue on this.
I may have replicated the network issue on my WSL2 on a laptop I have with me. The PC at home had Win 11 and worked great, this one just doesn't want to work with app.netdata.cloud for some reason. I'll see if anything from above will help.
@cakrit mine is WSL 2 and I bumped into this domain resolution fix https://askubuntu.com/questions/1403886/how-to-fix-wsl-domain-resolution
Yes, that works, but there should be a way to properly resolve app.netdata.cloud without losing the capability to reach the windows host via $(hostname).local
From https://superuser.com/questions/1714002/wsl2-connect-to-host-without-disabling-the-windows-firewall I got the idea to exclude the interface from the windows firewall (see screenshot below) and that at least got rid of the message
** server can't find app.netdata.cloud: REFUSED
I now get the following:
DESKTOP-KQ81AL4:/mnt/c/Windows/system32# nslookup app.netdata.cloud
;; connection timed out; no servers could be reached
Never mind, I tried to get the rest of it working and followed some instructions in https://gist.github.com/sivinnguyen/8bc0125b274250683a97e149cf270040 to run some PowerShell commands in admin mode and reboot. After the reboot I saw a new IP in resolv.conf, but the firewall was again blocking the connection, and doing the same thing (unchecking WSL from the protected network connections) makes no difference.
I give up. This is clearly a shoddy implementation that only works occasionally. I have no idea how to get both the name resolution to work AND to get a URL that will let us access the windows_exporter metrics from inside WSL. The moment we change resolv.conf, we lose access to the /metrics endpoint and I have no idea how we can get to it. I found somewhere that if you type ip route, then the via that appears is the IP you can use to reach the windows host, but it didn't work.
If you can find a solution gents, let me know, but it needs to both allow claiming and show the metrics, not just one or the other. At this point, I'm even considering hard-coding an IP in /etc/hosts as a workaround, which is basically the same as accepting defeat.
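For completeness, the `ip route` idea mentioned above can be sketched as a small parser that pulls the `via` address out of the default route. The comment above reports that using this address did not work on that machine, so this only shows how to extract it, not that it solves the problem. `default_gateway` is a hypothetical helper and the sample output is made up.

```python
from typing import Optional


def default_gateway(ip_route_output: str) -> Optional[str]:
    """Return the address after 'via' on the default route line, if there is one."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "via" in fields:
            index = fields.index("via")
            if index + 1 < len(fields):
                return fields[index + 1]
    return None


# Made-up sample of what `ip route` might print inside a WSL 2 image
sample = (
    "default via 172.20.0.1 dev eth0\n"
    "172.20.0.0/20 dev eth0 proto kernel scope link src 172.20.5.3\n"
)
print(default_gateway(sample))
```

On a real system the input could come from `subprocess.run(["ip", "route"], capture_output=True, text=True).stdout`.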
Was trying to install netdata and I guess this is the release used https://github.com/netdata/msi-installer/releases/tag/2022-11-09133750 (based on date/time)
Steps:
netdata.msi installation, but I had the wmi_exporter already installed
msiexec.exe /i C:\Users\hugoj\netdata\netdata.msi TOKEN=<space-token> URL=https://app.netdata.cloud