microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

Error when using terraform inside WSL2 #8022

Closed b1ackhawk-uh60 closed 1 month ago

b1ackhawk-uh60 commented 2 years ago

Version

Microsoft Windows [Version 10.0.19044.1503]

WSL Version

WSL 2

Kernel Version

Kernel version: 5.10.60.1

Distro Version

Ubuntu 20.04

Other Software

Terraform v1.1.5

Repro Steps

Run terraform refresh or any command that is doing a refresh (like plan/apply) in WSL2.

Expected Behavior

terraform refresh should complete as usual. This was working as of 3 days ago.

Actual Behavior

Refresh does not complete and an error message is presented:

│ Error: Unable to list provider registration status, it is possible that this is due to invalid credentials or the service principal does not have permission to use the Resource Manager API, Azure error: resources.ProvidersClient#List: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/{my-subscription-id}/providers?api-version=2016-02-01": dial tcp: lookup management.azure.com on 172.30.96.1:53: cannot unmarshal DNS message
│
│   with provider["registry.terraform.io/hashicorp/azurerm"],
│   on main.tf line 10, in provider "azurerm":
│   10: provider "azurerm" {

This happens consistently, but only under the following conditions:

- Running in WSL2 with my primary ISP (Xfinity), connected via either WiFi or Ethernet

What did not help:

- I've tried swapping out my router for a different make/model: the issue persists
- I've tested on another computer, also outfitted with WSL2 (but running Ubuntu 18.04): the issue persists
- I've tested using different DNS providers: the issue persists

Also note that other tools (like the Azure CLI) seem to work fine from WSL2, and DNS for management.azure.com resolves fine (nslookup provides the expected results).

Conditions where the issue does not occur and terraform operates normally:

- If I simply convert WSL2 to WSL1
- If I run terraform from Windows (on the same machine) instead of WSL2
- If I connect my computer via WiFi to my phone's wireless hotspot
- If I connect to a VPN in Windows

So it seems to be some combination of WSL2 and my ISP.

Diagnostic Logs

No response

sirredbeard commented 2 years ago

172.30.96.1 looks like an IP address assigned to the WSL2 instance, not a remote IP on Azure. I wonder why it is resolving management.azure.com to your local IP address.

Any change when you sudo rm /etc/resolv.conf, wsl.exe --shutdown, and restart WSL?

Or set generateResolvConf = false in your /etc/wsl.conf file, manually enter a non-ISP DNS server, e.g. 1.1.1.1 or 8.8.8.8, and restart as above?

Connecting to a VPN can cause issues. I wonder if connecting and then disconnecting from the VPN left your DNS in a broken state.

b1ackhawk-uh60 commented 2 years ago

@sirredbeard thanks for the response. Edit: deleting resolv.conf and allowing WSL to recreate it did not change anything. However, turning off resolv.conf generation and manually creating my own does work as a workaround.

To clarify, this machine did not have any VPN configured previously. I only installed my preferred VPN software and configured as a troubleshooting step because of the issue I was having. Also, this issue started happening on two different machines on the same day. They both worked fine previously.

I believe 172.30.96.1 is the gateway that WSL2 is NAT'd behind.

Also note that other tools (like the Azure CLI) seem to work fine from WSL2, and DNS for management.azure.com resolves fine (nslookup provides the expected results).

Otimun commented 2 years ago

I can confirm I have the exact same issue. As mentioned, you can work around the error by editing the /etc/resolv.conf file and changing the nameserver to 1.1.1.1 or 8.8.8.8 instead of the IP address of your machine (in my case 172.18.176.1). But something has changed over the last few days that has broken terraform's DNS resolution.

│ Error: Unable to list provider registration status, it is possible that this is due to invalid credentials or the service principal does not have permission to use the Resource Manager API, Azure error: resources.ProvidersClient#List: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/(my-subscription id)/providers?api-version=2016-02-01": dial tcp: lookup management.azure.com on 172.18.176.1:53: cannot unmarshal DNS message
│
│   with provider["registry.terraform.io/hashicorp/azurerm"],
│   on main.tf line 28, in provider "azurerm":
│   28: provider "azurerm" {
│

A normal nslookup or dig still works when using 172.18.176.1 as a name server.

Ethernet adapter vEthernet (WSL):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::c87d:1b12:8bd:318%81
   IPv4 Address. . . . . . . . . . . : 172.18.176.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

The IP mentioned is the address of the WSL adapter on Windows.

pduchnovsky commented 2 years ago

Same problem here. Everything was fine yesterday; today I get the error "cannot unmarshal DNS message". I don't see any Windows updates in the past 24 hours, which is weird.

sebastiansterk commented 2 years ago

Having the exact same problem. This is a huge blocker for us.

sebastiansterk commented 2 years ago

I was able to fix it (at least a workaround):

1. Turn off generation of /etc/resolv.conf

Using your Linux prompt, open /etc/wsl.conf and paste the following content

[network]
generateResolvConf = false

2. Restart WSL

In Powershell run:

wsl --shutdown

3. Create a custom /etc/resolv.conf

Delete the /etc/resolv.conf:

rm -f /etc/resolv.conf

Create a new resolv.conf with the following content

nameserver 8.8.8.8

4. Restart WSL

In Powershell run:

wsl --shutdown

Open WSL --> issue is fixed (at least for me)
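
The steps above can be collected into a small script. This is only a sketch: it assumes the usual /etc/wsl.conf and /etc/resolv.conf locations and that /etc/wsl.conf does not already set generateResolvConf. The function names and the path arguments are made up for illustration (the path arguments also allow a dry run against scratch files).

```shell
#!/usr/bin/env bash
# Sketch only: function names and path arguments are illustrative, not part of WSL.

# Step 1: tell WSL to stop regenerating resolv.conf.
# Refuses to touch a file that already sets generateResolvConf.
disable_resolv_generation() {
  local wsl_conf="$1"
  if grep -qs '^generateResolvConf' "$wsl_conf"; then
    echo "generateResolvConf already set in $wsl_conf; edit it by hand" >&2
    return 1
  fi
  # Leading newline guards against a file that lacks a trailing newline.
  printf '\n[network]\ngenerateResolvConf = false\n' >> "$wsl_conf"
}

# Step 3: replace resolv.conf (possibly a symlink) with a static file.
write_static_resolv_conf() {
  local resolv_conf="$1"
  local nameserver="$2"
  rm -f "$resolv_conf"
  printf 'nameserver %s\n' "$nameserver" > "$resolv_conf"
}

# Intended use inside the distro (as root), with `wsl --shutdown` run from
# PowerShell after each step, as described above:
#   disable_resolv_generation /etc/wsl.conf
#   write_static_resolv_conf /etc/resolv.conf 8.8.8.8
```

Keeping the two steps as separate functions mirrors the restart between them; nothing here restarts WSL for you.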

melsigl commented 2 years ago

I had the same issue as of today, and I can confirm that the workaround proposed by @sebastiansterk did work splendidly.

0Downtime commented 2 years ago

I have also been having this issue since midday yesterday, while working on some terraform code. Does anyone have any idea what the root cause is? @sebastiansterk your fix also worked for me, thanks!

I don't know if anyone else can confirm this, but my firewall's DNS is pointed to 1.1.1.1 with forced DoT. Not sure if that is a contributing factor?

cheeseburger12 commented 2 years ago

thank you @sebastiansterk . Your workaround worked for me. I have been searching all day

hyzza commented 2 years ago

This particular workaround poses a problem for those who need to use a VPN in Windows and resolve internal VPN addresses from WSL Linux. dnsmasq could solve this by routing requests according to domain as needed in WSL, but that is quite a heavyweight solution.

hyzza commented 2 years ago

A colleague of mine has found a pretty elegant solution to this:

echo -e "nameserver IP.OF.DNS.SERVER\ntimeout: 1" >> /etc/resolv.conf

where IP.OF.DNS.SERVER is the IP of a DNS server that allows DNS resolution over TCP, 8.8.8.8 for example

or adding to /etc/wsl.conf

[boot]
command="echo \"nameserver IP.OF.DNS.SERVER\ntimeout: 1\" >> /etc/resolv.conf"

This way the worst-case scenario is a 1s delay when DNS resolution over TCP does not succeed via the primary (Windows) DNS.
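
A variant of the append approach above, written as a sketch. It assumes the `options timeout:n` spelling documented in resolv.conf(5) rather than the bare `timeout: 1` line, which resolvers that follow the man page will most likely ignore. The function name and the file argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: append a fallback nameserver once, with a 1-second timeout.
# add_fallback_nameserver is an illustrative name, not an existing tool.
add_fallback_nameserver() {
  local resolv_conf="$1"
  local fallback_ip="$2"
  # Skip if this nameserver is already listed, so repeated runs do not pile up lines.
  if grep -qsxF "nameserver ${fallback_ip}" "$resolv_conf"; then
    return 0
  fi
  # Assumption: "options timeout:1" is the resolv.conf(5) form of a 1s timeout.
  printf 'nameserver %s\noptions timeout:1\n' "$fallback_ip" >> "$resolv_conf"
}

# Intended use (as root), e.g. with 8.8.8.8 as the fallback:
#   add_fallback_nameserver /etc/resolv.conf 8.8.8.8
```

Because the entry is appended, the Windows-provided nameserver stays first and the fallback is only consulted when the first one fails.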

KaremCBC commented 2 years ago

I've been having the same problem since yesterday, but unfortunately the solution from @sebastiansterk didn't work for me on 3 separate WSL2 installs. Please help!

Update: an az logout / login was needed for the solution to work! Thanks @sebastiansterk

bernardmaltais commented 2 years ago

I also had the issue. Changing the DNS to 8.8.8.8 solved it. It was driving me nuts.

mohamed-elbeltagy commented 2 years ago

I have the exact same issue, setting DNS to 8.8.8.8 fixed it.

simonesavi commented 2 years ago

Same problem. Setting DNS to 8.8.8.8 fixed it, but I can confirm that DNS resolution over the VPN stops working.

vladimir-shopov commented 2 years ago

Changing the DNS server to Google's is not a solution, but a workaround. There are times when you need to use a private DNS server.

This seems to be yet another side effect of #5806. Wondering when Microsoft will finally understand the huge impact this particular bug has on all WSL2 users and fix it.

msbenz commented 2 years ago

Same problem

Super jank (and very temporary) workaround until there's a true fix: grab an IP for management.azure.com and add an entry to /etc/hosts (in my case, it's currently 40.71.13.226)

echo "$(dig management.azure.com | grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}$") management.azure.com" >> /etc/hosts

Maintains all other DNS configs etc and allows terraform to auth/deploy.
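
A slightly more defensive version of the same idea, as a sketch: it replaces any earlier pin instead of appending a new line each time, and refuses to write an empty address. The function names are illustrative, and `dig +short` is assumed to be available in the distro.

```shell
#!/usr/bin/env bash
# Sketch only: pin one hostname in a hosts file, replacing any earlier pin.
# first_ipv4 and pin_host are illustrative names.

# Print the first IPv4 address from `dig +short`-style output on stdin.
# CNAME lines are skipped because they are not dotted quads.
first_ipv4() {
  grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3}$' | head -n 1
}

pin_host() {
  local hosts_file="$1"
  local host="$2"
  local ip="$3"
  [ -n "$ip" ] || { echo "no address for $host" >&2; return 1; }
  local tmp
  tmp="$(mktemp)" || return 1
  # Drop previous lines ending in the hostname, then add the fresh entry.
  # (The dots in the hostname act as regex wildcards; good enough for a sketch.)
  grep -v -E "[[:space:]]${host}\$" "$hosts_file" > "$tmp"
  printf '%s %s\n' "$ip" "$host" >> "$tmp"
  # Overwrite in place so the file keeps its owner and permissions.
  cat "$tmp" > "$hosts_file"
  rm -f "$tmp"
}

# Intended use (as root):
#   pin_host /etc/hosts management.azure.com \
#     "$(dig +short management.azure.com | first_ipv4)"
```

The pinned address goes stale whenever Azure rotates it, so this stays a temporary measure just like the one-liner.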

pduchnovsky commented 2 years ago

This works as a workaround that is persistent

Run this and restart wsl using powershell.exe wsl --shutdown directly from wsl.

This will ensure that 9.9.9.9 nameserver will be added to /etc/resolv.conf and change dns timeout to 1 second.

Fully automatic and does not break connections over a VPN on Windows.

sudo bash -c "cat >> /etc/wsl.conf <<EOF
[boot]
command = printf 'nameserver 9.9.9.9\ntimeout: 1' >> /etc/resolv.conf
EOF"

martinjoshua commented 2 years ago

Also experiencing this issue today.

masonhuemmer commented 2 years ago

Same here. This has impacted our entire team.

ImIOImI commented 2 years ago

Disabling resolv.conf generation and using a public DNS server didn't work for me. I suspect this is because we define private endpoints to get to private resources while on the VPN and those addresses aren't resolved correctly when using a public server.

b1ackhawk-uh60 commented 2 years ago

> Disabling resolv.conf generation and using a public DNS server didn't work for me. I suspect this is because we define private endpoints to get to private resources while on the VPN and those addresses aren't resolved correctly when using a public server.

Using a public DNS server would prevent you from resolving DNS on a private network. Instead of using a public DNS server for name resolution, you can use the DNS server of your private network.

or

You could modify the hosts file in Windows with an entry for management.azure.com as mentioned here (thanks @AaronFriel for mentioning this issue there): https://github.com/golang/go/issues/51127#issuecomment-1035018244

bernardmaltais commented 2 years ago

Here is something that could help some. I added the following to my alias file:

sudo bash -c "sed -i '/management.azure.com/d' /etc/hosts" ; sudo bash -c 'echo "$(dig management.azure.com | grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}$") management.azure.com" >> /etc/hosts'

And simply call fixdns before running terraform commands.

sudo bash -c "cat >> /etc/wsl.conf <<EOF
[boot]
command = printf 'nameserver 9.9.9.9\ntimeout: 1' >> /etc/resolv.conf
EOF"

Works very well. Best workaround so far.

ImIOImI commented 2 years ago

> Here is something that could help some. I added the following to my alias file:
>
> sudo bash -c "sed -i '/management.azure.com/d' /etc/hosts" ; sudo bash -c 'echo "$(dig management.azure.com | grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}$") management.azure.com" >> /etc/hosts')
>
> And simply call fixdns before running terraform commands.
>
> sudo bash -c "cat >> /etc/wsl.conf <<EOF
> [boot]
> command = printf 'nameserver 9.9.9.9\ntimeout: 1' >> /etc/resolv.conf
> EOF"
>
> Works very well. Best workaround so far.

This helped a lot. I think you've got an extra ) at the end of your bash command, though.

bernardmaltais commented 2 years ago

> This helped a lot. I think you've got an extra ) at the end of your bash command, though

Thanks, fixed.

angelbulas commented 2 years ago

Also experiencing this issue since yesterday

jsmith-speedeon commented 2 years ago

This is really weird.

Using the azure cli (az login, az group list, etc.) all works fine with the default DNS stuff. But terraform plan fails as everyone else is reporting.

Setting it to either my local network DNS resolver or my VPN DNS resolver everything works fine. This was all being handled automatically before this week.

Wonder what changed to specifically cause terraform plan/apply/refresh to break during DNS resolution.

bernardmaltais commented 2 years ago

> Wonder what changed to specifically cause terraform plan/apply/refresh to break during DNS resolution.

This seems to be yet another side effect of https://github.com/microsoft/WSL/issues/5806. Wondering when Microsoft will finally understand the huge impact this particular bug has on all WSL2 users and fix it.

AaronFriel commented 2 years ago

@bernardmaltais I actually dug a bit deeper into this, and it appears that the Internet Connection Sharing DNS server does not use "message compression" (https://datatracker.ietf.org/doc/html/rfc1035#section-4.1.4) even when the upstream DNS server does. That causes the response size to be larger than the original, which isn't always correctly handled.

I don't want to declare mission accomplished too soon, but I'm now tracking https://github.com/golang/go/issues/51153 which may land as a fix in Go for 1.18 and backported to previous versions.
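
One way to observe the size difference described above is to compare the `MSG SIZE rcvd` line that dig prints for the WSL resolver and for an upstream resolver. A sketch, assuming dig is installed; the helper names are made up, and the 512-byte threshold (the classic UDP limit from RFC 1035) is a choice made here for illustration, not something confirmed in this thread as the exact failure condition.

```shell
#!/usr/bin/env bash
# Sketch only: report the size of a DNS response from dig output on stdin.
# dns_msg_size and exceeds_udp_limit are illustrative helper names.
dns_msg_size() {
  # dig ends its output with a line such as ";; MSG SIZE  rcvd: 123".
  sed -n 's/^;; MSG SIZE  *rcvd: *\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Exit status 0 when the response exceeds the classic 512-byte UDP limit.
exceeds_udp_limit() {
  local size
  size="$(dns_msg_size)"
  [ -n "$size" ] && [ "$size" -gt 512 ]
}

# Intended use, comparing the WSL gateway resolver with a public one:
#   dig management.azure.com @172.30.96.1 | dns_msg_size
#   dig management.azure.com @8.8.8.8     | dns_msg_size
```

If the first number is noticeably larger than the second for the same name, that is consistent with the uncompressed-response explanation.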

tenletters10 commented 2 years ago

I spent a few days troubleshooting this and identified that DNS with WSL2 was causing it. After that I found this thread. Same problem here. I need to use private DNS that comes from a VPN and public DNS resolution at the same time. @b1ackhawk-uh60 I agree with the comments you have shared so far: just using some public DNS server is not a valid solution.

AaronFriel commented 2 years ago

It looks like the Go team has a systemic fix slated for inclusion with 1.18 this month and the next point releases, but I can't speak to their release schedule.

https://github.com/golang/go/issues/51153#issuecomment-1040811353

rezarms commented 2 years ago

It started for me today, and the Azure CLI is working fine. I had to change the nameserver in /etc/resolv.conf.

HumanPrinter commented 2 years ago

> Here is something that could help some. I added the following to my alias file:
>
> sudo bash -c "sed -i '/management.azure.com/d' /etc/hosts" ; sudo bash -c 'echo "$(dig management.azure.com | grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}$") management.azure.com" >> /etc/hosts'
>
> And simply call fixdns before running terraform commands.
>
> sudo bash -c "cat >> /etc/wsl.conf <<EOF
> [boot]
> command = printf 'nameserver 9.9.9.9\ntimeout: 1' >> /etc/resolv.conf
> EOF"
>
> Works very well. Best workaround so far.

@bernardmaltais This command works like a charm and is less intrusive than changing the resolv.conf. However, I'm having some trouble adding this to my bash_aliases file. Could you please share your entry including any escaped characters?

rlees85 commented 2 years ago

Please be aware this does not just affect WSL! I have this problem on Linux as well. My DNS path also includes a step that goes via DoT/DoH, so I suspect this might be a common factor.

The workarounds posted 'work' but changing the DNS path in my opinion is not really a 'workaround' unless extremely desperate. The issue (that I suspect is with Go) needs proper attention

davidshen84 commented 2 years ago

Is it something new? I am pretty sure my TF scripts worked in my WSL2 environment before, until today...

rezarms commented 2 years ago

For me, setting generateResolvConf = false in /etc/wsl.conf didn't help. After a few hours it gets reset.

IskanderNovena commented 2 years ago

The Azure CLI has the same issue. When trying to log in with a Service Principal, I get an error stating that there are no subscriptions. When running the same command with the CLI in PowerShell, I get a normal response. Command used: az login --service-principal -u "<appId>" -p "<password>" --tenant "<tenantId>"

megakid commented 2 years ago

Same here.

kaancfidan commented 2 years ago

> This works as a workaround that is persistent. Run this and restart WSL with powershell.exe wsl --shutdown; this will automatically add 9.9.9.9 as an additional nameserver to /etc/resolv.conf and change the DNS timeout to 1 second. It is fully automatic and does not break connections over a VPN on Windows.
>
> sudo bash -c "cat >> /etc/wsl.conf <<EOF
> [boot]
> command = printf 'nameserver 9.9.9.9\ntimeout: 1' >> /etc/resolv.conf
> EOF"

Worked like a charm.

rezarms commented 2 years ago

The workaround doesn't work for me. If I add generateResolvConf = false, no file is created after shutting down and starting WSL, and if I remove the line the workaround doesn't do anything and I still get the autogenerated resolv.conf file.

moneygit commented 2 years ago

The sebastiansterk workaround worked for me. I had two machines, same Windows build and all: one working and one not.

epomatti commented 2 years ago

Changing to Google DNS fixed my issue as well.

Not the first time WSL2 default DNS gives me annoying issues.

timmyreilly commented 2 years ago

Just adding another wrinkle, it was working for me for a second, but then I logged in with a Service Principal az login --service-principal --username $ARM_CLIENT_ID --password $ARM_CLIENT_SECRET --tenant $ARM_TENANT_ID and it broke again.

ImIOImI commented 2 years ago

> @bernardmaltais This command works like a charm and is less intrusive than changing the resolv.conf. However, I'm having some trouble adding this to my bash_aliases file. Could you please share your entry including any escaped characters?

Obviously, I'm not Bernard, but here is the exact code I have in my .zshrc file. I didn't set it up as an alias.

fixdns() {
  command sudo bash -c "sed -i '/management.azure.com/d' /etc/hosts" ; sudo bash -c 'echo "$(dig management.azure.com | grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}$") management.azure.com" >> /etc/hosts'
}

williamohara commented 2 years ago

Confirmed that @sebastiansterk's solution worked for me. I was about ready to throw in the towel, give up technology altogether and move to the woods to live off the land. I had stepped away from coding my project for a time, and I guess a Windows update did it. Is there a ticket elsewhere tracking a resolution?

ndamulelonemakh commented 2 years ago

> I was able to fix it (at least a workaround):
>
> 1. Turn off generation of /etc/resolv.conf
>
> Using your Linux prompt, open /etc/wsl.conf and paste the following content
>
> [network]
> generateResolvConf = false
>
> 2. Restart WSL
>
> In Powershell run:
>
> wsl --shutdown
>
> 3. Create a custom /etc/resolv.conf
>
> Delete the /etc/resolv.conf:
>
> rm -f /etc/resolv.conf
>
> Create a new resolv.conf with the following content
>
> nameserver 8.8.8.8
>
> 4. Restart WSL
>
> In Powershell run:
>
> wsl --shutdown
>
> Open WSL --> issue is fixed (at least for me)

I can confirm that this worked for me as well. @sebastiansterk Thanks for saving me time:)

surlypants commented 2 years ago

The only "fix" I have found is to downgrade to WSL1. Every other suggestion has provided only temporary / non-persistent (if any) relief.

AaronFriel commented 2 years ago

@surlypants A recent build of terraform should fix this, but terraform providers will need to be built on a recent version of Go.

migldasilva commented 1 year ago

Every solution that uses the [boot] section of the /etc/wsl.conf file is available only on Windows 11 and Server 2022.

https://learn.microsoft.com/en-us/windows/wsl/wsl-config#boot-settings

newbenji commented 1 year ago

I'm using terragrunt in a Docker container in WSL2 and had the issue of terraform running slowly. Passing --dns=1.1.1.1 to the container made it fast again.