mozilla-mobile / mozilla-vpn-client

A fast, secure and easy to use VPN. Built by the makers of Firefox.
https://vpn.mozilla.org

VPN disconnects immediately and breaks internet connection on device until rebooted #6169

Closed data-sync-user closed 1 year ago

data-sync-user commented 1 year ago

Hi there,

This specific user is having an issue with the VPN on Windows 10, VPN version 2.13 which we are unable to resolve with basic troubleshooting:

They activate the VPN by moving the toggle to the right. The toggle changes color and a small timer icon floats above it for about 1 second, then the toggle moves back to the off position. There is no error message or any other indication from the VPN menu. Everything stops, and their PC loses its internet connection: they cannot access the internet with Firefox or any other browser, and the network connection icon at the bottom right of the screen shows the internet connection is off. The only way to restore the internet is to reboot the PC.

The app remains in off state:

[Screenshot: image-20230227-090010.png]

We have attempted the following steps with the customer:

Reinstall the app (signed out first)

Full reset of app by deleting the appdata folder (app was reinstalled with administrator rights)

Check for any system updates on Windows 10 - all up to date

Run the app as administrator

Check Windows Firewall and all third-party systems (antivirus, security software) to add an exclusion for the VPN; those same programs were also temporarily disabled completely

Check for other VPNs on device - not installed

They also temporarily uninstalled Malwarebytes

We tried another reset using the developer option of the app

We also checked whether the Mozilla VPN (broker) service is running in Task Manager (user feedback: The Mozilla VPN (broker) service was on and I restarted it; it was set to automatic. The Mozilla VPN (tunnel) service was not on and I started it; it was set to manual. All in all, I get the same result.)

Logs attached

[^mozillavpn-2023-2-14 (2).txt]

Thank you for your help!

┆Issue is synchronized with this Jira Bug ┆Reporter: Magdalena Schwaighofer

data-sync-user commented 1 year ago

➤ Santiago Andrigo commented:

This is a fairly bad experience as it seems to brick the device until reboot, so upping this to High, at least until we figure out the root cause.

Magdalena Schwaighofer Can you confirm the user is able to use the internet as normal if they don’t attempt to connect via the Mozilla VPN?

data-sync-user commented 1 year ago

➤ Magdalena Schwaighofer commented:

No problem, I will reach out to the user! It seems that two other customers have reported the same issue by now.

Juan Zapata and Gustavo Aguilera do you have further info on your cases?

data-sync-user commented 1 year ago

➤ Owen Kirby commented:

Digging through the logs, it seems that the Windows kernel is reporting that the tunnel interface doesn’t exist when trying to populate the routing table and returns ERROR_NOT_FOUND (1168). This seems to suggest that we couldn’t figure out the LUID of the tunnel device during service bring-up.

As a secondary issue, this means that our error handling in this case is insufficient and we leave the device in an unworkable state.

data-sync-user commented 1 year ago

➤ Betty Fleming commented:

Not considered a blocker for 2.14

data-sync-user commented 1 year ago

➤ Santiago Andrigo commented:

We are not considering this a blocker because a) support volume is low and b) as per Owen, we don’t think this is a regression in 2.13/2.14, so the support volume suggests the problem is not very prevalent. But we do want to keep it at High as the severity is pretty bad.

data-sync-user commented 1 year ago

➤ Lesley Norton commented:

Reminds me of https://mozilla-hub.atlassian.net/browse/VPN-1389

data-sync-user commented 1 year ago

➤ Magdalena Schwaighofer commented:

We have received logs from a user experiencing this problem on version 2.14

[^mozillavpn-2023-4-21_1.txt]

data-sync-user commented 1 year ago

➤ Magdalena Schwaighofer commented:

Another set of logs, from a user who first reported this on February 6th (on version 2.13). They updated to version 2.14 and the same problem still occurs.

[^mozillavpn-2023-4-25.txt]

data-sync-user commented 1 year ago

➤ Magdalena Schwaighofer commented:

Recent report of a customer experiencing this on 2.14. See the log file below.

[^mozillavpn-2023-5-16.txt]

data-sync-user commented 1 year ago

➤ Santiago Andrigo commented:

Basti do you have any thoughts on this one?

data-sync-user commented 1 year ago

➤ Basti commented:

That’s an “easy” one -

[15.05.2023 18:50:12.042] (WireguardUtilsWindows) Debug: Configuring peer XXXXXXXX via 68.235.44.2
[15.05.2023 18:50:12.043] (WireguardUtilsWindows) Debug: DATA: errno=0
[15.05.2023 18:50:12.043] (DnsUtilsWindows) Debug: Configuring DNS for MozillaVPN
[15.05.2023 18:50:12.044] (WireguardUtilsWindows) Error: Failed to create route to XXXXXXXX result: 1168

From the logs you can see we enable the kill switch - but we fail to edit the route table, so vpn traffic actually will happen. My guess is, if that error happens, we fail to disable the killswitch, which blocks all traffic until our handle is lost, which happens at a reboot.

The general question of “why the connection fails for this person” - no clue 😄
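
For triage, a log scan along the following lines could flag other affected users. This is only a minimal sketch and not part of the client: the regular expression is inferred from the single log excerpt above, `find_route_failures` is a hypothetical helper name, and the only error code it knows is 1168 (ERROR_NOT_FOUND), as noted earlier in this thread.

```python
import re

# Pattern inferred from the log excerpt in this thread; other log formats may differ.
ROUTE_FAILURE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] \(WireguardUtilsWindows\) Error: "
    r"Failed to create route to (?P<target>\S+) result: (?P<code>\d+)"
)

# Only the code discussed in this thread; 1168 is ERROR_NOT_FOUND on Windows.
KNOWN_CODES = {1168: "ERROR_NOT_FOUND"}


def find_route_failures(log_text):
    """Return (timestamp, target, code, name) for each route-creation failure."""
    failures = []
    for match in ROUTE_FAILURE.finditer(log_text):
        code = int(match.group("code"))
        failures.append(
            (
                match.group("timestamp"),
                match.group("target"),
                code,
                KNOWN_CODES.get(code, "unknown"),
            )
        )
    return failures


sample = (
    "[15.05.2023 18:50:12.043] (DnsUtilsWindows) Debug: Configuring DNS for MozillaVPN\n"
    "[15.05.2023 18:50:12.044] (WireguardUtilsWindows) Error: "
    "Failed to create route to XXXXXXXX result: 1168\n"
)
failures = find_route_failures(sample)
print(failures)
```

A hit on this pattern shortly after the firewall is enabled would match the failure sequence described above, though it says nothing about why the route could not be created.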

data-sync-user commented 1 year ago

➤ Santiago Andrigo commented:

Can we make this more fault tolerant and catch the error and undo the killswitch / tunnel?

data-sync-user commented 1 year ago

➤ Juan Zapata commented:

[^mozillavpn-2023-5-18-1.txt]

Attaching logs from another affected user

data-sync-user commented 1 year ago

➤ Valentina Virlics commented:

As QA was not able to reproduce this on previous versions, we cannot check the fix for this ticket.

data-sync-user commented 1 year ago

➤ Santiago Andrigo commented:

Basti Can you describe your fix here? Just curious. How would this be handled?

Marking this as Done and adding qa-not-actionable as a label. Hopefully if this creates regressions, we’ll notice in regression testing.

data-sync-user commented 1 year ago

➤ Basti commented:

Santiago Andrigo sure. The Windows daemon has multiple “components” we need to activate in order to get to a Windows connection:
-> Wintun (create a network device)
-> Wireguard (config that network device)
-> Firewall (make sure programs cannot self route)
-> Routing (tell Windows to route everything to that adapter)

The problem here is that we activate all of those components on activation and deactivate on deactivation. Now if one of those components detects an error, we abort the activation but that is not a “deactivation”, so all components we activated before still are active.

In this specific case, the firewall rules are enabled and we have aborted, which results in a complete loss of internet unless the VPN is on (aka the killswitch).

The solution is easy, just cross propagate an error signal between the components, so if we abort due to one error just ask all the other actors to tear down whatever they have.
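
The idea can be illustrated with a small sketch. This is not the actual patch and not the client's code: the `Component` class and `activate_all` function are hypothetical, and the sketch swaps the cross-component error signal described above for a simpler reverse-order rollback, which has the same effect of leaving nothing active after an aborted activation.

```python
class Component:
    """Stand-in for one daemon component (hypothetical interface)."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.active = False

    def activate(self):
        if self.fail:
            raise RuntimeError(f"{self.name} failed to activate")
        self.active = True

    def deactivate(self):
        self.active = False


def activate_all(components):
    """Activate components in order; on any error, tear down what was started."""
    started = []
    try:
        for component in components:
            component.activate()
            started.append(component)
    except RuntimeError:
        # Roll back in reverse order so the firewall (kill switch) is not left on.
        for component in reversed(started):
            component.deactivate()
        return False
    return True


components = [
    Component("wintun"),
    Component("wireguard"),
    Component("firewall"),
    Component("routing", fail=True),  # simulates the route-creation failure
]
ok = activate_all(components)
print(ok, [c.name for c in components if c.active])
```

In the sketch, the simulated routing failure leaves the firewall component inactive again, so an aborted activation would no longer block all traffic until reboot.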

data-sync-user commented 1 year ago

➤ Santiago Andrigo commented:

Fantastic, thanks Basti!