mullvad / mullvadvpn-app

The Mullvad VPN client app for desktop and mobile
https://mullvad.net/
GNU General Public License v3.0

WireGuard network routing causing totally wrong locations, thus slow and laggy connections, perhaps only via Los Angeles portal #1735

Closed · narration-sd closed this issue 4 years ago

narration-sd commented 4 years ago

Issue report

Operating system: Noted on Android 7.1.1 app; routes checked via Win10 tracert

App version: was 2020.4 beta 3 on Android

Issue description

For some days, I had very slow and laggy connections about half the time. Investigation of the IP addresses showed that though I connected specifying a Los Angeles location, the resulting IPs (read from your dropdown on screen) were clearly European. Fing reported them as Belgrade, but that is its monitoring location, likely somewhere near the actual exit point.

I'll attach a trace of one such connection -- from experience, the latency of crossing the US is evident, then the very typical cross-Atlantic jump -- and then some descent into European routers.

I suspect this (and the blanked-out servers in your app list) has been due to difficulties surrounding the posted Los Angeles server switchover. That lasted much longer than planned, so routing difficulties are not a big surprise either. From what I see by now, these may have been fixed; I'm not getting slow connections now, it appears.

What I would strongly suggest is that you install reliable, constant full-network monitoring. Routing errors are endemic on the Internet -- router reinstalls, technician errors, mis-plugged cables; this is something that can never be fully stable, so you have to catch and repair the errors as they occur. I will grant that it's much better than some years ago, but again, what's really needed is constant monitoring to verify the situation, professionally.
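To illustrate the kind of thing I mean, here is a minimal sketch of such a probe (the relay hostnames, the 150 ms alert threshold, and the Unix-style ping output are all my own assumptions, not Mullvad specifics):

```python
# Minimal relay-latency watchdog sketch. Hostnames and the 150 ms alert
# threshold are hypothetical placeholders; assumes a Unix-like `ping`.
import subprocess
import time

RELAYS = ["us-lax-001.example.net", "us-lax-002.example.net"]  # hypothetical
THRESHOLD_MS = 150.0  # plausible ceiling for a nearby US relay
INTERVAL_S = 300      # probe every five minutes

def ping_ms(host: str) -> float | None:
    """Return the average RTT in ms from three pings, or None on failure."""
    try:
        out = subprocess.run(
            ["ping", "-c", "3", host],
            capture_output=True, text=True, timeout=30, check=True,
        ).stdout
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    for line in out.splitlines():
        if "min/avg/max" in line:       # Linux/macOS summary line
            return float(line.split("/")[4])  # field 4 is the average RTT
    return None

while True:
    for relay in RELAYS:
        rtt = ping_ms(relay)
        if rtt is None or rtt > THRESHOLD_MS:
            print(f"ALERT: {relay} rtt={rtt} ms")  # hook up real alerting here
    time.sleep(INTERVAL_S)
```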

As a side but important point, I will note that it appears your entire network in North America may have had packet shaping applied, to limit connection speeds, from time to time since approximately the end of last week.

This morning, pre-business hours, I was getting only 20 Mbps (or much less on some) from major portals, entirely consistently as I tested among them for some minutes, where at the same time I measured most of an up-to-40 Mbps actual ISP connection speed.

Not sure why you would do this, as it doesn't make you look good for reviews, never mind for us customers... but now I've checked, and speeds are back up again in mid-morning, with not so much loss relative to ISP-only. I mention packet shaping because the speed profile appeared to show that, with a time constant.

Finally, here's the traceroute for a typical failure of the kind this issue is raised for, the 'Belgrade' connection, as shared with Sanny.

The actual location should have been as set and reported, Los Angeles, with only circa 25 ms latency, as I measure very often on correct connections. Again, this may not be occurring now, but you can see how definite the case is when it does:

```
Tracing route to 89.49.90.93 over a maximum of 30 hops

  1    48 ms    67 ms    49 ms  10.64.0.1
  2    50 ms    50 ms    51 ms  static-198-54-128-33.cust.tzulo.com [198.54.128.33]
  3    50 ms    49 ms    55 ms  174.128.248.177
  4    52 ms    55 ms    52 ms  10.0.0.5
  5    55 ms    50 ms    52 ms  38.32.93.1
  6    50 ms    57 ms    50 ms  be2180.ccr22.den01.atlas.cogentco.com [154.54.26.165]
  7    72 ms    69 ms    68 ms  be3036.ccr22.mci01.atlas.cogentco.com [154.54.31.90]
  8    79 ms    74 ms    73 ms  be2832.ccr42.ord01.atlas.cogentco.com [154.54.44.170]
  9    86 ms    86 ms    85 ms  be2718.ccr22.cle04.atlas.cogentco.com [154.54.7.130]
 10    90 ms    86 ms    88 ms  be2994.ccr32.yyz02.atlas.cogentco.com [154.54.31.234]
 11   107 ms    95 ms    92 ms  be3260.ccr22.ymq01.atlas.cogentco.com [154.54.42.90]
 12   205 ms   207 ms   205 ms  be3043.ccr22.lpl01.atlas.cogentco.com [154.54.44.165]
 13   203 ms   204 ms   302 ms  be2183.ccr42.ams03.atlas.cogentco.com [154.54.58.70]
 14   205 ms   204 ms   201 ms  be2440.agr21.ams03.atlas.cogentco.com [130.117.50.6]
 15   200 ms   200 ms   200 ms  be2114.rcr21.dus01.atlas.cogentco.com [130.117.48.62]
 16   210 ms   209 ms   212 ms  xe-2-1-1-0.dus3-j.mcbone.net [149.6.138.178]
 17    Request timed out.
 18    Request timed out.
 19    Request timed out.
 20    Request timed out.
 21    Request timed out.
 22    Request timed out.
 23    Request timed out.
 24    Request timed out.
 25    Request timed out.
 26    Request timed out.
 27    Request timed out.
 28    Request timed out.
 29    Request timed out.
 30    Request timed out.

Trace complete.
```

faern commented 4 years ago

This issue is not app-related. Please contact our support for infrastructure-related inquiries.

narration-sd commented 4 years ago

All right, and I've had a nice note back from Sanny.

I'm fully cognizant of the usual uses of GitHub issues. However, this routing thing is/was serious. You don't want marks on your reputation for this.

I felt, and feel, the need to bring it to deeper technical attention because:

-- you are in a much better position to realize the seriousness and get action appropriate to it;
-- there is no way to reach the server/network experts in your organization;
-- it's really hard to get information recognized at first-level technical support, as surely you appreciate, when it's about things they are confident 'just work' -- and often need to offer that confidence;
-- the proper solution to such networking/routing/server failures is a robust and fully independent monitoring arrangement, which by the evidence you don't have. This is a highly technical matter, so again I reach for the most technically aware point of contact available;
-- I depend on you recognizing the impact on the perceived competency of your application, and so not just passing this on, but becoming directly engaged in assuring your company as a group makes the needful happen. Others will thank you, including of course Mozilla -- who may be able to help even with the monitoring arrangement. It would be a good discussion, no?

I'm sure you understand what I'm getting at here, so won't belabour it.

faern commented 4 years ago

I assure you that the emails to support are read by people with more than enough knowledge on the topic. This team works on the app and doesn't have the appropriate knowledge or access privileges to debug infrastructure and routing issues.

The IP you are tracerouting does not belong to any of our servers, and it is indeed a European IP. So I'm not sure what you are trying to show really.

Again, support will be able to help you better. But please provide more actual data about the problem. And I mean data, not descriptions. For example, which LA server were you trying to connect to when the issue occurred?

narration-sd commented 4 years ago

Thanks, Linus - and I just got a good note and replied to it from Richard.

What I showed by the traced connection: I was set to a Los Angeles exit point for Mullvad, yet kept getting quite evidently European/Mitteleuropa exit points. The traceroute I sent evidences one of those. This misrouting occurred circa 50% of the time or more for a few days. It's a routing error at some level, and that's why I intended your network support team to learn of it -- not just to make a fix, as they evidently finally did, but to realize they need constant monitoring that will catch things like this. Such errors are endemic on the internet, as you may know.

Thanks, and a nice evening to you there, Clive

faern commented 4 years ago

I'm not sure how tracerouting a European IP proves anything about where you exit. Of course tracing to Europe will cause packets to leave LA. If you want to see whether you exit in LA, you need to traceroute an IP in LA and verify that no hop strays too far from LA.
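To illustrate (a rough sketch, assuming a Unix-like traceroute; the target IP and the latency ceiling are placeholders, not anything of ours):

```python
# Rough check: traceroute an LA-hosted IP and flag any hop whose RTT implies
# the path has left the LA area. Target IP and ceiling are placeholders.
import re
import subprocess

TARGET = "203.0.113.10"  # hypothetical LA-hosted IP (documentation range)
CEILING_MS = 90.0        # generous RTT ceiling for a path staying near LA

out = subprocess.run(["traceroute", "-n", TARGET],
                     capture_output=True, text=True, timeout=120).stdout

for line in out.splitlines():
    # A hop line looks like: " 3  198.51.100.1  48.2 ms  49.0 ms  50.1 ms"
    rtts = [float(m) for m in re.findall(r"([\d.]+) ms", line)]
    if rtts and max(rtts) > CEILING_MS:
        print(f"Suspicious hop (RTT > {CEILING_MS} ms): {line.strip()}")
```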

Please spare us lectures on how the internet works and on the need for monitoring.

narration-sd commented 4 years ago

Well, no lecture intended, save on the monitoring-need point, which I see still sitting, not quite understandably, in limbo in another conversation you indicate you'll see, so I won't belabour it here. What I did intend was actually a nod to your knowledge...

> I'm not sure how tracerouting a European IP proves anything about where you exit.

Well, in my own internet/VPN understanding, it goes like this, some roughness allowed:

This immediately showed what I've communicated to support, and then to you, via the example trace. The path crosses the US and then the Atlantic, by the latency jumps, and then descends into the snarl of European routings -- better than in the days when they often looped through London, but still not reporting enough to be confident where they are. I was also making basic measurements using Fing, and it chose its own distant portal to do that, labelling it (not that I necessarily believe it) Belgrade.
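To make the latency reading concrete, here is a small sketch of that inference (hop averages approximated from the trace above; the 80 ms jump threshold is my own rough rule of thumb):

```python
# Infer an ocean crossing from a consecutive-hop RTT jump in a traceroute.
# Averages below are approximated from the trace posted earlier; the 80 ms
# threshold is a rough rule of thumb, not a standard.
avg_rtt_ms = [55, 50, 51, 53, 52, 52, 70, 75, 86, 88, 98, 206, 236, 203, 200, 210]

for hop, (a, b) in enumerate(zip(avg_rtt_ms, avg_rtt_ms[1:]), start=1):
    if b - a > 80:
        print(f"Hop {hop} -> {hop + 1}: RTT jumps {a} -> {b} ms (likely transoceanic)")
```

Run against the trace above, that flags exactly the jump from hop 11 (ymq, Montreal) to hop 12 (lpl, Liverpool) -- the Atlantic crossing.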

All of this added up to the same thing: Mullvad was consistently (30-60% of the time) setting me up with a tunnel to the eastern side of Europe when I'd asked for one to Los Angeles. After a few days, it looks like it no longer does this. But.

Here is the point where active, consistent, 24/7 route monitoring seems essential to ground the service you offer. These particular routing errors may well have been because of the server switching that continued for over two weeks 'at' Los Angeles; but for that kind of case, and all the others (hence my tip towards experienced knowledge of the internet), another hat I wear tends to sense that you need the monitoring as much as cars in Scandinavia need winter tires, and all the other things that make for comfort and survival in the elements there. It's a good psychological position, is it not? And practical.

Ok, you can understand that in my dotage (you'd be surprised) stories are a big interest, but I also hope they help us get the boat on an even keel here. Thanks, Linus, for your patience and attentive interest; I do appreciate them, and we don't all have to be experts on everything to keep developing our accurate senses of situations, at least I think :)

Best, Clive -- and now tell me...?

narration-sd commented 4 years ago

Oh, and if you may be interested: one of the things I am very happy to have Mullvad for is taking looks at how a cloud service architecture I'm rapidly putting together is progressing, for an effort out of a European university to make 3D-print designs for covid-critical medical equipment available wherever persons can build for and help themselves in the world. You may know there can be surprises as to abilities of this kind, not that we should be any more surprised. I'm consulting at several levels there, but my hand in certain software has appeared appropriate, so I do it... the hybrid app side, Vue etc., is also part of this, so possibly relates to your present bailiwick.

faern commented 4 years ago

Exit and entry IPs of our tunnels are on the same server -- given that you have not enabled bridge mode in the app, of course. The server selection is far simpler than you seem to believe. Every server we have has two IPs, one entry IP and one exit IP. They also have a location. The server selection algorithm in the app then simply picks a random server in the location you have selected. If you select Los Angeles, it picks a random server in Los Angeles. It then establishes a VPN tunnel to the entry IP of that server. When you send traffic through this tunnel, it exits to the internet on the exit IP of that very same physical server. Hence, still in LA.

There is nothing smart about selecting an entry IP that is best from your ISP's point of view, nor is there any agreement between app and servers about which exit to pick. It's far simpler than that, and it's all in the app. It simply selects a single server randomly within the location you have selected and creates a tunnel directly to that server. Done.
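In code terms, the selection described above amounts to no more than this (a sketch only; the types and relay data are illustrative, not the app's actual code):

```python
# Sketch of the selection logic described above: filter relays by the chosen
# location, pick one at random, tunnel to its entry IP, exit on its exit IP.
# The Relay dataclass and the relay list are illustrative, not the app's code.
import random
from dataclasses import dataclass

@dataclass
class Relay:
    location: str
    entry_ip: str
    exit_ip: str

RELAYS = [  # illustrative entries only
    Relay("us-lax", "198.51.100.1", "198.51.100.2"),
    Relay("us-lax", "198.51.100.3", "198.51.100.4"),
    Relay("se-sto", "203.0.113.1", "203.0.113.2"),
]

def pick_relay(location: str) -> Relay:
    candidates = [r for r in RELAYS if r.location == location]
    return random.choice(candidates)

relay = pick_relay("us-lax")
# The tunnel goes to relay.entry_ip; traffic then leaves on relay.exit_ip of
# the same physical machine, so the exit stays in the selected location.
```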

The traceroute you provided is to an IP not in any way related to Mullvad. And yes, it's in Europe. That does not prove or disprove anything about Mullvad.

narration-sd commented 4 years ago

@faern Linus, here's the fresh report, as passed to Richard. I think you'll understand what kind of failure actually happened, and that M247, as the host of Mullvad's location, is probably the source of the problem.

My real point is that Mullvad should gain the security of pro-actively monitoring for and catching such things. The commentary should show the motivation, and an initial vendor possibility or two to start looking at.

Thanks, and I am sure you appreciate the responsibility here and will know who should see this -- Clive

Report follows, and I'm attaching the same materials from the measurement:

Hello Richard, and good also.

A little more rested this early morning, and so insight wanted one more look, as follows. I went back to the actual report I'd made to your support, and the details are there -- with something of a precise and actual answer for what we've been looking at, I think.

So, the misrouting has all along apparently been from my ISP to your entry IP. Just exactly who is responsible for this gets into aspects of internet/backbone routing I'm not fully qualified to judge. Whether it's the Cox ISP's fault, or something about how M247 announced their IP for routing -- that's my point -- isn't clear, and may not be discoverable from just this traceroute. What does jump out at me is that the M247 entity listed in the ARIN report is in Romania. And I was getting various reports during the time of these problems purporting my location to be in Romania. So there is other indication of something quite wrong about the routes.
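(For reference, the ownership check is easy to repeat -- a minimal sketch, assuming a standard whois client is installed; the IP is the Mullvad entry IP from my attached screenshot:)

```python
# Repeat the registry-ownership lookup behind the ARIN report mentioned
# above. Assumes a standard `whois` client is available on the PATH.
import subprocess

ip = "89.45.90.93"  # the Mullvad entry IP from the attached screenshot
result = subprocess.run(["whois", ip], capture_output=True, text=True)

for line in result.stdout.splitlines():
    # Keep only the fields naming the owning organisation and its country.
    if line.lower().startswith(("orgname", "org-name", "netname", "country")):
        print(line.strip())
```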

Could you, then, have caught this via the kind of monitoring I suggest? Caught your portal service provider misrouting on their own, as it appears may have happened?

I'm allowing it's possible that the Cox ISP misrouted; they have issues enough. But the smoking gun here -- the Romania location, and the fact that it was M247 as a destination that was getting consistently misrouted over a several-week period when they were in fact switching around servers at that 'location' for you -- does seem to lay the problem at their doorstep.

And in that case, yes: an external monitor would have had the same experience I did: try to connect, on its constant tests, from any ISP to your incoming server in Los Angeles, and instead get led around the barn to Europe. Alarms should then go up immediately.
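(Concretely, each probe's check could be as simple as this sketch; I'm assuming Mullvad's connection-check service at am.i.mullvad.net returns JSON fields along these lines -- the exact field names would need verifying:)

```python
# Probe sketch: after connecting to a Los Angeles relay, ask Mullvad's
# connection-check service where we actually exited. The exact JSON field
# names ("city", "country", "mullvad_exit_ip") are assumptions to verify.
import json
import urllib.request

EXPECTED_CITY = "Los Angeles"

with urllib.request.urlopen("https://am.i.mullvad.net/json", timeout=30) as resp:
    info = json.load(resp)

if not info.get("mullvad_exit_ip"):
    print("ALERT: not exiting through a Mullvad server at all")
elif info.get("city") != EXPECTED_CITY:
    print(f"ALERT: selected {EXPECTED_CITY}, exiting in "
          f"{info.get('city')}, {info.get('country')}")
else:
    print("OK: exit location matches selection")
```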

I can't quite fathom what you are saying about this kind of monitoring being a difficulty -- it would seem the most natural thing for you at Mullvad to have. Maybe I am missing a point somewhere?

As far as 'we would have seen it complained about by others', I think that is not at all necessarily so. Persons don't complain nearly as much about a soft failure, as this was, as they do about a hard one.

If you didn't connect at all -- bang, phone/DownDetector reporting jumps off the hook. If you connect through Robin Hood's forest, slow and laggy because it's going around the world, people just think it's a bad normal -- and downgrade their opinion of Mullvad as a 'slow VPN'. That's what they saw, isn't it? As I did, but didn't accept it.

What sort of solution might be available? I made a quick survey, where of course it's hard to spot the serious large-network providers among the myriad of let's-monitor-your-own-router-hardware plays.

Richard, I don't know your exact sphere of responsibility, but I think you'll understand what the analysis above shows really happened, and how it reflected directly on Mullvad even though it was likely your vendor's fault.

I hope you can pass this note up to someone with appropriate responsibility for your worldwide service, if that is not you.

Thanks a lot for entering into the conversation, and best where you are, sincerely.

Clive Steward

Here's the traceroute: richard.tracert.txt

This is the ARIN report on ownership of both IPs involved, Mullvad in and out for Los Angeles: ARIN Whois_RDAP - American Registry for Internet Numbers.pdf

Finally, the Mullvad screen during the episode, listing those IP numbers: richard mullvad screen

narration-sd commented 4 years ago

Edited the order and titling of attachments, so that it comes out a little clearer.

faern commented 4 years ago

Your screenshot shows 89.45.90.93, but the traceroute 89.49.90.93. Notice the 45 vs 49 in the second octet? You are not tracerouting our server. As I pointed out many posts ago: You are tracerouting some unrelated IP in Europe that has nothing to do with us.

narration-sd commented 4 years ago

...and... there's more to the detective story -- a wrong turn taken. Here's the note from Richard, and my reply.

As said, the issue did really happen, and for days, even if not for an apparent reason. And I stand, with thoughtfulness, by what I propose, so that you at Mullvad have secure protection whenever any such things occur.

As they may also come from nefarious activity -- persons fooling with BGP, etc. -- which is apparently an ongoing problem you want to be safe from, to the degree possible.

Thanks, Linus, and as said, to Richard, Clive

Richard: Hello,

The IP address you posted in the screenshot is: 89.45.90.93
The IP address you did a traceroute to: 89.49.90.93

They are not the same IP address.

Please confirm the results.

Best regards, richard

My reply:

Well, that's really right. And 89.49.90.93 is in Hamburg, according to ARIN, so that's why Linus talks about Germany.

So, for want of eyesight in a moment, I don't have evidence, except for those indicators which seemed to say my exit point was Romania for a week; and now the problem's gone away, to all appearances.

I guess we will have to chalk it up to the days of server transfers, and leave it at that.

Professionally, I'd still be as interested in real route monitoring for Mullvad's secure knowledge of its 24/7 abilities -- this is quite separate from latency or other performance monitoring, and I hope the links provided assure we're both looking at the same thing.

Thanks, Richard, and of course my apology for the goose-chase portion of this...!

Clive