Closed hannosch closed 9 years ago
I am not a fan of all these GeoIP approaches. GeoIP is already technically inaccurate, and on top of that you always have the time delay between measurements and reports that you mentioned. Additionally, there are VPNs. If I lived in China/Iran/YouNameIt, I'd have a VPN configured for all traffic on my phone. Expect to see a lot of Chinese WiFi networks appear in the US...
You have the most valuable and accurate position data included with the correct time stamp and I think you should put all efforts into exploiting just that. It's better than inventing workarounds for the N% false data.
edit: As a plus, when borders change over time (Crimea, Sudan, maybe Catalonia...), your data is not corrupted. If you reran that reverse geocoding periodically over the existing data (say, once a year), you'd have your country assignments corrected for free.
@VolMi I think in this case VPNs aren't a problem. For incoming data traffic we'd only use GeoIP as a hint as to what country the data might be from. We can still cross-check that against the country bounding box for the lat/lon of the WiFi. For submission traffic we wouldn't discard the data, just not fill in the country value.
Using the cell tower's MCC won't have this problem, as the code really is from the country the WiFi network is in, and not from the one where you upload the data / connect to the service.
My hope is that both of these approaches combined will get us a country value for most WiFi networks. It's computationally very cheap to do, and we don't need extra data files or more complicated GIS algorithms. We can add those later to fill in the hopefully small missing bits or reclassify data. So this is very much a "get us 80% of the way, with 20% of the effort" approach. And as long as this is only used for statistics, there's not much harm done if we get it wrong.
Possible alternative: tile-based reverse geocoding. We keep a table of (say) 1x1km tiles of the earth and their associated country code. It's lazily populated. When we get a wifi, we round to the nearest tile and look it up. If there's a match, we have a good guess at the country. If there's no match, we run a (once per tile) expensive-ish reverse lookup of that square against a relatively high-resolution shapefile, and cache the result. We don't even need to store it in the wifi: we can calculate country-aggregate statistics by selecting all the tiles in a country and then all the wifis in any tile, using bounding-box queries.
There are at most 150 million such tiles on the land surface, and I wager only a tiny percentage of them will ever be populated by our measurements. Most wifis will be in cities. We can use a lower resolution (10x10km?) if that's still too much.
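The tile idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `TILE_DEG`, `tile_for` and `TileCountryCache` are hypothetical names, the tiles are cut in degrees (0.01° is about 1 km of latitude, less in longitude away from the equator) rather than true 1x1km squares, and the expensive shapefile lookup is represented by a caller-supplied function that is queried at the tile center instead of against the whole square.

```python
import math
from typing import Callable, Dict, Optional, Tuple

Tile = Tuple[int, int]

# Hypothetical tile size in degrees; roughly 1 km of latitude.
TILE_DEG = 0.01


def tile_for(lat: float, lon: float, size: float = TILE_DEG) -> Tile:
    """Return the integer index of the tile containing (lat, lon)."""
    return (math.floor(lat / size), math.floor(lon / size))


class TileCountryCache:
    """Lazily populated mapping of tile -> country code (or None)."""

    def __init__(self, slow_lookup: Callable[[float, float], Optional[str]]):
        # slow_lookup stands in for the expensive shapefile-based lookup.
        self._slow_lookup = slow_lookup
        self._tiles: Dict[Tile, Optional[str]] = {}
        self.slow_calls = 0  # counts cache misses, for illustration

    def country(self, lat: float, lon: float) -> Optional[str]:
        key = tile_for(lat, lon)
        if key not in self._tiles:
            # First time we see this tile: pay for the slow lookup once,
            # using the tile center as a simplification.
            self.slow_calls += 1
            center_lat = (key[0] + 0.5) * TILE_DEG
            center_lon = (key[1] + 0.5) * TILE_DEG
            self._tiles[key] = self._slow_lookup(center_lat, center_lon)
        return self._tiles[key]


# Toy stand-in for the shapefile lookup, not real border data.
cache = TileCountryCache(lambda lat, lon: "DE" if lon > 7 else "NL")
first = cache.country(52.5201, 13.4051)
second = cache.country(52.5209, 13.4059)  # same tile, served from the cache
```

A tile that straddles a border still gets a single country code here, so the border cases raised in this thread would need a finer tile size or a per-tile "ambiguous" marker.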
You should take borders into consideration.
You can scan/see Dutch wifi APs in Germany, for example, if you are at the border.
There are cities directly on the border, so there can be more than one country's wifi in range.
You need to take a guess at the wifi AP's center through triangulation.
We maintain a region code for all new and updated wifi networks now. We still have to process the backlog of all other wifi networks. Once that is done we can tackle the display side of this in #242.
I've processed the backlog during the last week. As of today we display WiFi stats at https://location.services.mozilla.com/stats/regions
As a follow-up to #221 we should be able to do country bounding box checks for wifi data. And it would be nice to be able to generate country level statistics for WiFi data. Maybe in the future we'll need to do sharding, and a country field would be a good fit there.
One way to do this would be to do reverse geocoding and guess the ISO country code based on the lat/lon alone. This would require very accurate non-overlapping shapefiles for country data. I'm hesitant to add this, but maybe it's not as complicated as I fear.
I think in most cases we can cheat and avoid this. Most of our submit traffic includes both cell and wifi data at the same time, and comes from an IP address that GeoIP resolves to the right country. I think we can pass this data through from the outermost submit_view layer down into the tasks writing new wifi measures.
As a second approach, our country bounding boxes uniquely identify one country for large parts of the world. So we could reclassify most of the current data based on this.
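A minimal sketch of that reverse bounding box check, assuming a table of per-country boxes. The names `COUNTRY_BBOXES`, `countries_containing` and `unique_country` are hypothetical, and the box values are rough illustrative numbers, not real border data. Boxes that cross the antimeridian are not handled.

```python
from typing import Dict, List, Optional, Tuple

# (min_lat, min_lon, max_lat, max_lon); rough illustrative values only.
BBox = Tuple[float, float, float, float]

COUNTRY_BBOXES: Dict[str, BBox] = {
    "DE": (47.2, 5.8, 55.1, 15.1),
    "NL": (50.7, 3.3, 53.6, 7.3),
    "BR": (-33.8, -74.0, 5.3, -34.7),
}


def countries_containing(
    lat: float, lon: float, boxes: Dict[str, BBox] = COUNTRY_BBOXES
) -> List[str]:
    """Return every country whose bounding box contains the point."""
    return sorted(
        code
        for code, (south, west, north, east) in boxes.items()
        if south <= lat <= north and west <= lon <= east
    )


def unique_country(
    lat: float, lon: float, boxes: Dict[str, BBox] = COUNTRY_BBOXES
) -> Optional[str]:
    """Return a country code only if exactly one box matches."""
    matches = countries_containing(lat, lon, boxes)
    return matches[0] if len(matches) == 1 else None
```

Points where boxes overlap (border regions, enclaves) or where no box matches come back as `None` and would be left unclassified until a more precise method is available.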
I'm not too worried about having a country code for each wifi measure, only for the aggregated wifi tables. So we basically only need to get a correct match for each WiFi once, either via GeoIP/MCC or reverse bounding box checks. My hope is that these two approaches should cover most of our data.
The whole passing data down approach is a bit fragile, as there's often zero or multiple country codes involved or conflicting data (like someone submitting us data from her trip abroad). If we get non-unique data, I'd refuse to guess, but take mcc over geoip data.
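One possible reading of that rule, as a sketch: a single country derived from the cell MCCs wins over GeoIP, conflicting MCC countries mean we refuse to guess, and GeoIP is only used when there is no cell data at all. The function name `pick_country` and its inputs (country codes already mapped from MCC values, plus an optional GeoIP country code) are assumptions made for illustration.

```python
from typing import Iterable, Optional


def pick_country(
    mcc_countries: Iterable[str], geoip_country: Optional[str]
) -> Optional[str]:
    """Pick a country code for a report, or None if it is ambiguous.

    mcc_countries: country codes already derived from the MCC of each
    cell in the report (hypothetical pre-processing step).
    geoip_country: country code from the submitter's IP, if any.
    """
    from_mcc = set(mcc_countries)
    if len(from_mcc) == 1:
        # Exactly one country from cell data: prefer it over GeoIP.
        return next(iter(from_mcc))
    if from_mcc:
        # Cells from several countries in one report: refuse to guess.
        return None
    # No cell data: fall back to the GeoIP hint, which may be None.
    return geoip_country
```

For example, `pick_country(["DE"], "US")` returns `"DE"`, which covers the case of someone uploading data from a trip abroad after returning home.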
This might need discussion or a different approach :)