waldoj opened this issue 5 years ago
Looks like this is the query:
https://api.wigle.net/api/v2/network/search?onlymine=false&first=0&latrange1=37.72&latrange2=38.28&longrange1=-78.84&longrange2=-78.21&lastupdt=20150101&freenet=false&paynet=false&resultsPerPage=1000
Here's a cURL request:
curl -X GET "https://api.wigle.net/api/v2/network/search?onlymine=false&first=0&latrange1=37.72&latrange2=38.28&longrange1=-78.84&longrange2=-78.21&lastupdt=20150101&freenet=false&paynet=false" -H "accept: application/json" -u [API TOKEN] --basic
Great. Now I just need to iterate through the results. There are...uh...316,045 of them.
Here's a sample record:
{
"trilat": 37.73448563,
"trilong": -78.64722443,
"ssid": "linksys",
"qos": 0,
"transid": "20170723-00000",
"firsttime": "2017-07-22T16:00:00.000Z",
"lasttime": "2017-07-23T10:00:00.000Z",
"lastupdt": "2017-07-23T10:00:00.000Z",
"netid": "00:22:6B:8B:CF:08",
"name": null,
"type": "infra",
"comment": null,
"wep": "Y",
"bcninterval": 0,
"freenet": "?",
"dhcp": "?",
"paynet": "?",
"userfound": false,
"channel": 6,
"encryption": "wep",
"country": "US",
"region": "VA",
"housenumber": "",
"road": "Howardsville Turnpike",
"city": "",
"postalcode": "22969"
}
Jeez, I wish bulk data were available, or that somebody had built a harvester. (I found just one, part of a larger package, but it's abandoned and broken.)
Well, after getting ~20 pages of results, I was cut off: too many queries for today. At 100 results per page and 20 pages per day, that works out to 316,045 ÷ 2,000 ≈ 158 days.
This is not going to work.
WiGLE lifted my API limit to 500 pages per day, so I should have this done within a week. I wrote a script to do the harvesting and pulled down the first 50,000 records this evening.
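The script is nothing fancy; a rough sketch of the approach looks like this (the page-*.json filenames, the 1-second pause, and the 500-page count are assumptions here, and [API TOKEN] is a placeholder as above):

#!/bin/sh
# Sketch of the harvester: walk the same search query 100 records at a time
# by incrementing the "first" offset, saving each page to its own JSON file.
TOKEN="[API TOKEN]"
PER_PAGE=100
for PAGE in $(seq 0 499); do
  FIRST=$((PAGE * PER_PAGE))
  curl -s "https://api.wigle.net/api/v2/network/search?onlymine=false&first=${FIRST}&latrange1=37.72&latrange2=38.28&longrange1=-78.84&longrange2=-78.21&lastupdt=20150101&freenet=false&paynet=false&resultsPerPage=${PER_PAGE}" \
    -H "accept: application/json" -u "${TOKEN}" --basic \
    > "page-${FIRST}.json"
  sleep 1  # don't hammer the API
done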
Out of the 50k records collected so far, 8,736 (or 17%) have SSIDs that contain the strings xfinity or centurylink.
Regrettably, there's no practical way to exclude Charlottesville and UVA from the polygon being queried, and that artificially depresses the percentage of ISP-branded SSIDs, because so many of these are Welcome_to_UVa_Wireless (459), uvahealth-* (124), hscs-* (938), eduroam (98), cavalier (482), etc.: large, dispersed networks run by the university. Many hundreds of others appear in blocks of a few dozen for various hotels and large businesses. None of these will be a factor in the rural areas of the county.
15,006 have an SSID of null.
Combining everything should be as simple as:
jq .results *.json
All hail jq.
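(If a single merged array is ever needed, slurping the files first should do it; a sketch, with a made-up output filename:)

jq -s '[.[].results[]]' *.json > all-results.json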
With 100,000 results, I've got 50,300 that are in "city":"Charlottesville", 10 that are in Harrisonburg (what), and the remainder in "city":"", which is to say Albemarle County. Anecdotally, this city/county identification seems to be accurate.
2,362 records contain centurylink and 15,147 contain xfinity. That's 17,509 out of 100,000 records, or of course 17.5%. That's consistent with what we saw at 50,000 results, so my guess is that's not going to change much.
That said, I think once I filter this down to eliminate Charlottesville, the percentage of results that are ISP-branded is going to increase.
I'm stuck. There's an API problem with WiGLE that has me stalled out at 146,375 records: it stops returning further records at that point. The API says there are more results, but when I query that page of results, it says there are none. (I've reported the problem to the WiGLE team, but presumably they're operating with very little in the way of resources, so they can hardly be expected to provide a fix.) I'll just have to re-query periodically and hope it starts to work again.
It looks like the distribution of the data returned by WiGLE ain't exactly random. Big chunks of the county where I know broadband is widespread (thanks in part to looking at the mapped data on wigle.net) have nothing at all displayed yet. So a dataviz of the 48% of WiGLE data that I've got so far is misleading, because instead of showing 5 connections on a street that really has 10, it shows none.
WiGLE reports that they have duplicated the problem that I'm having, and they're working on a fix.
Here's how I'm exporting results to GeoJSON right now:
jq '.results | .[]' *.json |json2csv |grep -v "Charlottesville" |egrep -i "(xfinity|centurylink)" |csv2geojson
Note that this requires jq, json2csv, and csv2geojson; the latter two are npm modules.
Also, a flaw in that export is that it eliminates the header row of the CSV file (since the header contains neither xfinity nor centurylink), so I need to figure out how to fix that elegantly. The solution is probably to stop trying to be clever by doing this in a one-liner.
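Something like this two-step version would keep the header (a sketch; the intermediate and output filenames are made up):

jq '.results | .[]' *.json | json2csv > all.csv
awk 'NR == 1 || tolower($0) ~ /xfinity|centurylink/' all.csv | grep -v Charlottesville | csv2geojson > isp.geojson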
This data collection is going well. I've retrieved 296,500 of 316,100 records. I should have the balance within 24 hours, at which point I'll commit them to the repo.
OK, all data collected.
The GeoJSON for the whole county is over 100 MB, which is bigger than GitHub will accept without going down the LFS rabbit hole. Not sure what to do about that.
So, I stripped the ISP records down to Albemarle County...but I did it in QGIS. That's not great, because it's not something that can be dropped into a shell script.
So committing all WiGLE records for the county as GeoJSON is impossible, because the file is over 100 MB, but it works fine as a SpatiaLite file. I've done that.
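For what it's worth, the GeoJSON-to-SpatiaLite conversion (and probably the county clip, too) could be scripted with GDAL's ogr2ogr rather than done in QGIS; a sketch, assuming a county boundary file named albemarle.geojson and input/output filenames of my own invention:

ogr2ogr -f SQLite -dsco SPATIALITE=YES -clipsrc albemarle.geojson wigle-albemarle.sqlite wigle-all.geojson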
Thanks to the ubiquitous xfinitywifi base stations, BSSID (netid) is the only sensible field to use to ensure that records are unique. I'm using QGIS's "Remove Duplicate Points" tool to do this.
WiGLE has a huge amount of data from people wardriving around the U.S., so in addition to collecting original data (#2), it probably makes sense to start by ingesting their data.