waldoj opened this issue 5 years ago
Looks like this is the query:
https://api.wigle.net/api/v2/network/search?onlymine=false&first=0&latrange1=37.72&latrange2=38.28&longrange1=-78.84&longrange2=-78.21&lastupdt=20150101&freenet=false&paynet=false&resultsPerPage=1000
Here's a cURL request:
curl -X GET "https://api.wigle.net/api/v2/network/search?onlymine=false&first=0&latrange1=37.72&latrange2=38.28&longrange1=-78.84&longrange2=-78.21&lastupdt=20150101&freenet=false&paynet=false" -H "accept: application/json" -u [API TOKEN] --basic
Great. Now I just need to iterate through the results. There are...uh...316,045 of them.
Here's a sample record:
{
"trilat": 37.73448563,
"trilong": -78.64722443,
"ssid": "linksys",
"qos": 0,
"transid": "20170723-00000",
"firsttime": "2017-07-22T16:00:00.000Z",
"lasttime": "2017-07-23T10:00:00.000Z",
"lastupdt": "2017-07-23T10:00:00.000Z",
"netid": "00:22:6B:8B:CF:08",
"name": null,
"type": "infra",
"comment": null,
"wep": "Y",
"bcninterval": 0,
"freenet": "?",
"dhcp": "?",
"paynet": "?",
"userfound": false,
"channel": 6,
"encryption": "wep",
"country": "US",
"region": "VA",
"housenumber": "",
"road": "Howardsville Turnpike",
"city": "",
"postalcode": "22969"
}
Jeez, I wish bulk data were available, or that somebody had built a harvester. (I found just one, part of a larger package, but it's abandoned and broken.)
Well, after getting ~20 pages of results, I was cut off: too many queries for today. At 100 results per page and 20 pages per day, that works out to 316,045 ÷ 2,000 ≈ 158 days.
This is not going to work.
WiGLE lifted my API limit to 500 pages per day, so I should have this done within a week. I wrote a script to do the harvesting and pulled down the first 50,000 records this evening.
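The script is nothing fancy; a rough sketch of the approach looks like this (the page-*.json filenames, the 1-second pause, and the 500-page count are assumptions here, and [API TOKEN] is a placeholder as above):

#!/bin/sh
# Sketch of the harvester: walk the same search query 100 records at a time
# by incrementing the "first" offset, saving each page to its own JSON file.
TOKEN="[API TOKEN]"
PER_PAGE=100
for PAGE in $(seq 0 499); do
  FIRST=$((PAGE * PER_PAGE))
  curl -s "https://api.wigle.net/api/v2/network/search?onlymine=false&first=${FIRST}&latrange1=37.72&latrange2=38.28&longrange1=-78.84&longrange2=-78.21&lastupdt=20150101&freenet=false&paynet=false&resultsPerPage=${PER_PAGE}" \
    -H "accept: application/json" -u "${TOKEN}" --basic \
    > "page-${FIRST}.json"
  sleep 1  # don't hammer the API
done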
Out of the 50k records collected so far, 8,736 (or 17%) have SSIDs that contain the strings xfinity or centurylink.
Regrettably, there's no practical way to exclude Charlottesville and UVA from the polygon being queried, and that artificially depresses the percentage of ISP-branded SSIDs, because so many of these are Welcome_to_UVa_Wireless (459), uvahealth-* (124), hscs-* (938), eduroam (98), cavalier (482), etc.: large, dispersed networks run by the university. Many hundreds of others appear in blocks of a few dozen for various hotels and large businesses. None of these will be a factor in the rural areas of the county.
15,006 have an SSID of null.
Combining everything should be as simple as:
jq .results *.json
All hail jq.
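(If a single merged array is ever needed, slurping the files first should do it; a sketch, with a made-up output filename:)

jq -s '[.[].results[]]' *.json > all-results.json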
With 100,000 results, I've got 50,300 that are in "city":"Charlottesville", 10 that are in Harrisonburg (what), and the remainder in "city":"", which is to say Albemarle County. Anecdotally, this city/county identification seems to be accurate.
2,362 records contain centurylink and 15,147 contain xfinity. That's 17,509 out of 100,000 records, or of course 17.5%. That's consistent with what we saw at 50,000 results, so my guess is that's not going to change much.
That said, I think once I filter this down to eliminate Charlottesville, the percentage of results that are ISP-branded is going to increase.
I'm stuck. There's an API problem with WiGLE that has me stalled out at 146,375 records: it stops returning further records at that point. The API says there are more results, but when I query that page of results, it says there are none. (I've reported the problem to the WiGLE team, but presumably they're operating with very little in the way of resources, so they can hardly be expected to provide a fix.) I'll just have to re-query periodically and hope it starts to work again.
It looks like the distribution of the data returned by WiGLE ain't exactly random. Big chunks of the county where I know broadband is widespread (thanks in part to looking at the mapped data on wigle.net) have nothing at all displayed yet. So a dataviz of the 48% of WiGLE data that I've got so far is misleading, because instead of showing 5 connections on a street that really has 10, it shows none.
WiGLE reports that they have duplicated the problem that I'm having, and they're working on a fix.
Here's how I'm exporting results to GeoJSON right now:
jq '.results | .[]' *.json |json2csv |grep -v "Charlottesville" |egrep -i "(xfinity|centurylink)" |csv2geojson
Note that this requires jq, json2csv, and csv2geojson; the latter two are npm modules.
Also, a flaw in that export is that it eliminates the header row of the CSV file (since the header contains neither xfinity nor centurylink), so I need to figure out how to fix that elegantly. The solution is probably to stop trying to be clever by doing this in a one-liner.
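Something like this two-step version would keep the header (a sketch; the intermediate and output filenames are made up):

jq '.results | .[]' *.json | json2csv > all.csv
awk 'NR == 1 || tolower($0) ~ /xfinity|centurylink/' all.csv | grep -v Charlottesville | csv2geojson > isp.geojson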
This data collection is going well. I've retrieved 296,500 of 316,100 records. I should have the balance within 24 hours, at which point I'll commit them to the repo.
OK, all data collected.
The GeoJSON for the whole county is over 100 MB, which is bigger than GitHub will accept without going down the LFS rabbit hole. Not sure what to do about that.
So, I stripped the ISP records down to Albemarle County...but I did it in QGIS. That's not great, because it's not something that can be dropped into a shell script.
So committing all WiGLE records for the county as GeoJSON is impossible, because the file is over 100 MB, but it works fine as a SpatiaLite file. I've done that.
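For what it's worth, the GeoJSON-to-SpatiaLite conversion (and probably the county clip, too) could be scripted with GDAL's ogr2ogr rather than done in QGIS; a sketch, assuming a county boundary file named albemarle.geojson and input/output filenames of my own invention:

ogr2ogr -f SQLite -dsco SPATIALITE=YES -clipsrc albemarle.geojson wigle-albemarle.sqlite wigle-all.geojson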
Thanks to the ubiquitous xfinitywifi base stations, BSSID (netid) is the only sensible field to use to ensure that records are unique. I'm using QGIS's "Remove Duplicate Points" tool to do this.
WiGLE has a huge amount of data from people wardriving around the U.S., so in addition to collecting original data (#2), it probably makes sense to start by ingesting their data.