All job attributes are stored as Tags, including location. Currently a location is saved simply as a string, e.g. "San Francisco, CA". There are two problems:
These strings aren't normalized; they're scraped per job and can be anything. If you search "San..." today you'll see "San Francisco, CA", "San Francisco", "SF/Bay Area", "SF, CA, USA", etc. So we need to normalize locations (so "San Francisco, CA" can only ever be one location tag).
There's no geographic awareness for radius-based search preferences (we store no lat/lng information).
I've looked into the Google Maps geocoding API and other geocoding APIs; we can't use these due to terms-of-service issues (will explain if desired). Luckily we're using Postgres, which has a PostGIS Tiger Geocoder extension for just this purpose. I had a helluva time setting it up, and if I'm not mistaken it only covers the USA? If I'm wrong and we can set it up, we could store location tags as lat/lng tuples for (1) location normalization; (2) location radius scoring.
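For concreteness, here's a rough, untested sketch of what a lookup could look like once the extension is installed. Assumes node-postgres (`pg`); `geocode()` is the extension's built-in function:

```js
// Sketch: look up lat/lng for a US address via the PostGIS Tiger Geocoder.
// Assumes the postgis_tiger_geocoder extension is installed and the pg
// pool is configured elsewhere.
const { Pool } = require('pg');
const pool = new Pool();

async function geocodeAddress(address) {
  const { rows } = await pool.query(
    `SELECT g.rating,
            ST_Y(g.geomout) AS lat,
            ST_X(g.geomout) AS lng
       FROM geocode($1, 1) AS g`, // 1 = best match only
    [address]
  );
  return rows[0]; // undefined if the geocoder found nothing
}
```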
But let's punt Geocoder for later, and as a short-term solution simply dump all world cities into our database via the AdWords cities CSV (Creative Commons). Then we'll prevent creating any new location tags, since they're all already there. This solves the normalization issue (not the radius issue).
Some technical notes for using `adwords.csv`:
We'll want to filter out too-small administrative divisions (see "Target Type"). I'm not sure which ones to keep besides City, Country, Province, and State (ideas?).
Process: (1) parse / filter the CSV (see above); (2) insert into the database (with `RETURNING` so we get values back); (3) store the results (along with `id`) to `locations.json`; (4) copy/paste said file to the client, so it can be used by both client & server (see the seed-script sketch after these notes). Reasons for this procedure:
Client: the file is huge, so instead of `require()`ing it on the client (it's isomorphic, which would dramatically increase the client's `bundle.js`) we can add `locations.json` to `client/www`, where it'll be picked up & cached by Cloudfront for much faster delivery. Additionally, it'll only be `fetch()`ed when needed (SeedTags & CreateJob); a sketch of that follows below.
Server: running a `tag.text LIKE location.text` query every time we want to create a new job & pair its location will overload the database. So instead, keep `locs = require('locations.json')` handy to perform an in-memory similarity comparison (anyone have experience here? a rough matcher sketch follows below). This won't be an issue for custom-created jobs, since the client will be selecting from auto-complete options. But for scraped jobs, locations will come in willy-nilly and we'll want to find their closest matches.
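Here's a rough sketch of that seed script. The `tags` table/columns are placeholders, and the CSV headers ("Target Type", "Canonical Name") are my best guess at the AdWords geotargets format, so adjust as needed:

```js
// One-off seed script: parse & filter adwords.csv, insert the rows,
// then dump the results (with their generated ids) to locations.json.
const fs = require('fs');
const { parse } = require('csv-parse/sync');
const { Pool } = require('pg');

// Target Types we keep; too-small administrative divisions get dropped
const KEEP = new Set(['City', 'Country', 'Province', 'State']);

async function seedLocations() {
  const rows = parse(fs.readFileSync('adwords.csv'), { columns: true })
    .filter(row => KEEP.has(row['Target Type']));

  const pool = new Pool();
  const locations = [];
  for (const row of rows) {
    // RETURNING hands back the generated id, so locations.json mirrors the db
    const { rows: [tag] } = await pool.query(
      'INSERT INTO tags (text, type) VALUES ($1, $2) RETURNING id, text',
      [row['Canonical Name'], 'location']
    );
    locations.push(tag);
  }
  fs.writeFileSync('locations.json', JSON.stringify(locations));
  // final step is manual: copy locations.json into client/www
}
```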
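On the client side, the lazy `fetch()` could be as simple as this (the `/locations.json` path is assumed):

```js
// Client: lazy-load the locations list only when a component needs it
// (SeedTags / CreateJob), instead of bundling it via require().
let locationsPromise;
function getLocations() {
  locationsPromise = locationsPromise ||
    fetch('/locations.json').then(res => res.json()); // cached by Cloudfront
  return locationsPromise;
}
```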
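And on the server, a naive in-memory matcher for scraped locations. Bigram (Dice) similarity here is just a placeholder until someone with fuzzy-matching experience weighs in; pg_trgm or an npm library could replace it:

```js
// Server: match a scraped, free-form location string against the known
// locations list using bigram (Dice) similarity.
const locs = require('./locations.json');

function bigrams(str) {
  const s = str.toLowerCase();
  const grams = new Set();
  for (let i = 0; i < s.length - 1; i++) grams.add(s.slice(i, i + 2));
  return grams;
}

function similarity(a, b) {
  const A = bigrams(a), B = bigrams(b);
  if (!A.size || !B.size) return 0;
  let shared = 0;
  for (const gram of A) if (B.has(gram)) shared++;
  return (2 * shared) / (A.size + B.size); // Dice coefficient, 0..1
}

function closestLocation(scraped) {
  let best = null, bestScore = 0;
  for (const loc of locs) {
    const score = similarity(scraped, loc.text);
    if (score > bestScore) { best = loc; bestScore = score; }
  }
  return bestScore > 0.5 ? best : null; // threshold is a guess; tune it
}
```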