Geolocation data:
Main challenge: getting tract identifiers onto the violations data if they don't already exist
[ ] If lat and long don't already exist in the data, either use zip codes as the geographic identifier or geocode using the address and a geocoding API. The Google Maps geocoder, the Geocod.io API, and the Census geocoder are a few options; you'll need to do this in batches, probably in a .py script, given the runtime (see the geocoding sketch after this list)
[ ] Once you have lat/long, convert those into spatial point data and intersect them with census tract boundaries to find which tract each location falls in (see the spatial-join sketch after this list)
[ ] Use the Census API or data explorer to pull American Community Survey (ACS) data for the relevant years/variables (see the ACS sketch after this list)
Ways to make scope more manageable: possibly focus on states within the TRLA catchment area ('TX', 'MS', 'LA', 'KY', 'AL', 'TN') to have fewer addresses to geocode and fewer sets of tracts to pull; you could also increase the geographic unit of analysis to zip code or county, but that's less interesting; if time permits, add the housing locations data (in this repo)
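A minimal geocoding sketch, assuming the violations data has a single address-string column (the column name "address" and the filename are hypothetical). It uses the Census Bureau's one-line-address geocoding endpoint; for large files, the Census batch endpoint or Geocod.io will be faster.

```python
# Minimal geocoding sketch: loop over addresses and query the Census
# Bureau's one-line-address geocoder. Column/file names are assumptions.
import time
import requests
import pandas as pd

GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

def geocode_one(address: str):
    """Return (lat, long) for a single address, or (None, None) if no match."""
    params = {"address": address, "benchmark": "Public_AR_Current", "format": "json"}
    resp = requests.get(GEOCODER_URL, params=params, timeout=30)
    resp.raise_for_status()
    matches = resp.json().get("result", {}).get("addressMatches", [])
    if not matches:
        return None, None
    coords = matches[0]["coordinates"]
    return coords["y"], coords["x"]  # y = latitude, x = longitude

violations = pd.read_csv("violations.csv")  # hypothetical filename
lat_long = []
for addr in violations["address"]:
    lat_long.append(geocode_one(addr))
    time.sleep(0.5)  # be polite to the API; use the batch endpoint for big files
violations[["lat", "long"]] = pd.DataFrame(lat_long, index=violations.index)
```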
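A minimal spatial-join sketch for the tract intersection step, assuming the geocoded file has "lat"/"long" columns and that you've downloaded a TIGER/Line tract shapefile (one per state); filenames are assumptions.

```python
# Minimal spatial-join sketch: turn lat/long into points and intersect
# them with census tract polygons to recover each point's tract GEOID.
import geopandas as gpd
import pandas as pd

violations = pd.read_csv("violations_geocoded.csv")  # hypothetical filename

# Build a GeoDataFrame of points; TIGER/Line shapefiles use EPSG:4269 (NAD83)
points = gpd.GeoDataFrame(
    violations,
    geometry=gpd.points_from_xy(violations["long"], violations["lat"]),
    crs="EPSG:4269",
)

tracts = gpd.read_file("tl_2020_48_tract.shp")  # e.g., Texas tracts; one file per state

# Spatial join: each violation row gets the GEOID of the tract it falls inside
joined = gpd.sjoin(points, tracts[["GEOID", "geometry"]], how="left", predicate="within")
```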
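A minimal ACS pull sketch using the Census API directly with requests; the year, dataset, variable, and state are just examples, and an API key is optional at low request volumes.

```python
# Minimal ACS sketch: pull one tract-level ACS 5-year variable for one state;
# loop over states/variables as needed.
import requests
import pandas as pd

ACS_URL = "https://api.census.gov/data/2019/acs/acs5"  # year/dataset are examples
params = {
    "get": "NAME,B01003_001E",   # B01003_001E = total population (example variable)
    "for": "tract:*",
    "in": "state:48",            # 48 = Texas FIPS code
    # "key": "YOUR_CENSUS_API_KEY",  # uncomment if you have a key
}
resp = requests.get(ACS_URL, params=params, timeout=60)
resp.raise_for_status()
rows = resp.json()
acs = pd.DataFrame(rows[1:], columns=rows[0])

# Build a GEOID that matches the tract shapefile, for merging onto the spatial join
acs["GEOID"] = acs["state"] + acs["county"] + acs["tract"]
```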
DOL staffing:
Main challenge: getting the OPM workforce data into a usable format/investigating granularity of staffing data
[ ] Visit the site containing OPM workforce data across many years and read through the documentation (https://www.opm.gov/data/index.aspx). Things to investigate: (1) the granularity of job titles (e.g., does it have a category for inspectors or just general DOL staff?) and (2) the granularity of geographic identifiers (e.g., does it contain specific office locations, county-level sites, or just state?). I'm fairly sure it already gets down to the Wage and Hour Division within DOL (see the inspection sketch after this list)
[ ] Aggregate H-2A violations to whatever level of aggregation makes sense given the DOL data (e.g., state or broader regions); see the aggregation sketch after this list
[ ] You can then do descriptives and, if you want, ML or statistical modeling; if doing the latter, make sure staffing is measured pre-investigation, so you're looking at how staffing in 2017, for instance, predicts investigations later that year or in the following year (depending on the time granularity of the OPM data)
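A minimal sketch of the kind of granularity check described above, assuming you've downloaded one of the OPM/FedScope-style employment files. The filename and column names (AGYSUB = sub-agency, OCC = occupation, LOC = duty-station location) are assumptions to verify against the OPM documentation.

```python
# Minimal granularity check on the OPM workforce data. All column names and
# the "DL" Department of Labor agency prefix are assumptions to confirm
# against the documentation.
import pandas as pd

opm = pd.read_csv("opm_employment_2017.csv", dtype=str)  # hypothetical filename

# Which DOL sub-agencies are identified? (looking for the Wage and Hour Division)
print(opm.loc[opm["AGYSUB"].str.startswith("DL"), "AGYSUB"].value_counts())

# How specific are occupations? (inspectors vs. generic staff)
print(opm["OCC"].value_counts().head(30))

# How specific is geography? (state codes vs. county/duty-station level)
print(opm["LOC"].value_counts().head(30))
```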
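A minimal sketch of the aggregation and the pre-investigation timing, assuming state-year is the unit and that both files have "state"/"year" columns (filenames and column names are assumptions).

```python
# Minimal sketch: aggregate H-2A violations to state-year and join staffing
# lagged by one year so staffing is measured pre-investigation.
import pandas as pd

violations = pd.read_csv("h2a_violations.csv")        # hypothetical filename
staffing = pd.read_csv("dol_staffing_by_state.csv")   # hypothetical filename

# Count violations per state-year
viol_agg = (
    violations.groupby(["state", "year"])
    .size()
    .reset_index(name="n_violations")
)

# Shift staffing forward one year: 2017 staffing lines up with 2018 outcomes
staffing["year"] = staffing["year"] + 1
merged = viol_agg.merge(
    staffing.rename(columns={"staff_count": "staff_prior_year"}),
    on=["state", "year"],
    how="left",
)
```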
Text as data:
Main challenge: the data is already acquired, so acquisition is straightforward; the main goals are using the problem set two text-as-data content to find interesting patterns (expanding beyond that if relevant) and merging the jobs and violations data
[ ] Merge the jobs data with the violations data (you might want to coordinate with the ML/stats group) to construct a violations indicator (see the merge sketch after this list)
[ ] Use text-as-data techniques to explore differences in the language of addendums between employers with violations and employers without (see the TF-IDF sketch after this list)
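A minimal sketch of the merge and indicator construction, assuming both files have an employer-name column (the column and file names here are assumptions). Real matching will probably need heavier name cleaning or fuzzy matching, so treat this exact-match version as a starting point.

```python
# Minimal jobs/violations merge: normalize employer names, then flag jobs
# rows whose employer appears in the violations data.
import pandas as pd

jobs = pd.read_csv("dol_quarterly_jobs.csv")     # hypothetical filename
violations = pd.read_csv("h2a_violations.csv")   # hypothetical filename

def clean_name(s: pd.Series) -> pd.Series:
    """Light normalization so exact matching catches more employers."""
    return s.str.upper().str.strip().str.replace(r"[^A-Z0-9 ]", "", regex=True)

jobs["employer_clean"] = clean_name(jobs["EMPLOYER_NAME"])            # column name assumed
violations["employer_clean"] = clean_name(violations["legal_name"])   # column name assumed

violators = set(violations["employer_clean"])
jobs["any_violation"] = jobs["employer_clean"].isin(violators).astype(int)
```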
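A minimal sketch of one text-as-data comparison: fit a single TF-IDF vocabulary over all addendum text, then look at which terms differ most in average weight between violators and non-violators. The "addendum_text" and "any_violation" column names are assumptions carried over from the merge sketch.

```python
# Minimal TF-IDF comparison of addendum language across the violation indicator.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

jobs = pd.read_csv("jobs_with_violations.csv")  # hypothetical filename
text = jobs["addendum_text"].fillna("")

vectorizer = TfidfVectorizer(stop_words="english", min_df=10, ngram_range=(1, 2))
X = vectorizer.fit_transform(text)
terms = np.array(vectorizer.get_feature_names_out())

is_viol = (jobs["any_violation"] == 1).to_numpy()
mean_viol = np.asarray(X[is_viol].mean(axis=0)).ravel()
mean_clean = np.asarray(X[~is_viol].mean(axis=0)).ravel()

# Largest positive differences = terms more typical of violators' addendums
diff = mean_viol - mean_clean
order = np.argsort(diff)
print("More typical of violators:", terms[order[-20:]][::-1])
print("More typical of non-violators:", terms[order[:20]])
```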
ML/stats:
Main challenge: since it's not explicitly covered in the course content, writing ML code in sklearn or other packages
[ ] Use the DOL quarterly jobs data as the universe of potential employers; see the note for the group above about where to find it and the relevant script
[ ] Decide on the time unit of analysis, e.g., all employers ever, employers repeated across months, employers repeated across quarters, etc.
[ ] Use the violations data to create different binary labels in the employer dataset, e.g., "investigated that month," "violation that month," etc. (see the labeling sketch after this list)
[ ] Predict those labels using whatever features are interesting within the jobs data (see the classifier sketch after this list)
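A minimal sketch of label construction at the employer-month level, assuming the jobs data has an employer identifier and a posting date and the violations data has an employer identifier and a case date; every column and file name here is an assumption.

```python
# Minimal employer-month labeling sketch: build the universe from the jobs
# data, then flag employer-months that appear in the violations data.
import pandas as pd

jobs = pd.read_csv("dol_quarterly_jobs.csv")    # hypothetical filename
violations = pd.read_csv("h2a_violations.csv")  # hypothetical filename

# One row per employer-month in the jobs data (the "universe")
jobs["month"] = pd.to_datetime(jobs["posting_date"]).dt.to_period("M")
universe = jobs.drop_duplicates(subset=["employer_id", "month"])

# Employer-months in which a violation case occurred
violations["month"] = pd.to_datetime(violations["case_date"]).dt.to_period("M")
viol_months = violations[["employer_id", "month"]].drop_duplicates()
viol_months["violation_that_month"] = 1

labeled = universe.merge(viol_months, on=["employer_id", "month"], how="left")
labeled["violation_that_month"] = labeled["violation_that_month"].fillna(0).astype(int)
```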
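A minimal sklearn sketch for the prediction step; the feature columns are placeholders to swap for whatever jobs-data fields the group settles on, and the labeled file is assumed to come from the labeling sketch above.

```python
# Minimal classifier sketch: predict the employer-month label from a few
# placeholder job features and report held-out performance.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

labeled = pd.read_csv("employer_month_labeled.csv")  # hypothetical filename

feature_cols = ["n_workers_requested", "wage_offer", "n_job_postings"]  # placeholders
X = labeled[feature_cols].fillna(0)
y = labeled["violation_that_month"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```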