Geolocation data:
Main challenge: getting tract identifiers onto the violations data if they don't already exist
[ ] If lat and long don't already exist in the data, either use zip codes as the geographic identifier or geocode using the address and a geocoding API. The Google Maps geocoder, the Geocod.io API, and the Census geocoder are a few options; you'll need to do this in batches, probably in a .py script, given the runtime (see the geocoding sketch after this list)
[ ] Once you have lat/long, convert those into spatial point data and intersect them with census tract boundaries to find which tract each location falls in (see the spatial-join sketch after this list)
[ ] Use the Census API or data explorer to pull American Community Survey (ACS) data for the relevant years/variables (see the ACS sketch after this list)
Ways to make scope more manageable: possibly focus on states within the TRLA catchment area ('TX', 'MS', 'LA', 'KY', 'AL', 'TN') to have fewer addresses to geocode and fewer sets of tracts to pull; you could also increase the geographic unit of analysis to zip code or county, but that's less interesting; if time permits, add the housing locations data (in this repo)
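A minimal geocoding sketch, assuming the violations data has a single address-string column (the column name "address" and the filename are hypothetical). It uses the Census Bureau's one-line-address geocoding endpoint; for large files, the Census batch endpoint or Geocod.io will be faster.

```python
# Minimal geocoding sketch: loop over addresses and query the Census
# Bureau's one-line-address geocoder. Column/file names are assumptions.
import time
import requests
import pandas as pd

GEOCODER_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

def geocode_one(address: str):
    """Return (lat, long) for a single address, or (None, None) if no match."""
    params = {"address": address, "benchmark": "Public_AR_Current", "format": "json"}
    resp = requests.get(GEOCODER_URL, params=params, timeout=30)
    resp.raise_for_status()
    matches = resp.json().get("result", {}).get("addressMatches", [])
    if not matches:
        return None, None
    coords = matches[0]["coordinates"]
    return coords["y"], coords["x"]  # y = latitude, x = longitude

violations = pd.read_csv("violations.csv")  # hypothetical filename
lat_long = []
for addr in violations["address"]:
    lat_long.append(geocode_one(addr))
    time.sleep(0.5)  # be polite to the API; use the batch endpoint for big files
violations[["lat", "long"]] = pd.DataFrame(lat_long, index=violations.index)
```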
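A minimal spatial-join sketch for the tract intersection step, assuming the geocoded file has "lat"/"long" columns and that you've downloaded a TIGER/Line tract shapefile (one per state); filenames are assumptions.

```python
# Minimal spatial-join sketch: turn lat/long into points and intersect
# them with census tract polygons to recover each point's tract GEOID.
import geopandas as gpd
import pandas as pd

violations = pd.read_csv("violations_geocoded.csv")  # hypothetical filename

# Build a GeoDataFrame of points; TIGER/Line shapefiles use EPSG:4269 (NAD83)
points = gpd.GeoDataFrame(
    violations,
    geometry=gpd.points_from_xy(violations["long"], violations["lat"]),
    crs="EPSG:4269",
)

tracts = gpd.read_file("tl_2020_48_tract.shp")  # e.g., Texas tracts; one file per state

# Spatial join: each violation row gets the GEOID of the tract it falls inside
joined = gpd.sjoin(points, tracts[["GEOID", "geometry"]], how="left", predicate="within")
```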
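A minimal ACS pull sketch using the Census API directly with requests; the year, dataset, variable, and state are just examples, and an API key is optional at low request volumes.

```python
# Minimal ACS sketch: pull one tract-level ACS 5-year variable for one state;
# loop over states/variables as needed.
import requests
import pandas as pd

ACS_URL = "https://api.census.gov/data/2019/acs/acs5"  # year/dataset are examples
params = {
    "get": "NAME,B01003_001E",   # B01003_001E = total population (example variable)
    "for": "tract:*",
    "in": "state:48",            # 48 = Texas FIPS code
    # "key": "YOUR_CENSUS_API_KEY",  # uncomment if you have a key
}
resp = requests.get(ACS_URL, params=params, timeout=60)
resp.raise_for_status()
rows = resp.json()
acs = pd.DataFrame(rows[1:], columns=rows[0])

# Build a GEOID that matches the tract shapefile, for merging onto the spatial join
acs["GEOID"] = acs["state"] + acs["county"] + acs["tract"]
```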
DOL staffing:
Main challenge: getting the OPM workforce data into a usable format/investigating granularity of staffing data
[ ] Visit the site containing OPM workforce data across many years and read through the documentation (https://www.opm.gov/data/index.aspx). Things to investigate: (1) the granularity of job titles (e.g., does it have a category for inspectors or just general DOL staff?) and (2) the granularity of geographic identifiers (e.g., does it contain specific office locations, county-level sites, or just state?). I'm fairly sure it already gets down to the Wage and Hour Division within DOL (see the inspection sketch after this list)
[ ] Aggregate H-2A violations to whatever level of aggregation makes sense given the DOL data (e.g., state or broader regions); see the aggregation sketch after this list
[ ] You can then do descriptives and, if you want, ML or statistical modeling; if doing the latter, make sure staffing is measured pre-investigation, so you're looking at how staffing in 2017, for instance, predicts investigations later that year or in the following year (depending on the time granularity of the OPM data)
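A minimal sketch of the kind of granularity check described above, assuming you've downloaded one of the OPM/FedScope-style employment files. The filename and column names (AGYSUB = sub-agency, OCC = occupation, LOC = duty-station location) are assumptions to verify against the OPM documentation.

```python
# Minimal granularity check on the OPM workforce data. All column names and
# the "DL" Department of Labor agency prefix are assumptions to confirm
# against the documentation.
import pandas as pd

opm = pd.read_csv("opm_employment_2017.csv", dtype=str)  # hypothetical filename

# Which DOL sub-agencies are identified? (looking for the Wage and Hour Division)
print(opm.loc[opm["AGYSUB"].str.startswith("DL"), "AGYSUB"].value_counts())

# How specific are occupations? (inspectors vs. generic staff)
print(opm["OCC"].value_counts().head(30))

# How specific is geography? (state codes vs. county/duty-station level)
print(opm["LOC"].value_counts().head(30))
```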
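A minimal sketch of the aggregation and the pre-investigation timing, assuming state-year is the unit and that both files have "state"/"year" columns (filenames and column names are assumptions).

```python
# Minimal sketch: aggregate H-2A violations to state-year and join staffing
# lagged by one year so staffing is measured pre-investigation.
import pandas as pd

violations = pd.read_csv("h2a_violations.csv")        # hypothetical filename
staffing = pd.read_csv("dol_staffing_by_state.csv")   # hypothetical filename

# Count violations per state-year
viol_agg = (
    violations.groupby(["state", "year"])
    .size()
    .reset_index(name="n_violations")
)

# Shift staffing forward one year: 2017 staffing lines up with 2018 outcomes
staffing["year"] = staffing["year"] + 1
merged = viol_agg.merge(
    staffing.rename(columns={"staff_count": "staff_prior_year"}),
    on=["state", "year"],
    how="left",
)
```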
Text as data:
Main challenge: the data is already acquired, so acquisition is straightforward; the main goals are using the problem set two text-as-data content to find interesting patterns (expanding beyond that if relevant) and merging the jobs and violations data
[ ] Merge the jobs data with the violations data (you might want to coordinate with the ML/stats group) to construct a violations indicator (see the merge sketch after this list)
[ ] Use text-as-data techniques to explore differences in the language of addendums between employers with violations and employers without (see the TF-IDF sketch after this list)
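A minimal sketch of the merge and indicator construction, assuming both files have an employer-name column (the column and file names here are assumptions). Real matching will probably need heavier name cleaning or fuzzy matching, so treat this exact-match version as a starting point.

```python
# Minimal jobs/violations merge: normalize employer names, then flag jobs
# rows whose employer appears in the violations data.
import pandas as pd

jobs = pd.read_csv("dol_quarterly_jobs.csv")     # hypothetical filename
violations = pd.read_csv("h2a_violations.csv")   # hypothetical filename

def clean_name(s: pd.Series) -> pd.Series:
    """Light normalization so exact matching catches more employers."""
    return s.str.upper().str.strip().str.replace(r"[^A-Z0-9 ]", "", regex=True)

jobs["employer_clean"] = clean_name(jobs["EMPLOYER_NAME"])            # column name assumed
violations["employer_clean"] = clean_name(violations["legal_name"])   # column name assumed

violators = set(violations["employer_clean"])
jobs["any_violation"] = jobs["employer_clean"].isin(violators).astype(int)
```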
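A minimal sketch of one text-as-data comparison: fit a single TF-IDF vocabulary over all addendum text, then look at which terms differ most in average weight between violators and non-violators. The "addendum_text" and "any_violation" column names are assumptions carried over from the merge sketch.

```python
# Minimal TF-IDF comparison of addendum language across the violation indicator.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

jobs = pd.read_csv("jobs_with_violations.csv")  # hypothetical filename
text = jobs["addendum_text"].fillna("")

vectorizer = TfidfVectorizer(stop_words="english", min_df=10, ngram_range=(1, 2))
X = vectorizer.fit_transform(text)
terms = np.array(vectorizer.get_feature_names_out())

is_viol = (jobs["any_violation"] == 1).to_numpy()
mean_viol = np.asarray(X[is_viol].mean(axis=0)).ravel()
mean_clean = np.asarray(X[~is_viol].mean(axis=0)).ravel()

# Largest positive differences = terms more typical of violators' addendums
diff = mean_viol - mean_clean
order = np.argsort(diff)
print("More typical of violators:", terms[order[-20:]][::-1])
print("More typical of non-violators:", terms[order[:20]])
```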
ML/stats:
Main challenge: since it's not explicitly covered in the course content, writing ML code in sklearn or other packages
[ ] Use the DOL quarterly jobs data as the universe of potential employers; see the note for the group above about where to find it and the relevant script
[ ] Decide on the time unit of analysis, e.g., all employers ever, employers repeated across months, employers repeated across quarters, etc.
[ ] Use the violations data to create different binary labels in the employer dataset, e.g., "investigated that month," "violation that month," etc. (see the labeling sketch after this list)
[ ] Predict those labels using whatever features are interesting within the jobs data (see the classifier sketch after this list)
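A minimal sketch of label construction at the employer-month level, assuming the jobs data has an employer identifier and a posting date and the violations data has an employer identifier and a case date; every column and file name here is an assumption.

```python
# Minimal employer-month labeling sketch: build the universe from the jobs
# data, then flag employer-months that appear in the violations data.
import pandas as pd

jobs = pd.read_csv("dol_quarterly_jobs.csv")    # hypothetical filename
violations = pd.read_csv("h2a_violations.csv")  # hypothetical filename

# One row per employer-month in the jobs data (the "universe")
jobs["month"] = pd.to_datetime(jobs["posting_date"]).dt.to_period("M")
universe = jobs.drop_duplicates(subset=["employer_id", "month"])

# Employer-months in which a violation case occurred
violations["month"] = pd.to_datetime(violations["case_date"]).dt.to_period("M")
viol_months = violations[["employer_id", "month"]].drop_duplicates()
viol_months["violation_that_month"] = 1

labeled = universe.merge(viol_months, on=["employer_id", "month"], how="left")
labeled["violation_that_month"] = labeled["violation_that_month"].fillna(0).astype(int)
```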
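A minimal sklearn sketch for the prediction step; the feature columns are placeholders to swap for whatever jobs-data fields the group settles on, and the labeled file is assumed to come from the labeling sketch above.

```python
# Minimal classifier sketch: predict the employer-month label from a few
# placeholder job features and report held-out performance.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

labeled = pd.read_csv("employer_month_labeled.csv")  # hypothetical filename

feature_cols = ["n_workers_requested", "wage_offer", "n_job_postings"]  # placeholders
X = labeled[feature_cols].fillna(0)
y = labeled["violation_that_month"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```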