rebeccajohnson88 / qss20_s21_proj

Repo for DOL Summer Data Challenge on equity in H-2A oversight
Creative Commons Zero v1.0 Universal

Exploratory data analysis #1

Closed rebeccajohnson88 closed 3 years ago

rebeccajohnson88 commented 3 years ago

Questions:

rebeccajohnson88 commented 3 years ago

> Questions:

  • [X] For the DOL data, I kept all the original column names, if that's fine?

Yep, it's fine to keep the original column names, but just make sure they contain no spaces or special punctuation.
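As a rough sketch of that cleanup: the helper below keeps the original wording of each column name but replaces spaces and punctuation with underscores. The function name `clean_column_name` and the example column names are made up for illustration; they are not taken from the WHD data.

```python
import re

def clean_column_name(name: str) -> str:
    """Replace every run of characters that is not a letter, digit,
    or underscore with a single underscore, then trim stray underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z_]+", "_", name.strip())
    return cleaned.strip("_")

# Hypothetical column names, for illustration only
raw_columns = ["Case ID", "H-2A Violation Count", "trade_nm"]
clean_columns = [clean_column_name(c) for c in raw_columns]
print(clean_columns)

# With a pandas DataFrame `df`, the same helper could be applied as:
# df.columns = [clean_column_name(c) for c in df.columns]
```

Names that are already clean (like `trade_nm` here) pass through unchanged, so the helper is safe to run over every column.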

  • [X] Each row is a case record, and each record has a different "total violation count" and "h2a/b violation count" (some have just a few, some have thousands or more). I'm not sure whether this count is cumulative from the past, OR represents the number of cases filed with the government agency by the time the data was entered, OR is the number of employees involved. There aren't a lot of duplicates of employees, so it's not like each row represents one violation.
  • [X] So if we were to do further plotting/analysis, for example for a specific year, do we want to sum all of the violation counts, or count the number of records entered (i.e., the frequency) in that year?

If you look at the data dictionary here: https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/data/documentation/whd_data_dictionary.csv you'll see it distinguishes between two units of analysis:

(1) A case: this represents one investigation against an employer at a particular point in time.

(2) Violations: each case can find multiple violations (hence the total violation counts). These aren't necessarily cumulative but are violations found as part of the same case/investigation.

So for plotting/descriptives, I'd have some descriptives focusing on (1) count of cases by employer/area etc, (2) count of violations by employer/area, (3) distribution of # of violations by case
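A minimal pandas sketch of those three descriptives, on toy data. The column names (`case_id`, `employer`, `violation_count`) and the values are hypothetical stand-ins; the real names should come from the data dictionary.

```python
import pandas as pd

# Toy data: one row per case. Column names and values are hypothetical.
cases = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "employer": ["Farm A", "Farm A", "Farm B", "Farm C"],
    "violation_count": [3, 0, 12, 3],
})

# (1) Count of cases by employer: each row is one case, so count rows
cases_by_employer = cases.groupby("employer").size()

# (2) Count of violations by employer: sum the per-case violation counts
violations_by_employer = cases.groupby("employer")["violation_count"].sum()

# (3) Distribution of the number of violations per case
violations_per_case = cases["violation_count"].describe()

print(cases_by_employer)
print(violations_by_employer)
print(violations_per_case)
```

Swapping `"employer"` for an area or year column would give the same breakdowns along those dimensions. Counting rows gives the frequency of cases, and summing the count column gives total violations, so both can be reported side by side.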

Two other to-dos:

Rather than having the separate data_exploration folder, could you (1) delete it, (2) create a folder called code, and (3) name the data exploration script 00_explore_WHD_data