Exploratory data analysis

rebeccajohnson88 commented 3 years ago

[x] Load the Wage and Hour Division (WHD) enforcement data linked in the README and give some basic descriptives: how businesses are identified (business name, business address, EIN or tax ID), whether there's a field indicating that the employer is part of the H-2 visa program, the number of rows, and the span of years
[ ] Similarly, load the H-2 jobs data and start to explore it. @rebeccajohnson88 will check with @TRLegalAid on whether we can get a static copy of the data that sits behind the interface
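
A rough sketch of what the basic descriptives in the first task might look like in pandas. The toy data frame and every column name in it (`business_name`, `ein`, `h2a_violation_count`, `findings_start_date`, and so on) are placeholders made up for illustration; the real WHD file may name these fields differently, so check the data dictionary before reusing any of this.

```python
import pandas as pd

# Toy stand-in for the WHD enforcement data. All column names and values
# are hypothetical placeholders, not taken from the real file.
whd = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "business_name": ["Acme Farms", "Acme Farms", "Blue Orchard", "Cedar Nursery"],
    "business_address": ["1 Main St", "1 Main St", "22 Oak Rd", "5 Pine Ave"],
    "ein": ["11-1111111", "11-1111111", None, "33-3333333"],
    "h2a_violation_count": [0, 3, 0, 12],
    "findings_start_date": ["2015-03-01", "2018-06-15", "2019-01-10", "2020-09-30"],
})

# Number of rows
n_rows = len(whd)

# How complete is each candidate business identifier?
id_cols = ["business_name", "business_address", "ein"]
share_missing = whd[id_cols].isna().mean()

# Span of years covered, based on the (assumed) date column
dates = pd.to_datetime(whd["findings_start_date"])
year_span = (dates.dt.year.min(), dates.dt.year.max())

# Columns whose names suggest they relate to the H-2 program
h2_cols = [c for c in whd.columns if "h2" in c.lower()]

print(n_rows, year_span, h2_cols)
print(share_missing)
```

With the real file, the toy data frame would be replaced by whatever `pd.read_csv` call matches the downloaded data.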

rebeccajohnson88 commented 3 years ago

Questions:

[ ] For the DOL data, I kept all the original column names. Is that fine?
[ ] Each row is a case record, and each record has its own "total violation count" and "H-2A/B violation count" (some are just a few, some are in the thousands or more). I'm not sure whether this count (a) is cumulative from the past, (b) represents the number of cases filed with the government agency by the time the data was entered, or (c) is the number of employees involved. There aren't a lot of duplicates of employees, so it's not as if each row represents a single violation.
[ ] If we do further plotting or analysis, for example by year, do we want to sum the violation counts or count the number of rows entered (i.e., the frequency) in that year?

rebeccajohnson88 commented 3 years ago

> Questions:

[X] For the DOL data, I kept all the original column names. Is that fine?

Yep, it's fine to keep the original column names, but just make sure they contain no spaces or special punctuation.
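
One possible way to do that cleanup, as a minimal sketch: replace every run of non-alphanumeric characters with an underscore and otherwise leave the names alone. The helper `clean_column_name` and the example column names are hypothetical and only there to show the idea.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Replace runs of spaces/punctuation with "_" and trim stray underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_")


# Hypothetical column names, chosen to include spaces and punctuation
df = pd.DataFrame(columns=["Case ID", "Trade Name (DBA)", "H-2A Violations #", "zip_cd"])
df.columns = [clean_column_name(c) for c in df.columns]

print(list(df.columns))
```

Names that are already clean, like `zip_cd` here, pass through unchanged, so the original names are preserved wherever possible.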

[X] Each row is a case record, and each record has its own "total violation count" and "H-2A/B violation count" (some are just a few, some are in the thousands or more). I'm not sure whether this count (a) is cumulative from the past, (b) represents the number of cases filed with the government agency by the time the data was entered, or (c) is the number of employees involved. There aren't a lot of duplicates of employees, so it's not as if each row represents a single violation.

[X] If we do further plotting or analysis, for example by year, do we want to sum the violation counts or count the number of rows entered (i.e., the frequency) in that year?

The data dictionary (https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/data/documentation/whd_data_dictionary.csv) distinguishes between two units of analysis:

(1) Cases: one case represents one investigation of an employer at a particular point in time

(2) Violations: each case can find multiple violations (hence the total violation counts). These aren't necessarily cumulative; they are the violations found as part of the same case/investigation

So for plotting/descriptives, I'd focus on (1) the count of cases by employer/area etc., (2) the count of violations by employer/area, and (3) the distribution of the number of violations per case
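
To make the case vs. violation distinction concrete, here is a small pandas sketch of the three descriptives, plus a by-year table that reports both units side by side. The data and the column names (`case_id`, `employer`, `year`, `total_violation_count`) are invented for illustration and would need to be swapped for the real WHD field names.

```python
import pandas as pd

# Toy case-level data: one row per case (investigation).
# Column names and values are hypothetical.
cases = pd.DataFrame({
    "case_id": [101, 102, 103, 104, 105],
    "employer": ["Acme Farms", "Acme Farms", "Blue Orchard", "Blue Orchard", "Cedar Nursery"],
    "year": [2018, 2019, 2018, 2018, 2019],
    "total_violation_count": [2, 40, 0, 5, 1200],
})

# (1) Count of cases by employer: how many investigations each employer had
cases_by_employer = cases.groupby("employer")["case_id"].nunique()

# (2) Count of violations by employer: sum of violations across their cases
violations_by_employer = cases.groupby("employer")["total_violation_count"].sum()

# (3) Distribution of the number of violations per case
violations_per_case = cases["total_violation_count"].describe()

# By year, both units together: number of cases and summed violations
by_year = cases.groupby("year").agg(
    n_cases=("case_id", "nunique"),
    n_violations=("total_violation_count", "sum"),
)

print(cases_by_employer)
print(violations_by_employer)
print(violations_per_case)
print(by_year)
```

Reporting both columns in the by-year table avoids having to pick between summing violations and counting rows, since a single case with a very large violation count can dominate the sum.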

Other to-dos:

Rather than having the separate data_exploration folder, could you (1) delete it, (2) create a folder called code, and (3) name the data exploration script 00_explore_WHD_data?
