CG questions for RJ - Githubissues

camguage commented 3 years ago

[ ] Pushed updated figures to github under \output\figs. If these look good I can add in Overleaf
[ ] One DOL comment on this sentence ("Figure 4 presents these as percentages to account for the low counts of investigations relative to overall employers") says "Please also provide the actual data for the four categories in the text for clarity." 1) I'm not quite sure what 4 categories they're referring to, and 2) with the percentages now on the figure directly, should I include this data in the text as well? Should I include all or just part of it?
[ ] Figure 11, which is now Figure 19 after moving to appendix (called Model Performance: any WHD investigation) needs the acronyms spelled out/explained. Is there any chance you could spell these out for me?
[ ] We need to "summarize data" to support the claim that "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" (line 862). Which data/figures should I point to here?
[ ] Line 353- Sorry but I need another clarification on this one. When I add a footnote, what "ones" should I add that are filtered out- just the status codes like "CERTIFICATION EXPIRED"? Also is this what they're asking us to clarify when their question is about script 2?
[ ] Per "yep I can share an outline of this" below, wanted to confirm you already implemented the whole table as Table 1?
[ ] As I was turning the one-column tables into figures, I am running into an issue because what is now Figure 20 contains a table that is longer than a page. This isn't working inside \begin{figure} ... \end{figure}. Not sure what the fix is here. Should I keep it as a table, suppress the name "table", and write a caption* starting with "Figure 20 ... "?
[ ] For Figures 6 and 7 (in case the numbers get switched around, they are the ones beginning with "Monthly patterns pre and during COVID-19..."), I have 2 questions. 1) Are the vertical axes investigation counts or percents? Just so I can specify. 2) They suggest changing the x-axis to labels like "Jan-2019", etc. I would go ahead and implement but I cannot seem to find the code that creates these figures
[ ] Line 735 says "pooling data from all states across 2014-2021" and we are asked to specify the data sources. Is this just the WHD compliance action data?
[ ] Lines 738/9 use the term "training/estimation set employers" and we are asked to specify what this means. If you or Helen could do this that would be awesome
[ ] Bottom of page 31, I have a footnote link that is hanging over the margin that I cannot seem to fix. Not sure if you knew any fixes or if copyeditor could handle
[ ] Figures 12 and 13 (both titled "Predictions versus true investigations label") require clarification for how to interpret them in the text, clarification for the word "label" in the title, and "a note to explain the analytic model and the sample sizes used for the data shown." I unfortunately do not understand these figures enough to add these changes
[ ] On table 4, they're asking for more information about the analytic model/sample size (and to apply this where else is appropriate). If you or Helen could help with this that would be great
[ ] Line 873, we have "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" and they ask us to summarize the data to support the claim. Was hoping to get help as this seems to be about ML

from lizzie:

[ ] Line 121 - can you validate that what I wrote is correct?
[ ] Line 164 - question about source citation

rebeccajohnson88 commented 3 years ago

[ ] Line 336, there is a note to “please clarify” the link 02_RenameCol… You commented “possibly a data thing to add the full set of status codes to an appendix and list ones we filter out.” Do we still want to do this? If so we might need Eunice/Grant because I didn’t work much on that part of the project

Hm I think this might be reflecting two separate things: (1) clarifying that the links are to code rather than data (since their Q was about whether the link was to data), (2) clarifying the status code filtering. For the status code filtering, I think you worked with that data in this script when you filtered to partial or fully approved certificates, so could add to a footnote what ones are filtered out? https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/09_fuzzy_matching_TRLA.R

approved_only <- h2a %>%
  filter(status_cleaned %in% c("- CERTIFICATION", "- PARTIAL CERTIFICATION"))
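For the footnote, one way to list the status codes that get dropped is to invert that filter and count what remains. This is only a sketch on made-up data: the toy `h2a` rows, the two non-approved status values, and the names `kept_codes` and `filtered_out` are assumptions for illustration, not values taken from the real file.

```r
library(dplyr)

# Hypothetical toy data: only the two approved codes come from the script;
# the other status values are assumed for illustration.
h2a <- tibble(
  status_cleaned = c("- CERTIFICATION", "- PARTIAL CERTIFICATION",
                     "- CERTIFICATION EXPIRED", "- DENIED", "- CERTIFICATION")
)

# Codes retained by the approved-only filter
kept_codes <- c("- CERTIFICATION", "- PARTIAL CERTIFICATION")

# Status codes dropped by the filter, with the number of rows for each
filtered_out <- h2a %>%
  filter(!status_cleaned %in% kept_codes) %>%
  count(status_cleaned, name = "n_rows")

print(filtered_out)
```

Running the same two steps on the real `h2a` object should give the list of codes to name in the footnote. Rows with a missing status would also show up here, since the original filter drops them too.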

[x] Line 354/355- Detailed info on data including webpage- if you could provide a link for me here that would be great!

Here's link to WHD compliance action data: https://enforcedata.dol.gov/views/data_summary.php

it's the data you worked with that has the compliance actions. I think you can just quote the data description in a footnote.

[x] Line 451- DOL comment asking to explain in detail the models prescribed by the study. While I wish I could explain the modeling stuff, I don’t think I understand it quite enough to do it adequately.

@helenma0223234 sent note on Slack

[x] Line 595- DOL comments “Due to the various data sources and analysis methods used in the study, we recommend providing a table listing each research question and the corresponding data sources and analysis methods to improve clarity for reader.” I don’t mind eventually putting together the table for LaTeX, but again I’m not sure if I’m familiar enough with the methods to fill out the content. Even if you could give me a sketch of a table that would be great!

Yep I can share an outline of this

rebeccajohnson88 commented 3 years ago

[x] Are we okay resolving comments about single-column boxes by changing their names to figures?

Yep I'd just change to figures --- i think you can do \begin{figure} \end{figure} and then have just the tabular inside

[x] DOL comment on Figure 9: "If the high-violation attorney/agents have no TRLA records calls, please consider removing the “None” category rather than showing “None” as 100%." How would you like me to resolve this one?

Hmm I would clarify in the caption what the none represents. In particular, the ones colored as none are ones where 100% of the attorney/agent's employers had a WHD violation but where there were no H-2A investigations related to that attorney/agent. So the point of the figure is to show that many high-WHD attorney/agents also had an H-2A intake call.

[x] They want units on the x-axis of figure 10: should I put (employers in that investigation category/all employers)?

Yep that sounds fine -- maybe keeping the current label and then adding via ( ) on a next line

[x] I updated all of the visualizations in Script 13 to include labels on the bars, per a DOL comment. Let me know if they look okay to you (I can also switch them up aesthetically), and then I can ggsave and adjust on Overleaf

ah yea it's easier for me to view them as outputs rather than run the script. could you do ggsave and push them as outputs to github?

rebeccajohnson88 commented 2 years ago

[x] Pushed updated figures to github under \output\figs. If these look good I can add in Overleaf

Yep looks good if you're able to put on overleaf

[ ] One DOL comment on this sentence ("Figure 4 presents these as percentages to account for the low counts of investigations relative to overall employers") says "Please also provide the actual data for the four categories in the text for clarity." 1) I'm not quite sure what 4 categories they're referring to, and 2) with the percentages now on the figure directly, should I include this data in the text as well? Should I include all or just part of it?

I think they meant the three categories. I think now that labels are added in text you can add a range of numbers

[X] Figure 11, which is now Figure 19 after moving to appendix (called Model Performance: any WHD investigation) needs the acronyms spelled out/explained. Is there any chance you could spell these out for me?

added; soc is only acronym they might have gotten confused by job

[X] We need to "summarize data" to support the claim that "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" (line 862). Which data/figures should I point to here?

Added discussion - there is no data since the models didn't work/just predicted all 0's or all 1's. I added that line and maybe you could propagate through to the similar one?
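To illustrate what "just predicted all 0's" means for the write-up, here is a minimal sketch with made-up labels and predictions (not project results; `truth`, `pred`, and the other names are hypothetical). When every prediction is the same class, the confusion table has an empty column and recall for the positive class is zero, so there is little to summarize beyond that fact.

```r
# Hypothetical labels and predictions; not results from the project
truth <- c(0, 0, 1, 0, 1, 0, 0, 1)
pred  <- rep(0, length(truth))  # a model that predicts 0 for every employer

# A model is degenerate here if it only ever outputs one class
is_degenerate <- length(unique(pred)) == 1

# Confusion table with both classes forced to appear
confusion <- table(truth = factor(truth, levels = c(0, 1)),
                   predicted = factor(pred, levels = c(0, 1)))

# Share of true positives the model recovers
recall <- confusion["1", "1"] / sum(confusion["1", ])

print(confusion)
print(recall)
```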

[ ] Line 353- Sorry but I need another clarification on this one. When I add a footnote, what "ones" should I add that are filtered out- just the status codes like "CERTIFICATION EXPIRED"? Also is this what they're asking us to clarify when their question is about script 2?

Yes, just the filtered status codes
I think this was a mislabeled comment, so I think addressing status codes is fine

[X] Per "yep I can share an outline of this" below, wanted to confirm you already implemented the whole table as Table 1?

Yes the table is added

[ ] As I was turning the one-column tables into figures, I am running into an issue because what is now Figure 20 contains a table that is longer than a page. This isn't working inside \begin{figure} ... \end{figure}. Not sure what the fix is here. Should I keep it as a table, suppress the name "table", and write a caption* starting with "Figure 20 ... "?

Yep I think anything that works is fine

[ ] For Figures 6 and 7 (in case the numbers get switched around, they are the ones beginning with "Monthly patterns pre and during COVID-19..."), I have 2 questions. 1) Are the vertical axes investigation counts or percents? Just so I can specify. 2) They suggest changing the x-axis to labels like "Jan-2019", etc. I would go ahead and implement but I cannot seem to find the code that creates these figures

Vertical axes are counts (label is: "count unique employers" so not sure why they thought unclear?)
Ah sorry, this is a pushing error on my end. Here is code that you can use within script 13, but I think it's not a huge deal if you don't get to it


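# packages this snippet relies on (script 13 likely loads them already);
# theme_DOL() and color_guide are assumed to be defined earlier in script 13
library(dplyr)
library(lubridate)
library(ggplot2)
library(scales)
library(here)

# put the relevant date column in a cleaner date format
general_data <- general_data %>%
  mutate(JOB_START_DATE = ymd(JOB_START_DATE))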

# extract the year for plotting
general_data <- general_data %>%
  mutate(year_for_plotting = year(JOB_START_DATE),
         month_year_for_plotting = floor_date(JOB_START_DATE, "month"))

##########################
# Over-time plots during COVID (so no filtering)
##########################

covid_patterns = general_data %>% filter(year_for_plotting >= 2019 &
                                        year_for_plotting < 2021)

covid_agg = covid_patterns %>%
        group_by(month_year_for_plotting) %>%
        summarise(unique_emp = length(unique(jobs_group_id)),
                  unique_employers_with_investigations = n_distinct(jobs_group_id[outcome_is_investigation_overlapsd == TRUE]),
                  unique_employers_with_violations = n_distinct(jobs_group_id[outcome_is_viol_overlapsd == TRUE])) %>%
        ungroup() %>%
        reshape2::melt(id.vars = "month_year_for_plotting") %>%
        mutate(my_pos = as.POSIXct(month_year_for_plotting,
                                   format = "%Y-%m-%d"))

### add shading
ggplot(covid_agg %>% filter(!grepl("violation", variable)), 
  aes(x = my_pos, y = value, group = variable, fill = variable)) +
  geom_bar(stat = "identity") +
  theme_DOL() +
  theme(axis.text.x = element_text(angle = 90, size = 10), 
        legend.position = c(0.15, 0.6),
        strip.text = element_blank()) +
  xlab("Month and year of job start date\n(rounded to month)") +
  ylab("Count unique employers") +
  scale_x_datetime(date_breaks = "2 months") +
  scale_y_continuous(breaks = pretty_breaks(n = 10)) +
  scale_fill_manual(values = c("unique_emp" = as.character(color_guide["jobs"]),
                               "unique_employers_with_investigations" = 
                                 as.character(color_guide['WHD investigations'])),
                    labels = 
                c("unique_emp" = "Job orders",
                "unique_employers_with_investigations" = "Investigations"))  +
  labs(fill = "") +
  geom_rect(aes(xmin= as.POSIXct('2020-03-01'),
                xmax = as.POSIXct('2020-12-31'),
                ymin = -Inf,
                ymax = Inf), fill = 'tomato1', alpha = 0.01,
            color = "white") +
  facet_wrap(~variable, scales = "free") 

ggsave(here("output/figs", "covid_focus_WHD.pdf"), 
       plot = last_plot(), 
       device = "pdf",
       width = 12, height = 8)

# put the relevant date column in a cleaner date format
trla_data <- trla_data %>%
  mutate(JOB_START_DATE = ymd(JOB_START_DATE))

# extract the year for plotting
trla_data <- trla_data %>%
  mutate(year_for_plotting = year(JOB_START_DATE),
         month_year_for_plotting = floor_date(JOB_START_DATE, "month"))

### covid over-time pre filtering
covid_patterns_tr = trla_data %>% filter(year_for_plotting >= 2019 &
                                           year_for_plotting < 2021)

covid_agg_tr = covid_patterns_tr %>%
  group_by(month_year_for_plotting) %>%
  summarise(unique_emp = length(unique(jobs_row_id)),
            unique_employers_with_investigations = n_distinct(jobs_row_id[outcome_is_investigation_overlapsd_trla == TRUE])) %>%
  ungroup() %>%
  reshape2::melt(id.vars = "month_year_for_plotting") %>%
  mutate(my_pos = as.POSIXct(month_year_for_plotting,
                             format = "%Y-%m-%d"))

### add shading
ggplot(covid_agg_tr %>% filter(!grepl("violation", variable)), 
       aes(x = my_pos, y = value, group = variable, fill = variable)) +
  geom_bar(stat = "identity") +
  theme_DOL() +
  theme(axis.text.x = element_text(angle = 90, size = 10), 
        legend.position = c(0.15, 0.9),
        strip.text = element_blank()) +
  xlab("Month and year of job start date\n(rounded to month)") +
  ylab("Count unique employers") +
  scale_x_datetime(date_breaks = "2 months") +
  scale_y_continuous(breaks = pretty_breaks(n = 10)) +
  scale_fill_manual(values = c("unique_emp" = as.character(color_guide["jobs"]),
                               "unique_employers_with_investigations" = 
                                 as.character(color_guide['WHD investigations'])),
                    labels = 
                      c("unique_emp" = "Job orders",
                        "unique_employers_with_investigations" = "Investigations"))  +
  labs(fill = "") +
  geom_rect(aes(xmin= as.POSIXct('2020-03-01'),
                xmax = as.POSIXct('2020-12-31'),
                ymin = -Inf,
                ymax = Inf), fill = 'tomato1', alpha = 0.01,
            color = "white") +
  facet_wrap(~variable, scales = "free") 

ggsave(here("output/figs", "covid_focus_trla.pdf"), 
       plot = last_plot(), 
       device = "pdf",
       width = 12, height = 8)
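For the DOL suggestion about x-axis labels like "Jan-2019", a possible approach is the `date_labels` argument of `scale_x_datetime()`, which takes a strftime-style format. The sketch below uses made-up monthly counts (the `toy` data frame is hypothetical; only the `my_pos` and `value` column names mirror the code above), and the month abbreviation produced by `%b` depends on the session locale.

```r
library(ggplot2)

# Hypothetical monthly counts; column names mirror covid_agg above
toy <- data.frame(
  my_pos = as.POSIXct(seq(as.Date("2019-01-01"), as.Date("2020-12-01"),
                          by = "month")),
  value = seq_len(24)
)

# Same bar layout as above, with tick labels formatted as month-year
p <- ggplot(toy, aes(x = my_pos, y = value)) +
  geom_bar(stat = "identity") +
  scale_x_datetime(date_breaks = "2 months", date_labels = "%b-%Y") +
  theme(axis.text.x = element_text(angle = 90, size = 10)) +
  xlab("Month and year of job start date\n(rounded to month)") +
  ylab("Count unique employers")

print(p)
```

In the two plots above, the equivalent change would be adding `date_labels = "%b-%Y"` to the existing `scale_x_datetime(date_breaks = "2 months")` calls.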

[X] Line 735 says "pooling data from all states across 2014-2021" and we are asked to specify the data sources. Is this just the WHD compliance action data?

All three! added discussion

[X] Lines 738/9 use the term "training/estimation set employers" and we are asked to specify what this means. If you or Helen could do this that would be awesome

Added discussion

[ ] Bottom of page 31, I have a footnote link that is hanging over the margin that I cannot seem to fix. Not sure if you knew any fixes or if copyeditor could handle

Yep let's ask copy editor

[ ] Figures 12 and 13 (both titled "Predictions versus true investigations label") require clarification for how to interpret them in the text, clarification for the word "label" in the title, and "a note to explain the analytic model and the sample sizes used for the data shown." I unfortunately do not understand these figures enough to add these changes

[X] On table 4, they're asking for more information about the analytic model/sample size (and to apply this where else is appropriate). If you or Helen could help with this that would be great

Thanks yea! it seems a bit weird since we have Section \ref{sec:sample_size} that discusses the exact sample size. I modified this caption. I think for others we can either leave as is or copy editor add sample size to caption.

[ ] Line 873, we have "it was less useful to predict which entities are likely to face local, TRLA-based oversight of activities versus federal oversight" and they ask us to summarize the data to support the claim. Was hoping to get help as this seems to be about ML

Answered above, if you could copy the line here

from lizzie:

[X] Line 121 - can you validate that what I wrote is correct?

great overall! made some small edits

[X] Line 164 - question about source citation

yes that's fine! responded in overleaf


rebeccajohnson88 commented 2 years ago

@rebeccajohnson88 -- reminder to self to add more informative captions to figures 12 and 13.

rebeccajohnson88 / qss20_s21_proj

CG questions for RJ #32