grouping variables in grab_lodes() tibbles

aakarner commented 6 years ago

Thanks for putting this package together - it's a lot more elegant than constantly pulling down csvs from the census bureau's site.

I'm wondering why the tibbles returned by grab_lodes() have state and year grouping variables set by default. I see that this kind of makes sense if you're pulling data for multiple states/years, but even then it seems like you'd want to give the user flexibility to define their own groups.

I just needed to calculate some quick accessibility measures, for example, so I inner_join()ed a skim file to the 2014 GA wac file. I join on the destination from the skim and then want to group by origin (to get, e.g. total jobs accessible within 45 mins). If I run the necessary dplyr steps without first ungroup()ing, I get a warning that the grouping variables are being added back in. My final tibble then has several extra columns that simply repeat the year and the state.

Maybe provide a parameter in the function call to disable grouping in the output? Or disable it by default and allow the user to specify grouping variables in the output?
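For anyone reading along, the behavior described above can be reproduced without downloading anything. This is a minimal sketch using a made-up toy tibble as a stand-in for a `grab_lodes()` result (the tract IDs and job counts are invented for illustration); it only relies on standard dplyr verbs.

```r
library(dplyr)

# Toy stand-in for a grab_lodes() result; all values here are made up
jobs <- tibble(
  year = 2014,
  state = "GA",
  w_tract_id = c("13001950100", "13001950200"),
  C000 = c(120, 80)
) %>%
  group_by(year, state)

# select() on a grouped tibble keeps the grouping columns even though
# they were not requested, and dplyr reports that it is adding them back
kept <- select(jobs, w_tract_id, C000)
names(kept)

# Dropping the groups first returns only the requested columns
dropped <- jobs %>%
  ungroup() %>%
  select(w_tract_id, C000)
names(dropped)
```

The extra `year` and `state` columns in `kept` are the ones that end up repeated in the final tibble.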

dillonma commented 6 years ago

Hey @aakarner -- awesome use case, totally heard. Help me understand the skim file you're using. What's the geographic resolution -- is it block, bg, tract?

aakarner commented 6 years ago

The skim table contains the origin tract geoid, destination tract geoid, and travel time for some set of population-weighted census tract centroids.

Here's an example. Skim data available here: https://www.dropbox.com/s/un32m83r4yzjl2e/SampleAutoSkims.csv?dl=0.

library(dplyr)
library(lehdr)

# Read skim data
auto_skims <- read.csv("SampleAutoSkims_ArcOnline.csv")
auto_skims$DestinationName <- as.character(auto_skims$DestinationName)
auto_skims$OriginName <- as.character(auto_skims$OriginName)

# Read LODES data
# Without this ungroup(), I get a warning
ga_jobs <- ungroup(
  grab_lodes(state = "ga", year = 2014, lodes_type = "wac", agg_geo = "tract",
             job_type = "JT00", segment = "S000"))

# Combine jobs and skim data by the destination location
acc_data <- inner_join(
  # Select only required variables from skims and jobs
  select(auto_skims, OriginName, DestinationName, Total_Time),
  select(ga_jobs, w_tract_id, C000),
  by = c("DestinationName" = "w_tract_id")) %>%
  # Add in a Hansen-style gravity decay factor
  mutate(decay = exp(Total_Time * -0.1))

# Calculate cumulative opportunities accessibility (45-min threshold)
acc_cumul <- acc_data %>%
  filter(Total_Time <= 45) %>%
  group_by(OriginName) %>%
  summarize(acc45 = sum(C000))

# Calculate gravity accessibility
acc_grav <- acc_data %>%
  group_by(OriginName) %>%
  summarize(accgrav = sum(C000 * decay))
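For reference, the two measures computed above can be written out as follows. The notation is an assumption introduced here, not part of the original code: $A_i$ is accessibility at origin tract $i$, $O_j$ is the job count (`C000`) at destination tract $j$, and $t_{ij}$ is the travel time in minutes (`Total_Time`).

```latex
% Cumulative opportunities within a 45-minute threshold
A_i^{\text{cumul}} = \sum_{j} O_j \, \mathbf{1}\!\left[ t_{ij} \le 45 \right]

% Gravity-style measure with exponential decay, decay parameter 0.1 per minute
A_i^{\text{grav}} = \sum_{j} O_j \, e^{-0.1 \, t_{ij}}
```

Both sums run over the destination tracts that survive the `inner_join()`, so tracts missing from either the skims or the jobs table do not contribute.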

jamgreen commented 6 years ago

Yeah, the primary use case is mass downloads of the LODES data for multiple states and multiple years. We'll take a look and see if there's a less ham-fisted solution here so you're not getting this understandably unexpected behavior.

jamgreen commented 5 years ago

Sorry this took so long, but I recently updated lehdr and the resulting tibbles should now be ungrouped. Please test and let me know if you still run into the same issue.
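One quick way to test this is to check the grouping of whatever comes back. The sketch below uses a made-up toy tibble in place of a real `grab_lodes()` call so it runs offline; the same two dplyr checks should apply to an actual result.

```r
library(dplyr)

# Toy stand-in for a grab_lodes() result; values are made up
toy <- tibble(year = 2014, state = "GA", C000 = c(120, 80))
grouped <- group_by(toy, year, state)

# An ungrouped tibble has no grouping columns
is_grouped_df(toy)
group_vars(toy)

# A grouped tibble reports the columns it is grouped by
is_grouped_df(grouped)
group_vars(grouped)
```

If `group_vars()` returns an empty character vector for the `grab_lodes()` output, the extra `ungroup()` call is no longer needed.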

aakarner commented 4 years ago

Looks good now, thanks!

jamgreen / lehdr

grouping variables in grab_lodes() tibbles #14