tiernanmartin / NeighborhoodChangeTypology

Project: Neighborhood Change Typology for King County, WA

Store ACS Data in OSF.io #6

Closed tiernanmartin closed 6 years ago

tiernanmartin commented 6 years ago

All data should be stored in the project's osf.io project page.

Shift the make_acs_data() script to extdata/R/ and create a new version that downloads the data from osf.io.

tiernanmartin commented 6 years ago

I'm changing my approach (slightly).

I will keep the code inside make_acs_data() but will add a step that uploads the data to osf.io and then downloads it, thereby ensuring that the steps are captured within the drake plan.

This pattern makes sense for external datasets that are likely to change during the course of a project. For instance, I will probably want to add a new ACS table to the project at some point, so the pattern is helpful there; however, I am unlikely to update the water bodies spatial data, so that script can stay in extdata/R -- outside the drake plan.

tiernanmartin commented 6 years ago

There is a problem with this implementation: users who do not have access to the OSF.io project will not be able to run a command that uploads data to the project.

The solution that I'd like to try includes the following steps:

  1. Create a target at the beginning of the plan which tests whether the user has permission to upload to the OSF project
  2. All commands that create external data should begin with something like:

    make_target <- function(has_osf_permission){
      # Skip the upload step for users without write access to the OSF project
      if(!has_osf_permission){return(NULL)}

      # function text here
    }
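
A minimal sketch of how the two steps could fit together, using base R only. `check_osf_permission()` is a hypothetical helper, and treating a non-empty `OSF_PAT` environment variable as proof of access is an assumption made for illustration; a real check would need to ask the OSF project itself.

```r
# Hypothetical helper: a non-empty token is only a proxy for write access;
# a real check would query the OSF project itself.
check_osf_permission <- function(token = Sys.getenv("OSF_PAT")) {
  nzchar(token)
}

# Guard pattern from step 2: return NULL early when the user cannot upload.
make_target <- function(has_osf_permission) {
  if (!has_osf_permission) {
    return(NULL)
  }
  # Placeholder for the real data creation and upload steps.
  "uploaded"
}

has_osf_permission <- check_osf_permission()
result <- make_target(has_osf_permission)
```

Users without a token get `NULL` from every guarded command, so the rest of the plan has to tolerate a `NULL` upstream value.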

tiernanmartin commented 6 years ago

This check for OSF access could also be incorporated as a trigger.

tiernanmartin commented 6 years ago

The external data plan will be split into four separate plans:

  1. ext_data_prep_plan
  2. ext_data_upload_plan
  3. ext_data_download_plan
  4. ext_data_ready_plan

While all four plans will be available to any user for inspection, the first two plans are only intended to be run and/or modified by the project creator. The third and fourth plans will be the starting point for other users who are reproducing the project.

tiernanmartin commented 6 years ago

After implementing the drake plan structure I realized that there is a problem: the 1st and 2nd plans are isolated from the 3rd and 4th plans (by design), so drake doesn't know that the 1st and 2nd should be run before the 3rd and 4th.

I want the 1st and 2nd plans to be independent from the rest of the project's drake plan because these plans should only be run/modified by the project manager.

I think it is worthwhile to split the external data plan into two:

  1. data_source_plan
    • includes obj_prep_status and obj_upload_status targets
  2. data_cache_plan
    • includes obj_filepath and obj targets

The first -- data_source_plan -- would only be run when a data source change occurs and it would be run separately from the rest of the project's drake plan.
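
A rough sketch of what the two plans might look like, with the ACS data as the example object. The command functions (`prep_acs_data()`, `upload_acs_data()`, `download_acs_data()`, `read_acs_data()`) are hypothetical placeholders, not functions that exist in the project; `drake_plan()` only records the commands, so the plans can be built and inspected without them.

```r
library(drake)

# Run only by the project manager, and only when a data source changes.
data_source_plan <- drake_plan(
  acs_prep_status   = prep_acs_data(),                  # hypothetical
  acs_upload_status = upload_acs_data(acs_prep_status)  # hypothetical
)

# Run by everyone: fetch the stored copy, then load it.
data_cache_plan <- drake_plan(
  acs_filepath = download_acs_data(),                   # hypothetical
  acs          = read_acs_data(acs_filepath)            # hypothetical
)

# The plans are run separately, for example:
# make(data_source_plan)   # project manager only
# make(data_cache_plan)    # any user reproducing the project
```

Because the plans are never bound together, drake has no dependency between them; running `data_source_plan` first after a source change remains a manual step.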

tiernanmartin commented 6 years ago

Here is a diagram of the idea behind this different architecture for the data plans:

[Diagram: r process]

tiernanmartin commented 6 years ago

It might be helpful to put the functions that make this work into their own package (at some point).