healthyregions / oepsData

An R package for easy access to the Opioid Environment Policy Scan (OEPS) datasets.

Add all OEPS data CSVs to scripts in data-raw #2

Closed · mradamcox closed this issue 1 month ago

mradamcox commented 1 month ago

Now that we have a single dataset working well, we need to add all the others. Ideally this could be done by downloading the CSV content directly within the script and writing it out to .rda files, without needing to download and store the raw data in this repository.

An example source URL would be: https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/main/data_final/full_tables/C_1980.csv
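A minimal sketch of what such a data-raw script could look like, assuming readr and usethis are available; the base URL follows the example above, and the single table name is just illustrative:

```r
# data-raw sketch: pull one OEPS CSV straight from GitHub and save it as .rda.
# Assumes readr and usethis; extendable to a loop over all table names.
base_url <- "https://raw.githubusercontent.com/GeoDaCenter/opioid-policy-scan/main/data_final/full_tables"

C_1980 <- readr::read_csv(paste0(base_url, "/C_1980.csv"), show_col_types = FALSE)

# Writes data/C_1980.rda without keeping the raw CSV in this repository
usethis::use_data(C_1980, overwrite = TRUE)
```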

One open question is whether we can/should store geometry data in the .rda files. If we can, then perhaps a join should be carried out before calling usethis::use_data(). As above, it would be ideal if we could load geometry data through HTTP calls instead of having to store those datasets in this package. We have shapefile/GeoJSON files in S3, and could produce other formats if needed, like CSV with WKT geometries.
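For the geometry question, a pre-save join might look like the following sketch, assuming sf and dplyr are available; the S3 URL and the GEOID join key are assumptions, not actual file locations or schema:

```r
# Sketch: attach geometries before saving, assuming a GeoJSON file on S3.
# The URL and the GEOID join column are hypothetical.
geom <- sf::st_read("https://example-bucket.s3.amazonaws.com/counties.geojson",
                    quiet = TRUE)

# Joining with the sf object on the left keeps the result an sf data frame
C_1980_sf <- dplyr::left_join(geom, C_1980, by = "GEOID")

usethis::use_data(C_1980_sf, overwrite = TRUE)
```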

bucketteOfIvy commented 1 month ago

After poking through this, our data size likely forces hosting limitations on us if we want to store the data locally. Even without geographies, we already have over 20 MB of data under optimal compression, which is four times the 5 MB cap set by CRAN, meaning we'd likely be limited to hosting on GitHub and requiring users to install via devtools. When geometries are included, however, a few of our files balloon to 150 MB -- larger than GitHub's 100 MB file size restriction* -- which could force us to find a new hosting solution entirely.
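For reference, one way to check the compressed footprint against CRAN's cap is via the tools package; this sketch assumes the .rda files live under data/:

```r
# Resave every .rda under data/ with the best available compression,
# then total the resulting file sizes in megabytes.
tools::resaveRdaFiles("data", compress = "auto")
rda_files <- list.files("data", pattern = "\\.rda$", full.names = TRUE)
sum(file.size(rda_files)) / 1e6  # CRAN's informal cap on package size is ~5 MB
```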

With that said, I think we have a few ways forward, depending on what our goals for end-users are.

  1. If we want to launch on CRAN, we could require users to access data through the load function (sketched after this list). This would let us pull the data from other locations (e.g. S3, BigQuery) on demand -- caching it upon user request to avoid redundant downloads -- meaning the data does not have to sit in the package itself. Additionally, it would let us tether the package more closely to the back-end data sources, and allow us to "spread the wealth" of improvements to those sources. For instance, I could see this approach motivating the inclusion of a "suggested theme" variable in the tabular data in BigQuery, which could be filtered on in a SQL query. The main disadvantages of this approach are that it opens up questions about data documentation -- maybe we would need a load_docs function akin to the load_variables function from tidycensus? -- and that it prevents the package from being used offline.
  2. If we are fine launching only on GitHub but really want geometries stored in the data package, we might be able to split the data into more chunks. I'm not yet sure how many chunks we'd need, nor whether there are "reasonable-sounding" chunks we could use, but it should be doable.
  3. If we are fine launching only on GitHub and are willing not to store geometries in the data package directly, we can leave AddDataframes.R and the data folder as they are, but update the load function to pull geometries upon user request via tigris (see the note after this list). This is something of a middle-ground approach, and (while we can only guarantee launching on GitHub) it's not necessarily impossible to take this approach and still launch on CRAN, provided we agree to edit the package only rarely.
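To make option 1 concrete, here is a minimal sketch of an on-demand loader with caching; the load_oeps name, the S3 URL, and the CSV layout are all hypothetical, not the package's actual API:

```r
# Hypothetical on-demand loader (option 1): fetch a table from remote storage
# on first use and cache it locally so repeat calls stay offline-friendly.
load_oeps <- function(table, cache = TRUE) {
  cache_dir <- tools::R_user_dir("oepsData", which = "cache")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  local_path <- file.path(cache_dir, paste0(table, ".csv"))

  # Only hit the remote source when no cached copy exists
  if (!cache || !file.exists(local_path)) {
    url <- paste0("https://example-bucket.s3.amazonaws.com/", table, ".csv")
    utils::download.file(url, local_path, mode = "wb", quiet = TRUE)
  }
  readr::read_csv(local_path, show_col_types = FALSE)
}
```

For option 3, the geometry side could be as light as a call to tigris::counties(cb = TRUE) joined to the packaged table at load time.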

From my perspective, approach 1 seems to solve most of our problems and enables us to tether the package to the BigQuery back end. However, while this tethering seems mostly good -- it would mean that, e.g., we could improve both the package and the back end with some of the same code and work -- it could also make it harder to expand this from an opioid risk environment package to a broader SDOH (social determinants of health) package, so there is a potential tradeoff at play.

*Although there may be ways around this while still hosting on GitHub that I'm unaware of

mradamcox commented 1 month ago

Per our discussion today, we'll head toward 3 for the time being, with 1 being a likely long-term solution. At ~20 MB, our current non-spatial tables are plenty small enough to be hosted here on GitHub, so we can continue developing with them for now as we look at the question of geometries. Specifically: if we were to create our own curated set of geography data (from TIGER/Line and/or cartographic boundary files) and store it remotely on S3, what file format should we use to make loading in R as efficient as possible?
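On the format question, a quick way to compare candidates is to download each and time sf::st_read(); the URLs below are placeholders for the hypothetical S3 files:

```r
# Rough timing comparison of candidate geometry formats; URLs are placeholders.
formats <- c(
  geojson = "https://example-bucket.s3.amazonaws.com/counties.geojson",
  gpkg    = "https://example-bucket.s3.amazonaws.com/counties.gpkg"
)
timings <- sapply(names(formats), function(fmt) {
  tmp <- tempfile(fileext = paste0(".", fmt))
  utils::download.file(formats[[fmt]], tmp, mode = "wb", quiet = TRUE)
  system.time(sf::st_read(tmp, quiet = TRUE))["elapsed"]
})
timings  # lower elapsed time = faster to load
```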

With regard to docs, we could either use load_docs to pull dynamically from our schema files in the oeps repo, or perhaps build a script in that repo that does the same thing and stores docs files in the package itself?
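A load_docs along those lines could be as simple as the sketch below, assuming jsonlite is available; the repo path and JSON schema layout are assumptions about the oeps repo:

```r
# Hypothetical load_docs(): fetch a table's schema/documentation on demand.
# The raw.githubusercontent.com path and file layout are assumptions.
load_docs <- function(table) {
  url <- paste0(
    "https://raw.githubusercontent.com/healthyregions/oeps/main/schemas/",
    table, ".json"
  )
  jsonlite::fromJSON(url)
}
```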

I'm going to close this for now, as all of the CSV files are now in data, and we are on to the next part of this process...