jaanli opened 6 months ago
I started looking into this over the weekend and it appears there are 3 different datasets here:
I downloaded these files and added sources for them to the sources.yml file I had started working on. Unioning all the separate year files together is pretty straightforward using dbt-duckdb's built-in excel plugin combined with a few dbt macros. I'll have a pull request with those updates ready, maybe later this week. The only question I have is whether the output should be partitioned in any way; the largest Parquet file (the tract data) is 1.6 GB once generated.
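To make that concrete, here's a rough sketch of the kind of union model I mean, assuming sources.yml registers one table per year via the excel plugin (all source/table names below are placeholders):

```sql
-- models/sdoh_county_all_years.sql (placeholder name)
-- Union the per-year Excel sources into one table, tagging each row
-- with the year it came from.
{% set years = [2018, 2019, 2020] %}

{% for year in years %}
select
    {{ year }} as source_year,
    *
from {{ source('sdoh', 'sdoh_' ~ year ~ '_county') }}
{% if not loop.last %}union all by name{% endif %}
{% endfor %}
```

`union all by name` is DuckDB syntax that matches columns by name rather than position, which helps if the column set drifts a bit between years.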
There seem to be a number of separate for-purchase database files at various levels, all of which require data use agreements. We'll need additional support here in order to get access to the data at all.
These data files are easy to access, in multiple formats, and there is an additional repo with very good documentation and workshops too. I'm guessing the Full Year Population Characteristics datasets are the most useful, but I'm still trying to figure that out.
Does my understanding here match yours @jaanli?
That sounds great, you're the best @chrisgebert!
This is an excellent test of the infrastructure we have to share work and resources here with the wider community.
I can add these Parquet files to our S3 bucket: https://data.payless.health/ (there are several multi-gigabyte Parquet files there already, such as the one from the first homework of last summer's course: https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb -- this actually ended up in a MotherDuck blog post! https://motherduck.com/blog/introducing-column-explorer/)
Or I'm happy to give you permissions to upload as well -- I don't know how to set up dbt to connect to S3 yet :S
Awesome work, go team!
We can define the S3 settings that DuckDB will use when writing external files within the dbt-duckdb profile, like this:
```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev
```
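With those settings in place, a model can be written straight to S3 as Parquet using dbt-duckdb's external materialization. A minimal sketch, with a placeholder bucket path and model names:

```sql
-- models/sdoh_tract.sql (placeholder): materialize the model as a
-- Parquet file on S3 instead of a table in the local DuckDB file.
{{ config(
    materialized='external',
    location='s3://my-bucket/sdoh/sdoh_tract.parquet',
    format='parquet'
) }}

select * from {{ ref('sdoh_tract_all_years') }}
```

DuckDB's COPY also supports PARTITION_BY, so the partitioning question from above could probably be handled at this step as well, though I haven't tried that yet.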
Or we can use the `use_credential_provider` property, which I haven't used before but I believe supports any AWS method of authentication.
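If I'm reading the dbt-duckdb docs right, that variant would look roughly like this; it needs boto3 installed and picks up credentials from the standard AWS chain (environment variables, ~/.aws/credentials, instance profiles, etc.):

```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      # Fetch AWS credentials via boto3 instead of static keys
      use_credential_provider: aws
      settings:
        s3_region: my-aws-region
  target: dev
```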
Hi @jaanli,
I don't have permission to create a pull request on this repository. Would you rather I commit my code to a fork, or how would you prefer to handle contributions?
Thanks for letting me know @chrisgebert! Just added you, please let me know if that doesn't work. Excited to try out the S3 settings! Haven't used that before :)
If anyone has cycles, these other datasets would be good to port to our DuckDB/dbt/Observable Framework trifecta:
https://www.ahrq.gov/sdoh/data-analytics.html
Example notebook to ease this process of importing data and generating boilerplate dbt/duckdb models:
https://colab.research.google.com/github/onefact/electronic-health-records/blob/main/notebooks/loading_physionet_mimic_data.ipynb
Comment here if anyone is interested in tackling this!