onefact / healthcare-data

Examples of exploring synthetic healthcare data from the Agency for Healthcare Research and Quality in the United States Department of Health and Human Services, and MITRE Corporation.
https://onefact.github.io/synthetic-healthcare-data/
Apache License 2.0

chrisgebert: Add dbt config files and Social Determinants of Health models #10

chrisgebert closed this pull request 5 months ago

chrisgebert commented 6 months ago

I'm adding a few things in this PR, mostly to support the addition of datasets from Social Determinants of Health as mentioned in Issue #5 including:

  1. a sources.yml file
  2. a profiles.yml file
  3. Social Determinants of Health models

1. sources.yml

This configuration file is used by dbt to define sources: tables, column types, and even local file locations for the dbt-duckdb adapter when running relevant dbt commands. This sources.yml file will reduce repeated code across the existing inpatient, outpatient, pharmacy, and provider models for SyH-DR data, though it's important to note that those models are unchanged and will not refer to these source definitions until they are modified to do so. If it's preferable to remove these unused source configurations until the models are actually using them, I'm happy to do so to reduce potential confusion. I don't have access to that data, so I'd like to do some testing (or work with someone who can) before changing those models to use these sources.

I also added sources for the other existing models (consumer_price_index and synthea) and the same applies to them: the downstream models are unchanged and will not use these source definitions until updated to do so.

Finally, in this file, I defined sources to be used by the Social Determinants of Health models that I describe in more detail below. Each of these uses the built-in dbt-duckdb Excel plugin to read the specified Excel sheets into dbt relations that are then referenced in the model transformation logic.
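For illustration, a minimal sketch of what one of these source definitions might look like (the table name, file path, and exact meta keys here are hypothetical and depend on the dbt-duckdb version; the real file defines the full set):

```yaml
version: 2

sources:
  - name: social_determinants_of_health
    meta:
      plugin: excel  # dbt-duckdb source plugin that reads Excel sheets into relations
    tables:
      - name: sdoh_2009_county
        meta:
          external_location: "../data/sdoh/SDOH_2009_COUNTY_1_0.xlsx"
```

Models can then reference the sheet as `{{ source('social_determinants_of_health', 'sdoh_2009_county') }}` instead of hard-coding the file path.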

2. profiles.yml

This is a very simple duckdb profile configuration file that can be embedded within this project so that different users/contributors don't each need to configure and maintain the adapter on their own. If we move toward providing S3 access via keys and secrets, we'd implement that through environment variables that the dbt-duckdb adapter can read, as I mentioned.

3. Social Determinants of Health models

These are three models that unpivot and union together all years from the Social Determinants of Health datasets across the various data files: County Data, Zip Code Data, and Census Tract Data.

I had downloaded these files manually and stored them in the project directory at /data/sdoh, which corresponds to their external_location in the sources.yml file for each of these sources.

I'm certain there is a better way to union these files together (some combination of dbt_utils.get_relations_by_pattern and dbt_utils.union_relations), but I couldn't get the Jinja syntax to work while also unpivoting each of them.
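For reference, one possible shape for such a model is a hand-rolled Jinja loop that unions the per-year sources and then unpivots in DuckDB. This is only a sketch, not the code in the PR: the year list, source names, and key column are hypothetical, and DuckDB's UNPIVOT requires the measure columns to share a type.

```sql
-- Sketch: union all per-year county sheets by column name, then unpivot to long rows.
{% set years = [2009, 2010, 2011] %}  -- hypothetical year list

with unioned as (
    {% for year in years %}
    select *, {{ year }} as sdoh_year
    from {{ source('social_determinants_of_health', 'sdoh_' ~ year ~ '_county') }}
    {% if not loop.last %}union all by name{% endif %}
    {% endfor %}
)

select * from (
    -- DuckDB's UNPIVOT turns the remaining measure columns into (variable, value) pairs
    unpivot unioned
    on columns(* exclude (countyfips, sdoh_year))
    into name variable value value
)
```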

jaanli commented 6 months ago

Amazing, thank you @chrisgebert !!

Testing now:

  1. Clone the repo: git clone git@github.com:onefact/healthcare-data.git
  2. Checkout this pull request: git checkout origin/chrisgebert-add_dbt_sources
  3. Install environment: python3 -m venv .venv && source .venv/bin/activate.fish && pip install --upgrade pip && pip install -r requirements.txt
  4. Go to directory: cd data_processing
  5. Install dbt dependencies: dbt deps
  6. Download syh-dr-csv.zip and move to ~/data/syh_dr
  7. Unzip: unzip syh-dr-csv.zip
  8. Run dbt run

Output:

```
healthcare-data/data_processing on  HEAD (aa330b9) [?] via 🐍 v3.10.14 (.venv) 🕙 26s
❯ dbt run
13:31:59 Running with dbt=1.7.11
13:32:02 Registered adapter: duckdb=1.7.3
13:32:02 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.healthcare_data.example
13:32:02 Found 22 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
13:32:02
13:32:16 Concurrency: 1 threads (target='dev')
13:32:16
13:32:16 1 of 22 START python table model main.download_consumer_price_index ............ [RUN]
13:32:16 1 of 22 ERROR creating python table model main.download_consumer_price_index ... [ERROR in 0.08s]
13:32:16 2 of 22 START sql external model main.insurance_plan_payment_histogram ......... [RUN]
13:32:16 2 of 22 ERROR creating sql external model main.insurance_plan_payment_histogram  [ERROR in 0.04s]
13:32:16 3 of 22 START sql external model main.sdoh_county .............................. [RUN]
13:32:16 Unhandled error while executing [Errno 2] No such file or directory: '../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx'
13:32:16 3 of 22 ERROR creating sql external model main.sdoh_county ..................... [ERROR in 0.01s]
13:32:16 4 of 22 START sql external model main.sdoh_tract ............................... [RUN]
13:32:16 Unhandled error while executing [Errno 2] No such file or directory: '../data/SDOH/sdoh_2009_tract_1_0.xlsx'
13:32:16 4 of 22 ERROR creating sql external model main.sdoh_tract ...................... [ERROR in 0.01s]
13:32:16 5 of 22 START sql external model main.sdoh_zipcode ............................. [RUN]
13:32:16 Unhandled error while executing [Errno 2] No such file or directory: '../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx'
13:32:16 5 of 22 ERROR creating sql external model main.sdoh_zipcode .................... [ERROR in 0.01s]
13:32:16 6 of 22 START sql external model main.syhdr_commercial_inpatient_2016 .......... [RUN]
13:32:18 6 of 22 OK created sql external model main.syhdr_commercial_inpatient_2016 ..... [OK in 2.50s]
13:32:18 7 of 22 START sql external model main.syhdr_commercial_outpatient_2016 ......... [RUN]
13:32:54 7 of 22 OK created sql external model main.syhdr_commercial_outpatient_2016 .... [OK in 35.64s]
13:32:54 8 of 22 START sql external model main.syhdr_commercial_person_2016 ............. [RUN]
13:33:15 8 of 22 OK created sql external model main.syhdr_commercial_person_2016 ........ [OK in 20.69s]
13:33:15 9 of 22 START sql external model main.syhdr_commercial_pharmacy_2016 ........... [RUN]
13:33:38 9 of 22 OK created sql external model main.syhdr_commercial_pharmacy_2016 ...... [OK in 23.21s]
13:33:38 10 of 22 START sql external model main.syhdr_medicaid_inpatient_2016 ........... [RUN]
13:33:42 10 of 22 OK created sql external model main.syhdr_medicaid_inpatient_2016 ...... [OK in 3.59s]
13:33:42 11 of 22 START sql external model main.syhdr_medicaid_outpatient_2016 .......... [RUN]
13:34:21 11 of 22 OK created sql external model main.syhdr_medicaid_outpatient_2016 ..... [OK in 38.91s]
13:34:21 12 of 22 START sql external model main.syhdr_medicaid_person_2016 .............. [RUN]
13:34:43 12 of 22 OK created sql external model main.syhdr_medicaid_person_2016 ......... [OK in 22.00s]
13:34:43 13 of 22 START sql external model main.syhdr_medicaid_pharmacy_2016 ............ [RUN]
13:35:07 13 of 22 OK created sql external model main.syhdr_medicaid_pharmacy_2016 ....... [OK in 24.38s]
13:35:07 14 of 22 START sql external model main.syhdr_medicaid_provider_2016 ............ [RUN]
13:35:07 14 of 22 OK created sql external model main.syhdr_medicaid_provider_2016 ....... [OK in 0.05s]
13:35:07 15 of 22 START sql external model main.syhdr_medicare_inpatient_2016 ........... [RUN]
13:35:10 15 of 22 OK created sql external model main.syhdr_medicare_inpatient_2016 ...... [OK in 2.92s]
13:35:10 16 of 22 START sql external model main.syhdr_medicare_outpatient_2016 .......... [RUN]
13:35:54 16 of 22 OK created sql external model main.syhdr_medicare_outpatient_2016 ..... [OK in 44.39s]
13:35:54 17 of 22 START sql external model main.syhdr_medicare_person_2016 .............. [RUN]
13:36:08 17 of 22 OK created sql external model main.syhdr_medicare_person_2016 ......... [OK in 14.04s]
13:36:08 18 of 22 START sql external model main.syhdr_medicare_pharmacy_2016 ............ [RUN]
13:36:57 18 of 22 OK created sql external model main.syhdr_medicare_pharmacy_2016 ....... [OK in 48.82s]
13:36:57 19 of 22 START sql external model main.syhdr_medicare_provider_2016 ............ [RUN]
13:36:57 19 of 22 OK created sql external model main.syhdr_medicare_provider_2016 ....... [OK in 0.06s]
13:36:57 20 of 22 START sql external model main.synthea ................................. [RUN]
13:36:58 20 of 22 OK created sql external model main.synthea ............................ [OK in 0.41s]
13:36:58 21 of 22 SKIP relation main.consumer_price_index ............................... [SKIP]
13:36:58 22 of 22 SKIP relation main.insurance_plan_payment_histogram_inflation_adjusted  [SKIP]
13:36:58
13:36:58 Finished running 1 table model, 21 external models in 0 hours 4 minutes and 55.80 seconds (295.80s).
13:36:58
13:36:58 Completed with 5 errors and 0 warnings:
13:36:58
13:36:58 Runtime Error in model download_consumer_price_index (models/bls.gov/download_consumer_price_index.py)
  Python model failed: [Errno 2] No such file or directory: '/Users/me/data/syh_dr/r-cpi-u-rs-allitems.xlsx'
13:36:58
13:36:58 Runtime Error in model insurance_plan_payment_histogram (models/figures/insurance_plan_payment_histogram.sql)
  IO Error: No files found that match the pattern "/Users/me/data/syh_dr/syhdr_commercial_inpatient_2016.parquet"
13:36:58
13:36:58 [Errno 2] No such file or directory: '../data/SDOH/SDOH_2009_COUNTY_1_0.xlsx'
13:36:58
13:36:58 [Errno 2] No such file or directory: '../data/SDOH/sdoh_2009_tract_1_0.xlsx'
13:36:58
13:36:58 [Errno 2] No such file or directory: '../data/SDOH/SDOH_2011_ZIPCODE_1_0.xlsx'
13:36:58
13:36:58 Done. PASS=15 WARN=0 ERROR=5 SKIP=2 TOTAL=22
```

Re: how to replicate:

> I had downloaded these files manually and stored them in the project directory at /data/sdoh, which corresponds to their external_location in the sources.yml file for each of these sources.

Do you have links to where I can also download these files? I assume that is why I get the above errors.

I can put the download script into python so hopefully it can be part of the dbt model too :)
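As a starting point, here is a hedged sketch of such a download helper. The base URL and file list below are placeholders, since the real links live in the SDOH data page table, and the paths are chosen to match the external_location values in sources.yml:

```python
import urllib.request
from pathlib import Path

# Placeholder values: the real base URL and file list come from AHRQ's SDOH data page.
BASE_URL = "https://example.ahrq.gov/sdoh/"
SDOH_FILES = [
    "SDOH_2009_COUNTY_1_0.xlsx",
    "sdoh_2009_tract_1_0.xlsx",
    "SDOH_2011_ZIPCODE_1_0.xlsx",
]

def target_path(data_dir: str, filename: str) -> Path:
    """Where a downloaded workbook should land, matching the external_location paths."""
    return Path(data_dir) / filename

def download_all(data_dir: str = "../data/SDOH") -> list:
    """Fetch any missing SDOH workbooks into data_dir and return their paths."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    for name in SDOH_FILES:
        dest = target_path(data_dir, name)
        if not dest.exists():
            urllib.request.urlretrieve(BASE_URL + name, dest)
        paths.append(dest)
    return paths
```

From there, the helper could be called inside a dbt Python model's `model(dbt, session)` function so the files are fetched as part of `dbt run`.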

I can also give you AWS SSO credentials in case you want to test the S3 upload on your end!

P.S. This PR is awesome because it will help us replicate https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3072038 (!!). We have been on this for close to a year and this data is just very hard to find, even confirmed with a few associate professors of law at Columbia who work with the Fed...

chrisgebert commented 6 months ago

@jaanli I had downloaded all the files that are linked in this table with the exception of the Codebook files.

Once you have those files locally, you can run just those downstream models by selecting the source:

```
dbt run --select source:social_determinants_of_health+
```

I've looked at the Codebook files since submitting the PR, and the Variable Label column in these files may be useful to join into the output files for additional context. I'll keep that in mind for later development work.

jaanli commented 5 months ago

thank you @chrisgebert !

here's the prompt: https://pastebin.com/G1M2wnXa (using https://gist.github.com/jaanli/5def01b7bd674efd6d9008cf1125986d)

added the resulting file from @anthropics.

execute with:

```
dbt run --select "models/ahrq.gov/sdoh/download*"
```

output of `dbt run --select source:social_determinants_of_health+`:

```
healthcare-data/data_processing on  HEAD (aa330b9) [?] via 🐍 v3.10.14 (.venv) 🕙 60ms
❯ dbt run --select source:social_determinants_of_health+
12:33:24 Running with dbt=1.7.11
12:33:24 Registered adapter: duckdb=1.7.3
12:33:24 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.healthcare_data.example
12:33:24 Found 23 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
12:33:24
12:33:25 Concurrency: 1 threads (target='dev')
12:33:25
12:33:25 1 of 3 START sql external model main.sdoh_county ............................... [RUN]
12:38:12 1 of 3 OK created sql external model main.sdoh_county .......................... [OK in 287.04s]
12:38:12 2 of 3 START sql external model main.sdoh_tract ................................ [RUN]
libc++abi: terminating due to uncaught exception of type duckdb::IOException: {"exception_type":"IO","exception_message":"Could not truncate file \".tmp/duckdb_temp_storage-3.tmp\": No space left on device","errno":"28"}
fish: Job 1, 'dbt run --select source:social_…' terminated by signal SIGABRT (Abort)
/Users/me/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

  * what are next steps to debug this?
  * how can we add this to s3 bucket next, at https://data.payless.health/#hospital_price_transparency/?

chrisgebert commented 5 months ago

> ```
> ❯ dbt run --select source:social_determinants_of_health+
> 12:33:24 Running with dbt=1.7.11
> 12:33:24 Registered adapter: duckdb=1.7.3
> 12:33:24 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
> There are 1 unused configuration paths:
> - models.healthcare_data.example
> 12:33:24 Found 23 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
> 12:33:24
> 12:33:25 Concurrency: 1 threads (target='dev')
> 12:33:25
> 12:33:25 1 of 3 START sql external model main.sdoh_county ............................... [RUN]
> 12:38:12 1 of 3 OK created sql external model main.sdoh_county .......................... [OK in 287.04s]
> 12:38:12 2 of 3 START sql external model main.sdoh_tract ................................ [RUN]
> libc++abi: terminating due to uncaught exception of type duckdb::IOException: {"exception_type":"IO","exception_message":"Could not truncate file \".tmp/duckdb_temp_storage-3.tmp\": No space left on device","errno":"28"}
> fish: Job 1, 'dbt run --select source:social_…' terminated by signal SIGABRT (Abort)
> /Users/me/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
>   warnings.warn('resource_tracker: There appear to be %d '
> ```
>
> what are next steps to debug this?

This can be fixed by adding a temp_directory setting in the dbt profiles.yml file, which lets DuckDB spill temporary files to local disk when a model needs more memory than the machine has available.

Like this:

```yaml
healthcare_data:
  target: dev
  outputs:
    dev:
      type: duckdb
      threads: 1
      temp_directory: '/.tmp'
      plugins:
        - module: excel
```

You were able to get the sdoh_county model to complete, is that correct?

chrisgebert commented 5 months ago
> how can we add this to s3 bucket next

To write these output files to s3, we'll need to authenticate to the bucket first, using s3_access_key_id and s3_secret_access_key settings in our profiles.yml file, and then pass the necessary parameters when writing the external file there.
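For example, the profile could pull those credentials from environment variables rather than hard-coding them. This is a sketch to adapt, not tested config: the setting names follow DuckDB's S3 options, and the environment variable names are placeholders.

```yaml
healthcare_data:
  target: dev
  outputs:
    dev:
      type: duckdb
      threads: 1
      settings:
        s3_region: us-east-1
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
      plugins:
        - module: excel
```

With the credentials in place, the external models could then point their output locations at `s3://...` paths.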

jaanli commented 5 months ago

thank you @chrisgebert ! it ended up just being an issue with my hard disk being full (that's my fault, sorry!).

here's the full output; it takes about an hour to run on a MacBook Pro:

```
healthcare-data/data_processing on  HEAD (aa330b9) [?] via 🐍 v3.10.14 (.venv) 🕙 47m23s
❯ dbt run --select source:social_determinants_of_health+
20:53:18 Running with dbt=1.7.11
20:53:21 Registered adapter: duckdb=1.7.3
20:53:21 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.healthcare_data.example
20:53:21 Found 23 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
20:53:21
20:53:41 Concurrency: 1 threads (target='dev')
20:53:41
20:53:41 1 of 3 START sql external model main.sdoh_county ............................... [RUN]
20:58:19 1 of 3 OK created sql external model main.sdoh_county .......................... [OK in 278.24s]
20:58:19 2 of 3 START sql external model main.sdoh_tract ................................ [RUN]
21:40:28 2 of 3 OK created sql external model main.sdoh_tract ........................... [OK in 2529.16s]
21:40:28 3 of 3 START sql external model main.sdoh_zipcode .............................. [RUN]
21:58:26 3 of 3 OK created sql external model main.sdoh_zipcode ......................... [OK in 1077.70s]
21:58:26
21:58:26 Finished running 3 external models in 1 hours 5 minutes and 5.16 seconds (3905.16s).
21:58:26
21:58:26 Completed successfully
21:58:26
21:58:26 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
```

i wonder if we can use skypilot (https://cloud.google.com/blog/topics/hpc/salk-institute-brain-mapping-on-google-cloud-with-skypilot) to scale this up with more threads or machines, but that is a longer thread.

congrats! will close this out, awesome teamwork here :)