Closed: chrisgebert closed this pull request 5 months ago
Amazing, thank you @chrisgebert !!
Testing now:
git clone git@github.com:onefact/healthcare-data.git
git checkout origin/chrisgebert-add_dbt_sources
python3 -m venv .venv && source .venv/bin/activate.fish && pip install --upgrade pip && pip install -r requirements.txt
cd data_processing
dbt deps
download syh-dr-csv.zip and move it to ~/data/syh_dr
unzip syh-dr-csv.zip
dbt run
Output:
Re: how to replicate:
I had downloaded these files manually and stored them in the project directory at /data/sdoh, which corresponds to their external_location in the sources.yml file for each of these sources.
Do you have links to where I can also download these files? I assume that is why I get the above errors.
I can port the download script to Python so hopefully it can be part of the dbt model too :)
I can also give you AWS SSO credentials in case you want to test the S3 upload on your end!
P.S. This PR is awesome because it will help us replicate https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3072038 (!!). We have been on this for close to a year and this data is just very hard to find, even confirmed with a few associate professors of law at Columbia who work with the Fed...
@jaanli I had downloaded all the files that are linked in this table with the exception of the Codebook files.
Once you have those files locally, you can run just those downstream models by selecting the source:
dbt run --select source:social_determinants_of_health+
I've looked at the Codebook files since submitting the PR, and the Variable Label in these files may be useful in joining in to the output files for additional context. I'll keep that in mind for later development work.
thank you @chrisgebert !
here's the prompt: https://pastebin.com/G1M2wnXa (using https://gist.github.com/jaanli/5def01b7bd674efd6d9008cf1125986d)
added the resulting file from @anthropics.
execute with:
dbt run --select "models/ahrq.gov/sdoh/download*"
output of dbt run --select source:social_determinants_of_health+:
* what are next steps to debug this?
* how can we add this to the s3 bucket next (or move the dbt run jobs to the cloud?) at https://data.payless.health/#hospital_price_transparency/
❯ dbt run --select source:social_determinants_of_health+
12:33:24 Running with dbt=1.7.11
12:33:24 Registered adapter: duckdb=1.7.3
12:33:24 [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.healthcare_data.example
12:33:24 Found 23 models, 41 sources, 0 exposures, 0 metrics, 527 macros, 0 groups, 0 semantic models
12:33:24
12:33:25 Concurrency: 1 threads (target='dev')
12:33:25
12:33:25 1 of 3 START sql external model main.sdoh_county ............................... [RUN]
12:38:12 1 of 3 OK created sql external model main.sdoh_county .......................... [OK in 287.04s]
12:38:12 2 of 3 START sql external model main.sdoh_tract ................................ [RUN]
libc++abi: terminating due to uncaught exception of type duckdb::IOException: {"exception_type":"IO","exception_message":"Could not truncate file \".tmp/duckdb_temp_storage-3.tmp\": No space left on device","errno":"28"}
fish: Job 1, 'dbt run --select source:social_…' terminated by signal SIGABRT (Abort)
/Users/me/.pyenv/versions/3.10.14/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
what are next steps to debug this?
This can be fixed by adding a temp_directory config setting in the dbt profiles.yml file to allow DuckDB to spill temp files to your local disk when it requires more than your local machine's memory.
Like this:
healthcare_data:
  target: dev
  outputs:
    dev:
      type: duckdb
      threads: 1
      temp_directory: '/.tmp'
      plugins:
        - module: excel
You were able to get the sdoh_county model to complete, is that correct?
* how can we add this to s3 bucket next
To write these output files to s3, we'll need to authenticate to the bucket first, using s3_access_key_id and s3_secret_access_key settings in our profiles.yml file, and then pass the necessary parameters when writing the external file there.
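Something along these lines could work; this is only a sketch (the region, bucket, and environment variable names are placeholders, not tested against this project):

healthcare_data:
  target: dev
  outputs:
    dev:
      type: duckdb
      threads: 1
      temp_directory: '/.tmp'
      extensions:
        - httpfs  # needed by duckdb to read and write s3:// paths
      settings:
        s3_region: us-east-1  # placeholder region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"  # read from the environment
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"  # read from the environment
      plugins:
        - module: excel

A model could then write its output to the bucket with something like {{ config(materialized='external', location='s3://<bucket>/sdoh_county.parquet') }}, with <bucket> as a placeholder.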
thank you @chrisgebert ! it ended up just being an issue with my hard disk being full (that's my fault, sorry!).
here's the full output, takes an hour to run on a macbook pro:
i wonder if we can use skypilot (https://cloud.google.com/blog/topics/hpc/salk-institute-brain-mapping-on-google-cloud-with-skypilot) to scale up the number of threads in the cloud, but that is a longer discussion.
congrats! will close this out, awesome teamwork here :)
I'm adding a few things in this PR, mostly to support the addition of datasets from Social Determinants of Health as mentioned in Issue #5, including a sources.yml file and a profiles.yml file.

1. sources.yml
This configuration file is used by dbt to define different sources, tables, column types, and even local locations of source files in the dbt-duckdb adapter when running relevant dbt commands. This sources.yml file will reduce the need for repeated code across the existing inpatient, outpatient, pharmacy, and provider models for SyH-DR data, though it's important to note that those models are unchanged and so will not refer to these source definitions until modified to do so. If it's preferable to remove these unused source configurations until the models actually use them, I'm happy to do so to reduce potential confusion. I don't have access to that data and would like to perform some testing, or work with someone to do some testing, before changing those models to use these sources.

I also added sources for the other existing models (consumer_price_index and synthea), and the same applies to them: the downstream models are unchanged and will not use these source definitions until updated to do so.

Finally, in this file I defined sources to be used by the Social Determinants of Health models that I describe in more detail below. Each of these uses the built-in dbt-duckdb excel plugin to read the specified Excel sheets into dbt relations that are then referenced in the model transformation logic.
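To give a flavor of what these plugin-backed source definitions look like, here is a trimmed-down sketch; the table name, file path, and sheet name below are illustrative rather than the exact entries in the PR:

sources:
  - name: social_determinants_of_health
    schema: main
    meta:
      plugin: excel  # read these tables with the built-in dbt-duckdb excel plugin
    tables:
      - name: sdoh_county_2020
        meta:
          external_location: "data/sdoh/sdoh_county_2020.xlsx"  # local path to the downloaded file
          sheet_name: "Data"  # the sheet to load from the workbook

Downstream models can then reference the table as {{ source('social_determinants_of_health', 'sdoh_county_2020') }}.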
2. profiles.yml
This is a very simple duckdb profile configuration file that can be embedded within this project so that different users/contributors do not need to maintain and configure the adapter on their own. If we move towards providing s3 access via keys and secrets, we'd implement that through the use of environment variables that can be read by the dbt-duckdb adapter like I mentioned.
3. Social Determinants of Health models
Three models to unpivot and union together all years from the Social Determinants of Health datasets across the various data files: County Data, Zip Code Data, and Census Tract Data.
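As a sketch of the shape these models take (the source table and column names here are illustrative, and the real models differ in the details), each yearly file is unpivoted from one-column-per-variable into a long format, and the per-year results are then unioned:

-- illustrative sketch: unpivot one yearly county file into long format
select
    unpivoted.*,
    2020 as sdoh_year  -- tag the year so unioned rows stay distinguishable
from (
    {{ dbt_utils.unpivot(
        relation=source('social_determinants_of_health', 'sdoh_county_2020'),
        cast_to='varchar',
        exclude=['countyfips'],
        field_name='sdoh_variable',
        value_name='sdoh_value'
    ) }}
) as unpivoted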
I had downloaded these files manually and stored them in the project directory at /data/sdoh, which corresponds to their external_location in the sources.yml file for each of these sources.

I'm certain there is a better way to union these files together (using a combination of dbt_utils.get_relations_by_pattern and dbt_utils.union_relations somehow) but I couldn't get the Jinja syntax to work while also unpivoting each of these.
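For the record, the direction I was attempting looked roughly like the following untested sketch (the schema, table pattern, and column names are illustrative); the complication is performing the unpivot inside the loop:

-- hypothetical sketch: find the yearly relations by pattern, unpivot each, union the results
{% set sdoh_relations = dbt_utils.get_relations_by_pattern('main', 'sdoh_county_%') %}

{% for rel in sdoh_relations %}
(
    {{ dbt_utils.unpivot(
        relation=rel,
        cast_to='varchar',
        exclude=['countyfips'],
        field_name='sdoh_variable',
        value_name='sdoh_value'
    ) }}
)
{% if not loop.last %}union all{% endif %}
{% endfor %}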