jaanli opened 6 months ago
I started looking into this over the weekend and it appears there are 3 different datasets here:
I downloaded these files and added sources for them to the sources.yml file I had started working on. Unioning all the separate year files together is pretty straightforward using dbt-duckdb's built-in excel plugin combined with a few dbt macros. I'll have a pull request with those updates ready, maybe later this week. The only question I have is whether the output should be partitioned in any way; the largest Parquet file (the tract data) is 1.6 GB once generated.
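To make that concrete, here's a rough sketch of the kind of union model I mean, assuming sources.yml registers one table per year via the excel plugin (all source/table names below are placeholders):

```sql
-- models/sdoh_county_all_years.sql (placeholder name)
-- Union the per-year Excel sources into one table, tagging each row
-- with the year it came from.
{% set years = [2018, 2019, 2020] %}

{% for year in years %}
select
    {{ year }} as source_year,
    *
from {{ source('sdoh', 'sdoh_' ~ year ~ '_county') }}
{% if not loop.last %}union all by name{% endif %}
{% endfor %}
```

`union all by name` is DuckDB syntax that matches columns by name rather than position, which helps if the column set drifts a bit between years.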
There seem to be a number of separate for-purchase database files at various levels, all of which require data use agreements. We'll need additional support here in order to get access to the data at all.
These data files are easy to access, in multiple formats, and there is an additional repo with very good documentation and workshops too. I'm guessing the Full Year Population Characteristics datasets are the most useful, but I'm still trying to figure that out.
Does my understanding here match yours @jaanli?
That sounds great, you're the best @chrisgebert!
This is an excellent test of the infrastructure we have to share work and resources here with the wider community.
I can add these Parquet files to our S3 bucket: https://data.payless.health/ (there are several multi-gigabyte Parquet files there already, such as the one from the first homework of last summer's course: https://colab.research.google.com/github/onefact/datathinking.org-codespace/blob/main/notebooks/princeton-university/week-1-visualizing-33-million-phone-calls-in-new-york-city.ipynb -- this actually ended up in a MotherDuck blog post! https://motherduck.com/blog/introducing-column-explorer/)
Or I'm happy to give you permissions to upload as well -- I don't know how to set up dbt to connect to S3 yet :S
Awesome work, go team!
We can define the S3 settings that DuckDB will use when writing external files within the dbt-duckdb profile, like this:
```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev
```
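With those settings in place, a model can be written straight to S3 as Parquet using dbt-duckdb's external materialization. A minimal sketch, with a placeholder bucket path and model names:

```sql
-- models/sdoh_tract.sql (placeholder): materialize the model as a
-- Parquet file on S3 instead of a table in the local DuckDB file.
{{ config(
    materialized='external',
    location='s3://my-bucket/sdoh/sdoh_tract.parquet',
    format='parquet'
) }}

select * from {{ ref('sdoh_tract_all_years') }}
```

DuckDB's COPY also supports PARTITION_BY, so the partitioning question from above could probably be handled at this step as well, though I haven't tried that yet.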
Or we can use the `use_credential_provider` property, which I haven't used before but I believe supports any AWS method of authentication.
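If I'm reading the dbt-duckdb docs right, that variant would look roughly like this; it needs boto3 installed and picks up credentials from the standard AWS chain (environment variables, ~/.aws/credentials, instance profiles, etc.):

```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: /tmp/dbt.duckdb
      # Fetch AWS credentials via boto3 instead of static keys
      use_credential_provider: aws
      settings:
        s3_region: my-aws-region
  target: dev
```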
Hi @jaanli,
I don't have permission to create a pull request on this repository. Would you rather I commit my code to a fork, or how would you prefer to handle contributions?
Thanks for letting me know @chrisgebert! Just added you, please let me know if that doesn't work. Excited to try out the S3 settings! Haven't used that before :)
If anyone has cycles, these other datasets would be good to port to our DuckDB/dbt/Observable Framework trifecta:
https://www.ahrq.gov/sdoh/data-analytics.html
Example notebook to ease this process of importing data and generating boilerplate dbt/duckdb models:
https://colab.research.google.com/github/onefact/electronic-health-records/blob/main/notebooks/loading_physionet_mimic_data.ipynb
Comment here if anyone is interested in tackling this!