Open jaanli opened 1 year ago
https://github.com/onefact/payless.health-data-processing/tree/main/hospital_price_transparency/schema_maps - idea on how to share the standardized schema.
Potential way to break down:
Added one potential way to store schemas as json files: https://github.com/onefact/payless.health-data-processing/tree/main/hospital_price_transparency/schema
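To make the JSON idea concrete, here is a rough sketch of what one schema file could look like and how it might be applied. The JSON layout, the raw column names, and the `standardize_row` helper are all made up for illustration and are not the format actually used in the repo.

```python
import json

# Hypothetical schema map for one hospital file: raw column name -> standardized name.
# The layout, the placeholder CCN, and the raw column names are illustrative assumptions.
schema_map_json = """
{
  "ccn": "000000",
  "columns": {
    "Gross Charge": "charge",
    "Procedure Description": "description",
    "CPT/HCPCS": "code"
  }
}
"""


def standardize_row(raw_row, schema_map):
    """Rename the raw columns listed in the schema map and drop everything else."""
    columns = schema_map["columns"]
    return {columns[name]: value for name, value in raw_row.items() if name in columns}


schema_map = json.loads(schema_map_json)
raw_row = {"Gross Charge": "1250.00", "Procedure Description": "MRI brain", "Room": "12B"}
standardized = standardize_row(raw_row, schema_map)
print(standardized)
```

Keeping one small JSON file per hospital would mean the mapping can be reviewed and fixed without touching the processing code.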
End result could be a set of parquet files on s3 that can be queried across states, hospitals, codes, as in this example: https://github.com/onefact/payless.health-data-processing/blob/main/hospital_price_transparency/230108-processing-raw-hospital-data-to-parquet-file-example.ipynb
@bwindsor22 / @alecstein say we should just use a SQLite database to start! I think I agree. Then the templating engine can just pull data from it, and it can also live at s3://payless.health
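A minimal sketch of the SQLite starting point, using only Python's built-in `sqlite3` module. The `charges` table, its columns, the sample rows, and the `cheapest_by_state` helper are assumptions for illustration, not an agreed schema.

```python
import sqlite3

# In-memory database for the sketch; a real file path could be used instead.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per hospital and billed item.
conn.execute(
    """
    CREATE TABLE charges (
        ccn TEXT NOT NULL,
        state TEXT NOT NULL,
        description TEXT,
        charge REAL
    )
    """
)

# Made-up sample rows (the CCNs and prices are placeholders, not real data).
rows = [
    ("000001", "NY", "MRI brain", 1250.0),
    ("000002", "NY", "MRI brain", 980.0),
    ("000003", "CA", "MRI brain", 1500.0),
]
conn.executemany("INSERT INTO charges VALUES (?, ?, ?, ?)", rows)
conn.commit()


def cheapest_by_state(connection, description):
    """Return (state, lowest charge) pairs for one item, ordered by state."""
    cursor = connection.execute(
        "SELECT state, MIN(charge) FROM charges "
        "WHERE description = ? GROUP BY state ORDER BY state",
        (description,),
    )
    return cursor.fetchall()


result = cheapest_by_state(conn, "MRI brain")
print(result)
```

A templating step could call a query like this and pass the rows straight into the report template.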
:) exciting!!
easiest could be https://rmoff.net/2023/03/03/aligning-mismatched-parquet-schemas-in-duckdb/
then try dagshub.
cc @margotwagner and @kyloon - the plan is to use
(1) https://github.com/onefact/dbt_hospital_price_transparency to standardize the data and create 4000 schemas.
(2) https://github.com/onefact/payless.health-internationalization to translate to Spanish -- see https://github.com/onefact/payless.health-internationalization/blob/main/story-templates-for-hospitals/hospital-price-report.md for an example template.
(3) take the template (https://github.com/onefact/payless.health-internationalization/blob/main/story-templates-for-hospitals/hospital-price-report.md) and use jinja2 to create a markdown file using Next.js and markdoc.dev (Stripe's documentation software we are using).
(4) activate incremental static regeneration: https://nextjs.org/docs/basic-features/data-fetching/incremental-static-regeneration
- Aggregate the standardized data into parquet files by state, on S3, for four columns at first: CCN, charge, description, state
- Structure S3 bucket by 2-letter state, e.g. "state=NY" or "state=CA"
- Name hospital files by EIN
- Try with regex for "charge"
- Expand regex / fix up algorithm or try NER / build large list of potential column names
- If the above works, then expand to include "code" / code_type columns, etc.
Example pipeline with a simple regex: https://github.com/onefact/payless.health-data-processing/blob/main/hospital_price_transparency/230108-processing-raw-hospital-data-to-parquet-file-example.ipynb
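For reference, a small sketch of the two mechanical pieces of this plan: matching "charge" columns with a regex and building the per-state S3 key. The function names, the sample column names, the placeholder EIN, and the `.parquet` suffix are assumptions for illustration, and this is not the code from the linked notebook.

```python
import re

# Case-insensitive match on "charge" anywhere in a column name (the simple first attempt).
CHARGE_PATTERN = re.compile(r"charge", re.IGNORECASE)
# Two uppercase letters, e.g. "NY" or "CA".
STATE_PATTERN = re.compile(r"[A-Z]{2}")


def find_charge_columns(column_names):
    """Return the column names that look like charge columns, in their original order."""
    return [name for name in column_names if CHARGE_PATTERN.search(name)]


def s3_key(state, ein):
    """Build a key like 'state=NY/<EIN>.parquet' (hypothetical naming convention)."""
    if not STATE_PATTERN.fullmatch(state):
        raise ValueError(f"expected a 2-letter uppercase state code, got {state!r}")
    return f"state={state}/{ein}.parquet"


# Made-up column names from a raw hospital file.
columns = ["Description", "Gross Charge", "CHARGE_AMT", "Payer"]
charge_columns = find_charge_columns(columns)
key = s3_key("NY", "00-0000000")  # placeholder EIN
print(charge_columns, key)
```

The `state=NY` prefix follows the Hive-style partition naming that engines such as DuckDB and Spark can use to filter by state without opening every file.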
cc @rohanbansal12 to help with S3 bucket details!