Map all hospital files to common schema

onefact / payless.health-data-processing

Data standards for the Payless Health ecosystem of hospital prices and insurance prices and associated data.

Apache License 2.0


Map all hospital files to common schema #4

Open jaanli opened 1 year ago

jaanli commented 1 year ago

[ ] try https://rmoff.net/2023/03/03/aligning-mismatched-parquet-schemas-in-duckdb/ in colab notebook
[ ] try dagshub + labelstudio
[ ] try app.boilingdata.com

Aggregate the standardized data into parquet files by state, on S3, for four columns at first: CCN, charge, description, state.
Structure the S3 bucket by 2-letter state, e.g. “state=NY” or “state=CA”.
Name hospital files by EIN.
Try with a regex for “charge”.
Expand the regex / fix up the algorithm, or try NER / build a large list of potential column names.
If the above works, then expand to include “code” / code_type columns, etc.

Example pipeline with a simple regex: https://github.com/onefact/payless.health-data-processing/blob/main/hospital_price_transparency/230108-processing-raw-hospital-data-to-parquet-file-example.ipynb
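A minimal sketch of the regex step and the S3 key layout described above, for illustration only. The helper names (`find_charge_column`, `s3_key`), the sample headers, and the placeholder EIN are assumptions, not taken from the linked notebook.

```python
import re
from typing import Optional

# Assumed pattern: any header containing "charge", case-insensitive.
CHARGE_PATTERN = re.compile(r"charge", re.IGNORECASE)


def find_charge_column(headers: list[str]) -> Optional[str]:
    """Return the first header that matches the charge pattern, or None."""
    for header in headers:
        if CHARGE_PATTERN.search(header):
            return header
    return None


def s3_key(state: str, ein: str) -> str:
    """Build a partition-style key: state=<2-letter code>/<EIN>.parquet."""
    if not re.fullmatch(r"[A-Za-z]{2}", state):
        raise ValueError(f"expected a 2-letter state code, got {state!r}")
    return f"state={state.upper()}/{ein}.parquet"


# Made-up headers and a placeholder EIN, just to exercise the helpers.
headers = ["Procedure Description", "Gross Charge", "CPT Code"]
print(find_charge_column(headers))
print(s3_key("ny", "12-3456789"))
```

A single-word pattern like this will miss headers such as "Price" or "Amount", which is where the larger list of potential column names would come in.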

cc @rohanbansal12 to help with s3 bucket details !

jaanli commented 1 year ago

https://github.com/onefact/payless.health-data-processing/tree/main/hospital_price_transparency/schema_maps - idea on how to share the standardized schema.

Potential way to break down:

word2vec similarity, token similarity, entity normalization (suffixes, etc)
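As a rough sketch of the token-similarity and suffix-normalization ideas (not word2vec), one could score each raw header against a list of known aliases with Jaccard overlap. The alias list, the threshold, and the function names below are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical alias list: standard column name -> headers seen in the wild.
CANONICAL = {
    "charge": ["gross charge", "standard charge", "price"],
    "description": ["procedure description", "service description", "item"],
    "code": ["cpt code", "hcpcs code", "billing code"],
}


def normalize(name: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and strip a simple plural suffix."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]
    return {
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Token overlap: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(header: str, threshold: float = 0.3) -> Optional[str]:
    """Return the standard column whose name or alias overlaps most, or None."""
    tokens = normalize(header)
    best, best_score = None, 0.0
    for canonical, aliases in CANONICAL.items():
        for alias in [canonical, *aliases]:
            score = jaccard(tokens, normalize(alias))
            if score > best_score:
                best, best_score = canonical, score
    return best if best_score >= threshold else None


print(best_match("Gross_Charges"))
print(best_match("Zip"))
```

Headers that fall below the threshold return `None`, so they can be set aside for manual labeling rather than silently mapped.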

jaanli commented 1 year ago

Added one potential way to store schemas as json files: https://github.com/onefact/payless.health-data-processing/tree/main/hospital_price_transparency/schema
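For illustration, a per-hospital schema file could be as small as a mapping from raw headers to standard column names. The JSON layout, field names, and values below are assumptions and may not match the files in the linked folder.

```python
import json

# Hypothetical schema file for one hospital: raw header -> standard column.
schema_json = """
{
  "hospital_ein": "12-3456789",
  "columns": {
    "Gross Charge": "charge",
    "Procedure Description": "description"
  }
}
"""


def apply_schema(row: dict, schema: dict) -> dict:
    """Rename mapped columns and drop anything the schema does not mention."""
    mapping = schema["columns"]
    return {mapping[key]: value for key, value in row.items() if key in mapping}


schema = json.loads(schema_json)
raw_row = {
    "Gross Charge": "1250.00",
    "Procedure Description": "MRI brain",
    "Rev Code": "0610",
}
print(apply_schema(raw_row, schema))
```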

jaanli commented 1 year ago

End result could be a set of parquet files on s3 that can be queried across states, hospitals, codes, as in this example: https://github.com/onefact/payless.health-data-processing/blob/main/hospital_price_transparency/230108-processing-raw-hospital-data-to-parquet-file-example.ipynb

jaanli commented 1 year ago

@bwindsor22 / @alecstein say we should just use a SQLite database to start! I think I agree. Then the templating engine can just pull data from it, and it can also live at s3://payless.health :) Exciting!!
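A minimal sketch of what the SQLite starting point might look like, using the four initial columns from the issue description. The table name, the sample rows, and the query helper are made up for illustration.

```python
import sqlite3

# In-memory database for the sketch; a real file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE charges (
        ccn TEXT NOT NULL,
        charge REAL,
        description TEXT,
        state TEXT NOT NULL
    )
    """
)

# Placeholder rows, not real hospital data.
rows = [
    ("330101", 1250.0, "MRI brain", "NY"),
    ("050454", 980.0, "MRI brain", "CA"),
    ("330101", 75.0, "Blood panel", "NY"),
]
conn.executemany("INSERT INTO charges VALUES (?, ?, ?, ?)", rows)
conn.commit()


def average_charge_by_state(conn: sqlite3.Connection, description: str) -> list:
    """Average charge per state for one description, ordered by state."""
    cursor = conn.execute(
        "SELECT state, AVG(charge) FROM charges "
        "WHERE description = ? GROUP BY state ORDER BY state",
        (description,),
    )
    return cursor.fetchall()


print(average_charge_by_state(conn, "MRI brain"))
```

A templating step could call a helper like this to pull the values it needs for each report.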

jaanli commented 1 year ago

easiest could be https://rmoff.net/2023/03/03/aligning-mismatched-parquet-schemas-in-duckdb/

then try dagshub.
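DuckDB's `read_parquet` has a `union_by_name` option for combining files whose columns differ. The plain-Python sketch below only illustrates that idea on in-memory rows; it is not DuckDB code, and the function name and sample rows are hypothetical.

```python
def union_by_name(tables: list) -> list:
    """Align rows from tables with different columns; missing values become None."""
    # Collect every column name in first-seen order.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit every row with the full column set.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


# Two made-up hospital extracts with mismatched columns.
hospital_a = [{"ccn": "330101", "charge": 1250.0}]
hospital_b = [{"ccn": "050454", "description": "MRI brain"}]

for row in union_by_name([hospital_a, hospital_b]):
    print(row)
```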

jaanli commented 1 year ago

cc @margotwagner and @kyloon - the plan is to use

(1) https://github.com/onefact/dbt_hospital_price_transparency to standardize the data and create 4000 schemas.

(2) https://github.com/onefact/payless.health-internationalization to translate to Spanish -- see the example template: https://github.com/onefact/payless.health-internationalization/blob/main/story-templates-for-hospitals/hospital-price-report.md

(3) take the template (https://github.com/onefact/payless.health-internationalization/blob/main/story-templates-for-hospitals/hospital-price-report.md) and use jinja2 to create a markdown file using Next.js and markdoc.dev (Stripe's documentation software, which we are using).

(4) activate incremental static regeneration: https://nextjs.org/docs/basic-features/data-fetching/incremental-static-regeneration
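To make step (3) concrete, here is a rough sketch of filling a per-language markdown template from one standardized row. It uses the standard library's `string.Template` as a stand-in for jinja2 so it runs without extra dependencies; the template text, field names, and sample row are assumptions and do not come from the linked hospital-price-report.md.

```python
from string import Template

# Hypothetical templates; "$$" is a literal dollar sign in string.Template.
TEMPLATES = {
    "en": Template("# $hospital\n\nThe listed charge for $description is $$${charge}."),
    "es": Template("# $hospital\n\nEl cargo publicado para $description es $$${charge}."),
}


def render_report(lang: str, row: dict) -> str:
    """Fill the template for one language from one standardized row."""
    values = {
        "hospital": row["hospital"],
        "description": row["description"],
        "charge": f"{row['charge']:.2f}",
    }
    return TEMPLATES[lang].substitute(values)


# Placeholder row, not real hospital data.
row = {"hospital": "Example Hospital", "description": "MRI brain", "charge": 1250.0}
print(render_report("en", row))
print(render_report("es", row))
```

With jinja2 the same shape would apply: one template per language, rendered once per hospital into a markdown file that the site build picks up.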