orchid-initiative / synthetic-database-project


Simulate the 2021 data for Keck Hospital of USC (OSHPD #106194219) & compare to publicly available stats #38

Open rileeki opened 1 year ago

rileeki commented 1 year ago

As a product owner, I want to see how much we can use Synthea's built-in options and how much we'll have to code ourselves so that I can evaluate our approach and consider revising the target output.

This task will require some more investigation into what inputs we give when running Synthea. As of this writing, the program just specifies the entire simulated population count by gender, then takes all the inpatient records we see and hard-codes the specified hospital ID. For this task, I want to see a dataset with a specific number of inpatient hospital records in a given time period, and we want those people to live realistically near the specified hospital.

I'm not sure how to get there... if we're contorting the Synthea output a whole bunch to get the specific discharge record counts to match, we might want to re-evaluate our approach. I'm also interested in seeing how our simulated output lines up with the real data for the hospital. Getting a concrete example will help us decide if we need to change the goal for our output.
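
For the count comparison, here is a minimal sketch of how the simulated inpatient total for a given year could be tallied. It assumes the encounters are available as CSV rows with a `START` timestamp column and an `ENCOUNTERCLASS` column; both column names are assumptions about the export, the helper name `count_inpatient` is hypothetical, and the sample rows are made up.

```python
import csv
import io


def count_inpatient(rows, year, class_col="ENCOUNTERCLASS", start_col="START"):
    """Count rows whose class is 'inpatient' and whose start date falls in `year`.

    The column names are assumptions about the export format, not confirmed.
    """
    prefix = f"{year}-"
    return sum(
        1
        for row in rows
        if row[class_col].strip().lower() == "inpatient"
        and row[start_col].startswith(prefix)
    )


# Tiny made-up sample standing in for an encounters export.
SAMPLE = """Id,START,ENCOUNTERCLASS
a,2021-01-05T10:00:00Z,inpatient
b,2021-03-02T08:30:00Z,ambulatory
c,2020-12-31T23:00:00Z,inpatient
d,2021-07-19T14:15:00Z,inpatient
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
simulated = count_inpatient(sample_rows, 2021)
print(simulated)
```

The resulting number could then be set next to the hospital's published discharge count for the same year to see how far off the simulation is.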

rileeki commented 1 year ago

Image

rileeki commented 1 year ago

I accessed the above stats through the HCAI website: https://data.chhs.ca.gov/dataset/hospital-inpatient-characteristics-by-facility-pivot-profile

I'm attaching a copy here.

2021pddpivot (3).xlsx

TravisHaussler commented 1 year ago

It looks like we can specify an override for hospitals.csv (contained in the Synthea jar file). I created an "overrides" folder in our base directory where these overrides can be stored, and left just the Keck line in the hospitals.csv there. In the Synthea settings I specify: generate.providers.hospitals.default_file = overrides/hospitals.csv. Trying a run now.
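
A rough sketch of how the trimmed override file could be produced rather than edited by hand. It keeps the header and any row that mentions a search string in any field, so it does not depend on the real column layout. The helper name `keep_matching_rows` is hypothetical and the sample data is made up.

```python
import csv
import io


def keep_matching_rows(src, dst, needle):
    """Copy the header plus every row containing `needle` (case-insensitive) in any field.

    Returns the number of data rows kept.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # header row is always kept
    kept = 0
    for row in reader:
        if any(needle.lower() in field.lower() for field in row):
            writer.writerow(row)
            kept += 1
    return kept


# Made-up stand-in for the bundled file; the real column layout may differ.
SAMPLE = """id,name,city
1,EXAMPLE GENERAL HOSPITAL,LOS ANGELES
2,KECK HOSPITAL OF USC,LOS ANGELES
"""

out = io.StringIO()
kept = keep_matching_rows(io.StringIO(SAMPLE), out, "keck")
print(kept)
print(out.getvalue())
```

With real files, the same function can be called with `open(path, newline="")` handles for the extracted hospitals.csv and for overrides/hospitals.csv.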