orchid-initiative / synthetic-database-project

MIT License

Database Summary Validation #62

Open NickKramer87 opened 10 months ago

NickKramer87 commented 10 months ago

As a database generator, I want to be certain that the synthetic database I am generating is comparable to real-world data so that I can have confidence in the accuracy of the software that will be written using it.


  1. Task "Automated Summary Generation" must be completed first.

Proposed Subtasks:

  1. Create a comparison tool for the summary statistics from step 30 and the real-world summary statistics from HCAI.
  2. Determine an acceptable similarity percentage for the synthetic database summary.
  3. Test the output of the database creation tool to ensure that it meets this threshold.
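
A rough sketch of what the comparison in subtask 1 could look like, using only the Python standard library. Everything here is an assumption for illustration: the function names `percent_similarity` and `pearson` are hypothetical, each summary is assumed to be a dict mapping a category label to a non-negative count or proportion, and the similarity formula (one minus the relative difference, averaged over shared categories) is just one candidate metric, not an agreed definition.

```python
from math import sqrt


def percent_similarity(synthetic, real):
    """Average per-category similarity, as a percentage.

    Assumes both summaries map category labels to non-negative numbers.
    Only categories present in both summaries are compared.
    """
    keys = sorted(set(synthetic) & set(real))
    if not keys:
        raise ValueError("summaries share no categories")
    scores = []
    for key in keys:
        s, r = synthetic[key], real[key]
        denom = max(abs(s), abs(r))
        # Two zeros are treated as a perfect match.
        scores.append(1.0 if denom == 0 else 1.0 - abs(s - r) / denom)
    return 100.0 * sum(scores) / len(scores)


def pearson(synthetic, real):
    """Pearson correlation coefficient over the shared categories."""
    keys = sorted(set(synthetic) & set(real))
    if len(keys) < 2:
        raise ValueError("need at least two shared categories")
    xs = [synthetic[k] for k in keys]
    ys = [real[k] for k in keys]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)
```

Note that the two metrics answer different questions: the correlation coefficient ignores overall scale, so a synthetic summary that is exactly double the real one would still correlate perfectly, while the percent similarity would drop to 50%.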

Acceptance Criteria:

  1. A tool that compares the synthetic and real summaries and reports a percent similarity (or a correlation coefficient, if that is easier).
  2. A brief report specifying the threshold for deeming a dataset similar and the reasons behind that threshold.
  3. A report in which at least five different databases are tested with the tool and all five pass the similarity threshold (or at least 95% of them, if a large number of databases can be tested).
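
The pass rule in criterion 3 could be checked along these lines. This is only a sketch under stated assumptions: `batch_passes` is a hypothetical helper, scores are assumed to be on the same scale as the threshold, and the cutoff for a "large number" of databases is not defined in this issue, so the default of 20 is a placeholder (it is the smallest batch size at which 95% allows a single failure).

```python
def batch_passes(scores, threshold, large_batch_size=20, required_fraction=0.95):
    """Return True if a batch of similarity scores meets criterion 3.

    Small batches (fewer than `large_batch_size`) must pass in full.
    Larger batches need at least `required_fraction` of scores passing.
    `large_batch_size` is an assumed value, not one set by the issue.
    """
    if len(scores) < 5:
        raise ValueError("at least five databases must be tested")
    passed = sum(1 for score in scores if score >= threshold)
    if len(scores) < large_batch_size:
        return passed == len(scores)
    return passed / len(scores) >= required_fraction
```

For example, with a threshold of 90, a batch of five scores passes only if all five are at or above 90, while a batch of twenty may contain one score below 90.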

rileeki commented 10 months ago

Real-world summary statistics on inpatient discharges in California are available here: https://data.chhs.ca.gov/dataset/hospital-inpatient-characteristics-by-facility-pivot-profile