orchid-initiative / synthetic-database-project

MIT License

Keck Hospital Test #64

Open NickKramer87 opened 9 months ago

NickKramer87 commented 9 months ago

As a database generator, I want to validate the accuracy of the synthetic database by comparing the summary statistics of a database of patients that would go to Keck Hospital to the actual summary statistics from Keck Hospital to determine if significant changes are needed to the database generation program.

Requirements:

  1. Task "Automatic Summary Generation" and "Database Summary Validation" must be completed prior to start.

Potential subtasks:

  1. A radius or list of zip codes for patients who are likely to go to Keck Hospital.
  2. A synthetic database of patients within the area from subtask 1.
  3. A passing grade according to the success criterion from subtask 36.
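
For subtask 1, a radius can be turned into a list of zip codes with a great-circle distance check. The following is only a minimal sketch: the `zip_centroids` mapping, the placeholder keys, and the coordinates are hypothetical stand-ins, and a real run would need a zip-to-centroid lookup table that this issue does not supply.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def zips_within_radius(zip_centroids, center, radius_km):
    """Return the sorted zip codes whose centroid lies within radius_km of center.

    zip_centroids: dict mapping zip code -> (lat, lon); hypothetical input.
    center: (lat, lon) of the hospital; the caller must supply it.
    """
    lat0, lon0 = center
    return sorted(
        z for z, (lat, lon) in zip_centroids.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    )


# Placeholder coordinates for illustration only, not real zip centroids.
example_centroids = {
    "zip_a": (34.00, -118.00),
    "zip_b": (34.05, -118.00),
    "zip_c": (35.00, -118.00),
}
nearby = zips_within_radius(example_centroids, (34.00, -118.00), 10.0)
```

A fixed radius ignores travel time and competing hospitals, so an explicit zip list drawn from real catchment data may end up being the better input.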

Acceptance Criteria:

  1. A tool that will compare the synthetic and real summaries and give a percent similarity or possibly a correlation coefficient if that is easier.
  2. A brief report specifying the threshold for deeming a dataset similar and the reasons behind that threshold.
  3. A report where at least five different databases are tested using the tool from task 3 wherein all five (or 95% if a large number is possible) of the datasets pass the similarity threshold.
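
One possible shape for the comparison tool in criterion 1 is sketched below. It assumes each summary is a dict mapping a category (for example a sex or payer value) to its proportion of records. The choice of metric is an assumption, since the issue leaves it open: percent similarity is taken here as 100 × (1 − total variation distance), with a Pearson correlation offered as the alternative. All function names are hypothetical.

```python
import math


def percent_similarity(synthetic, real):
    """100 * (1 - total variation distance) between two proportion dicts.

    Both dicts are assumed to hold proportions that each sum to 1.
    """
    keys = set(synthetic) | set(real)
    tvd = 0.5 * sum(abs(synthetic.get(k, 0.0) - real.get(k, 0.0)) for k in keys)
    return 100.0 * (1.0 - tvd)


def pearson_r(synthetic, real):
    """Pearson correlation of the two summaries over the union of categories."""
    keys = sorted(set(synthetic) | set(real))
    xs = [synthetic.get(k, 0.0) for k in keys]
    ys = [real.get(k, 0.0) for k in keys]
    n = len(keys)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined when either summary is constant
    return cov / (spread_x * spread_y)


def pass_rate(scores, threshold):
    """Fraction of datasets whose similarity score meets the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# Illustrative numbers only, not real Keck statistics.
synthetic_sex = {"F": 0.5, "M": 0.5}
real_sex = {"F": 0.6, "M": 0.4}
similarity = percent_similarity(synthetic_sex, real_sex)
```

`pass_rate` maps onto criterion 3: with five datasets it should return 1.0, and with a large batch the target would be at least 0.95.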

TravisHaussler commented 8 months ago

It's possible to override the hospitals list to only include Keck and then set the location for the run to be somewhat close. We did a test run with Pasadena and it produced results.
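
One way to build such a single-hospital list is to filter the full hospitals file down to matching rows. This is a sketch under assumptions: it presumes the hospital list is a CSV with a `name` column, and the helper name is hypothetical. How the override was actually done for the test run is not shown in this thread.

```python
import csv


def keep_matching_hospitals(src, dst, needle, name_column="name"):
    """Copy only rows whose name contains `needle` (case-insensitive).

    src and dst are open text file objects; name_column is an assumed
    column name and should be adjusted to the real file layout.
    Returns the number of rows kept.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if needle.lower() in (row[name_column] or "").lower():
            writer.writerow(row)
            kept += 1
    return kept
```

Checking that the returned count is exactly 1 before starting a run guards against a name mismatch that would leave the list empty or include an unintended facility.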

TravisHaussler commented 8 months ago

Sub task notes:

rileeki commented 8 months ago

Medicare did an analysis comparing Synthea's data to real Medicare claims data. This might be something of a model for the analysis. See pages 13-27 of this document.

TravisHaussler commented 8 months ago

https://github.com/orchid-initiative/synthetic-database-project/blob/main/csv_formatted_data_09-11-2023_134827.csv

Here is a Synthea run of 500 male and 500 female patients in the Los Angeles area, with Keck as the only possible hospital.

rileeki commented 8 months ago

Thanks, @TravisHaussler! lol They all still live in Massachusetts somehow.

rileeki commented 8 months ago

Hm, the layout of this doesn't seem quite right and it looks like the diagnosis codes are still SNOMED. Could you upload the log and fixed-width output too? @TravisHaussler

TravisHaussler commented 8 months ago

I’ll take a look tomorrow morning. I am surprised the patients themselves didn’t generate in the right location at least; that seems strange to me. I’ll check the diagnosis codes too, that’s odd (we expect the procedure ones to still be SNOMED, though).
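
A quick way to check which code system a column holds is to test the shape of each value. The sketch below is a heuristic only, and the patterns and function name are assumptions: it treats a dotless code that starts with a letter followed by a digit (3 to 7 characters) as ICD-10 diagnosis style, and an all-digit identifier of 6 to 18 digits as SNOMED CT style. It does not validate codes against either terminology.

```python
import re

# Dotless ICD-10 diagnosis shape, e.g. "Z9851" or "S83519".
ICD10_DX_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z]{1,5}$")
# SNOMED CT concept identifiers are purely numeric, 6 to 18 digits.
SNOMED_PATTERN = re.compile(r"^[0-9]{6,18}$")


def guess_code_system(code):
    """Guess the code system from the shape of a single code string."""
    cleaned = code.strip().upper().replace(".", "")
    if ICD10_DX_PATTERN.match(cleaned):
        return "ICD-10"
    if SNOMED_PATTERN.match(cleaned):
        return "SNOMED CT"
    return "unknown"


def summarize_column(codes):
    """Count how many non-blank codes fall into each guessed system."""
    counts = {}
    for code in codes:
        if code.strip():
            system = guess_code_system(code)
            counts[system] = counts.get(system, 0) + 1
    return counts
```

ICD-10-PCS procedure codes are seven alphanumeric characters and can begin with a digit, so they would come back as "unknown" here and would need their own pattern once the procedure columns are mapped.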

TravisHaussler commented 8 months ago

Here is what I see in the first several rows of that file (with the many extra blank diagnosis and procedure code fields removed). The diagnosis codes are ICD, while the procedure codes are still SNOMED, and all the addresses are in CA:

| Type of Care | Facility Identification Number | Facility Name | Date of Birth | Sex | Ethnicity | Race | Not in Use 1 | Admission Date | Point of Origin | Route of Admission | Type of Admission | Discharge Date | Principal Diagnosis | Present on Admission for Principal Diagnosis | Diagnosis 2 | Present on Admission 2 | Diagnosis 3 | Present on Admission 3 | Diagnosis Codes | Present on Admission | Principal Procedure Code | Principal Procedure Date | Procedure Code 2 | Procedure Date 2 | Procedure Codes | Procedure Dates | External Causes of Morbidity and Present on Admission | Patient SSN | Disposition of Patient | Total Charges | Abstract Record Number (Optional) | Prehospital Care & Resuscitation - DNR Order | Payer Category | Type of Coverage | Plan Code Number | Preferred Spoken Language | Patient Address - Address Number and Street Name | Patient Address - City | Patient Address - State | Patient Address - Zip Code | Patient Address - Country Code | Patient Address - Homeless Indicator |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 19980816 | 6 | 3 | 2 | 19980817 | Z9851 | N |  |  |  |  | ('Z9851',) | ('N',) |  |  |  |  |  |  |  | 999713129 | 85 | 12088 | 1 | Y | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 19980816 | 1 | 3 | 3 | 19980817 | Z9851 | N |  |  |  |  | ('Z9851',) | ('N',) |  |  |  |  |  |  |  | 999713129 | 84 | 12088 | 2 | N | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 19980816 | 6 | 3 | 4 | 19980817 | Z9851 | N |  |  |  |  | ('Z9851',) | ('N',) |  |  |  |  |  |  |  | 999713129 | 85 | 12088 | 3 | N | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 19980816 | 8 | 3 | 5 | 19980817 | Z9851 | N |  |  |  |  | ('Z9851',) | ('N',) |  |  |  |  |  |  |  | 999713129 | 50 | 12088 | 4 | Y | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 20180807 | 1 | 1 | 1 | 20180810 | S83519 | U | S83519 | Y |  |  | ('S83519', 'S83519') | ('U', 'Y') | 699253003 | 20180807 | 133899007 | 20180807 | ('699253003', '133899007') | ('20180807', '20180807') |  | 999713129 | 86 | 29949 | 5 | Y | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 20180807 | 2 | 3 | 9 | 20180810 | S83519 | U | S83519 | Y |  |  | ('S83519', 'S83519') | ('U', 'Y') | 699253003 | 20180807 | 133899007 | 20180807 | ('699253003', '133899007') | ('20180807', '20180807') |  | 999713129 | 21 | 29949 | 6 | N | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 20180807 | E | 1 | 1 | 20180810 | S83519 | U | S83519 | Y |  |  | ('S83519', 'S83519') | ('U', 'Y') | 699253003 | 20180807 | 133899007 | 20180807 | ('699253003', '133899007') | ('20180807', '20180807') |  | 999713129 | 81 | 29949 | 7 | Y | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19621021 | F | E2 | R5 |  | 20180807 | 5 | 3 | 3 | 20180810 | S83519 | U | S83519 | Y |  |  | ('S83519', 'S83519') | ('U', 'Y') | 699253003 | 20180807 | 133899007 | 20180807 | ('699253003', '133899007') | ('20180807', '20180807') |  | 999713129 | 62 | 29949 | 8 | Y | 6 |  |  |  | 140 Flatley Arcade Apt 56 | Los Angeles | California | 90036 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19971128 | F | E2 | R5 |  | 20230308 | F | 3 | 2 | 20230309 | Z9851 | W | Z302 | W | Z9851 | N | ('Z9851', 'Z302', 'Z9851') | ('W', 'W', 'N') | 287664005 | 20230308 | 133899007 | 20230308 | ('287664005', '133899007') | ('20230308', '20230308') |  | 999904365 | 91 | 16155 | 9 | N | 6 |  |  |  | 707 Considine Way Apt 91 | Los Angeles | California | 90061 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19971128 | F | E2 | R5 |  | 20230308 | 4 | 3 | 9 | 20230309 | Z9851 | W | Z302 | W | Z9851 | N | ('Z9851', 'Z302', 'Z9851') | ('W', 'W', 'N') | 287664005 | 20230308 | 133899007 | 20230308 | ('287664005', '133899007') | ('20230308', '20230308') |  | 999904365 | 0 | 16155 | 10 | Y | 6 |  |  |  | 707 Considine Way Apt 91 | Los Angeles | California | 90061 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19971128 | F | E2 | R5 |  | 20230308 | 1 | 3 | 9 | 20230309 | Z9851 | W | Z302 | W | Z9851 | N | ('Z9851', 'Z302', 'Z9851') | ('W', 'W', 'N') | 287664005 | 20230308 | 133899007 | 20230308 | ('287664005', '133899007') | ('20230308', '20230308') |  | 999904365 | 87 | 16155 | 11 | Y | 6 |  |  |  | 707 Considine Way Apt 91 | Los Angeles | California | 90061 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19971128 | F | E2 | R5 |  | 20230308 | 1 | 3 | 5 | 20230309 | Z9851 | W | Z302 | W | Z9851 | N | ('Z9851', 'Z302', 'Z9851') | ('W', 'W', 'N') | 287664005 | 20230308 | 133899007 | 20230308 | ('287664005', '133899007') | ('20230308', '20230308') |  | 999904365 | 84 | 16155 | 12 | Y | 6 |  |  |  | 707 Considine Way Apt 91 | Los Angeles | California | 90061 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19690208 | F | E2 | R5 |  | 20040820 | 8 | 3 | 5 | 20040821 | Z9851 | Y |  |  |  |  | ('Z9851',) | ('Y',) |  |  |  |  |  |  |  | 999925265 | 83 | 9185 | 13 | N | 8 |  |  |  | 320 Bayer Crossing Suite 37 | Los Angeles | California | 90291 | US |  |
| 1 | 10735 | KECK MEDICAL CENTER OF USC | 19690208 | F | E2 | R5 |  | 20040820 | 6 | 3 | 4 | 20040821 | Z9851 | Y |  |  |  |  | (' |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

rileeki commented 8 months ago

@TravisHaussler You are totally right. I'm not even sure what I was looking at... I'm sorry about that!

rileeki commented 8 months ago

@TravisHaussler I'll pick this up for the next two weeks. I plan to slice and dice the data you provided and provide a report at our next check-in comparing this dataset to the publicly available summary statistics.

TravisHaussler commented 8 months ago

OK, let me know if you want it run with larger numbers too. At first I hit programmatic errors reading in the huge Synthea CSVs, but that’s a little better now that we only load the specific columns we need. I can also run it a handful of separate times and concatenate the results, I guess, but that may be slightly different statistically.
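
The two ideas in this comment, loading only the needed columns and concatenating several runs, can be sketched with the standard library alone. The column names and helper names below are illustrative assumptions, not the project's actual loader. Each row is tagged with a `run_id` so per-run statistics can still be compared after concatenation.

```python
import csv


def read_columns(fileobj, wanted):
    """Yield one dict per CSV row, keeping only the wanted columns.

    Rows are streamed, so unused columns are never held in memory
    beyond the row currently being read.
    """
    reader = csv.DictReader(fileobj)
    for row in reader:
        yield {col: row[col] for col in wanted}


def concat_runs(run_files, wanted):
    """Concatenate rows from several runs, tagging each with its run number.

    run_files: iterable of open text file objects, one per run.
    """
    combined = []
    for run_id, fileobj in enumerate(run_files, start=1):
        for row in read_columns(fileobj, wanted):
            row["run_id"] = run_id
            combined.append(row)
    return combined
```

Whether concatenated runs match one large run statistically depends on the runs being independent, for example generated with different random seeds; keeping `run_id` makes it possible to check for between-run differences before pooling.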