synthetichealth / synthea

Synthetic Patient Population Simulator
https://synthetichealth.github.io/synthea
Apache License 2.0

Comparison with Real vs Generated Data #854

Open BoPengGit opened 3 years ago

BoPengGit commented 3 years ago

Hi,

My name is Bo and I'm a Data Scientist for Dovel Technologies. I'm thinking about competing in the Synthetic Health Data Challenge.

Here is my Kaggle machine learning profile: kaggle.com/bopengiowa

I talked to the members of the data challenge, and I want to build one machine learning model on the synthetic data generated by Synthea and another on the real data that Synthea modeled the generated data on.

Can you help me get started? How can I get the data that Synthea generated, and also the real data that the generated data was based on?

Thank you,

BoPengGit commented 3 years ago

Basically I am trying to validate the accuracy of a machine learning model based on the generated data for the Synthetic Health Data Challenge.

I'm trying to:

Step 1: Set aside some real data for validation.
Step 2: Take the rest of the real data, build a machine learning model, and log its accuracy.
Step 3: Take the real data and use Synthea to generate fake data.
Step 4: Build a machine learning model on the fake data, then validate its accuracy on the same held-out real data that was used to validate the model trained on real data.
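The four steps above amount to a "train on synthetic, test on real" comparison. A minimal sketch of that evaluation loop follows, using only the Python standard library. Everything in it is a stand-in: `make_toy_rows` is a hypothetical helper that produces random toy rows (not patient data and not Synthea output), and the "model" is a single threshold rather than a real classifier.

```python
import random

def make_toy_rows(n, shift, seed):
    """Hypothetical stand-in for a patient table: (feature, label) pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        # The feature is drawn around a label-dependent mean; `shift` lets the
        # "synthetic" set differ slightly from the "real" one.
        x = rng.gauss((1.0 if label else -1.0) + shift, 1.0)
        rows.append((x, label))
    return rows

def fit_threshold(rows):
    """Trivial 'model': predict positive above the midpoint of the class means."""
    pos = [x for x, y in rows if y]
    neg = [x for x, y in rows if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def accuracy(threshold, rows):
    """Fraction of rows whose thresholded prediction matches the label."""
    return sum((x > threshold) == y for x, y in rows) / len(rows)

# Toy placeholders only: neither set is real or Synthea-generated.
real = make_toy_rows(1000, shift=0.0, seed=1)
synthetic = make_toy_rows(1000, shift=0.3, seed=2)

holdout, real_train = real[:200], real[200:]   # Step 1: hold out real data
model_real = fit_threshold(real_train)         # Step 2: train on real
model_synth = fit_threshold(synthetic)         # Steps 3-4: train on synthetic

# Both models are scored on the same real holdout.
acc_real = accuracy(model_real, holdout)
acc_synth = accuracy(model_synth, holdout)
print(f"trained on real, tested on real holdout:      {acc_real:.3f}")
print(f"trained on synthetic, tested on real holdout: {acc_synth:.3f}")
```

The gap between the two accuracies is the quantity of interest. A small gap suggests the synthetic data preserves the signal the model relies on.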

How do I get the real and fake data using Synthea?

Thank you for your help!

jawalonoski commented 3 years ago

@BoPengGit You can download pregenerated synthetic data from https://synthea.mitre.org/downloads, or if you want the latest improvements, you can generate the data yourself using the source code (this git repository).

There is no "real data" for you to access. That is a fundamental misunderstanding of how Synthea works. Synthea does not use a real data set. Instead, it uses models of disease progression based on the medical literature, and a modeling and simulation process generates the longitudinal medical history of each synthetic patient according to these disease models. See https://github.com/synthetichealth/synthea/wiki/Getting-Started or https://doi.org/10.1093/jamia/ocx079 for more information.
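To make the idea of simulating from a disease model concrete, here is a rough sketch of a state machine stepped once per simulated year. It is an illustration only, not Synthea's code or module format: the `TOY_MODULE` states, the transition probabilities, and the `simulate_patient` helper are all invented for this example.

```python
import random

# Hypothetical toy module. State names and probabilities are invented for
# illustration and are not taken from any Synthea module or from literature.
TOY_MODULE = {
    "Healthy": [("Healthy", 0.95), ("Condition_Onset", 0.05)],
    "Condition_Onset": [("Managed", 1.0)],
    "Managed": [("Managed", 0.90), ("Complication", 0.10)],
    "Complication": [],  # terminal state: no outgoing transitions
}

def simulate_patient(module, start, years, rng):
    """Return a list of (year, state) entries, one per state change."""
    history = [(0, start)]
    state = start
    for year in range(1, years + 1):
        transitions = module[state]
        if not transitions:
            break  # reached a terminal state
        names = [name for name, _ in transitions]
        weights = [weight for _, weight in transitions]
        next_state = rng.choices(names, weights=weights, k=1)[0]
        if next_state != state:
            history.append((year, next_state))
        state = next_state
    return history

rng = random.Random(42)
population = [simulate_patient(TOY_MODULE, "Healthy", 50, rng) for _ in range(1000)]

# Count patients whose simulated history ever includes the onset state.
ever_onset = sum(
    any(state == "Condition_Onset" for _, state in history)
    for history in population
)
print(f"{ever_onset} of {len(population)} toy patients developed the condition")
```

No patient record is read at any point. The only inputs are the transition probabilities, which in a real module would be drawn from published sources.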

Even if I did have "real data" (which I do not), I would be unable to share it because of the Personally Identifiable Information (PII) and Protected Health Information (PHI) that it might contain. Even if there were deidentified data, there would still be obstacles to sharing it -- which is why we are generating synthetic data in the first place.

fdefalco commented 3 years ago

While I sadly came across the challenge too late to participate, my organization and the research communities I belong to have access to 500+ million lives' worth of "real data". While I wouldn't be able to provide access to any of it either, one of our goals this year is going to be to benchmark the synthea data against some of our "real data" sources. Hopefully we can provide feedback to the disease models to improve their characteristics to more closely represent what we find in the wild. We use Synthea data extensively for testing our tools and methods, so it's the least we can do.
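One simple form such a benchmark could take is a per-condition prevalence comparison. The sketch below assumes a conditions file with `PATIENT` and `DESCRIPTION` columns, as in Synthea's CSV export (check the header of your own files). The `SAMPLE` rows, the `REFERENCE` rates, and both helper functions are placeholders made up for illustration, not measurements from any data source.

```python
import csv
import io

def condition_prevalence(conditions_csv, n_patients):
    """Fraction of patients with at least one record per condition description."""
    patients_by_condition = {}
    for row in csv.DictReader(conditions_csv):
        patients_by_condition.setdefault(row["DESCRIPTION"], set()).add(row["PATIENT"])
    return {name: len(ids) / n_patients for name, ids in patients_by_condition.items()}

def compare(synthetic, reference):
    """Rows of (condition, synthetic, reference, difference), largest gap first."""
    rows = []
    for name, ref_rate in reference.items():
        syn_rate = synthetic.get(name, 0.0)
        rows.append((name, syn_rate, ref_rate, syn_rate - ref_rate))
    return sorted(rows, key=lambda r: abs(r[3]), reverse=True)

# Placeholder rows standing in for a conditions file; not real output.
SAMPLE = """START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
2010-01-01,,p1,e1,c1,Condition A
2011-01-01,,p1,e2,c1,Condition A
2012-01-01,,p2,e3,c2,Condition B
"""

# Placeholder reference rates; a real benchmark would take these from the
# comparison data source.
REFERENCE = {"Condition A": 0.5, "Condition B": 0.25}

synthetic = condition_prevalence(io.StringIO(SAMPLE), n_patients=4)
report = compare(synthetic, REFERENCE)
for name, syn_rate, ref_rate, diff in report:
    print(f"{name}: synthetic={syn_rate:.2f} reference={ref_rate:.2f} diff={diff:+.2f}")
```

Sorting by the absolute difference puts the conditions whose disease models most need attention at the top of the report.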

jawalonoski commented 3 years ago

> one of our goals this year is going to be to benchmark the synthea data against some of our "real data" sources. Hopefully we can provide feedback to the disease models to improve their characteristics to more closely represent what we find in the wild.

@fdefalco That would be greatly appreciated. That type of feedback is very valuable to the project. One thing we could possibly do with that information would be to create a series of Issues or "contribution requests" each targeted to specific disease model improvements.

fdefalco commented 3 years ago

Agreed. That being said, and to make this more practical, do you have a particular module you would consider a best case for this initial attempt at evaluation?

jawalonoski commented 3 years ago

Knee-jerk reaction: I would say the diabetes or cardiovascular-related models, but I know there are people out there looking at revising some of those today. If you want to peer where no one is looking, maybe oncology. But really, you could pick almost anything. COVID needs a refresh and tune-up. You could also look for obvious holes or gaps in the data. What are the big items that are not being covered, but should be?