synthetichealth / synthea

Synthetic Patient Population Simulator
https://synthetichealth.github.io/synthea
Apache License 2.0

Too many Covid-19 patients. #747

Open yaniv256 opened 4 years ago

yaniv256 commented 4 years ago

I'm running java -jar synthea-with-dependencies.jar -c synthea.properties -p 100 Minnesota and I'm getting 77% of patients with a Suspected COVID-19 and 75% of patients with a COVID-19 condition. Maybe we're missing some probability parameter?

jawalonoski commented 4 years ago

Wear a mask, wash your hands, and don't go out to eat at a restaurant.

The module is not malfunctioning. It is possible that the infection rates are too high. On the other hand, most people in Synthea are getting tested, which is not happening in the real world, so the results might be skewed. Also, not all of those people end up being admitted to the hospital or ICU. As we learn more about the actual results of the pandemic, we will go back and modify the infection rates.

jawalonoski commented 4 years ago

If you want to modify the infection rates, feel free to edit this CSV file: https://github.com/synthetichealth/synthea/blob/master/src/main/resources/modules/lookup_tables/covid19_prob.csv

The time column is a range, where each value is a time in milliseconds since January 1, 1970 (i.e., a standard Java timestamp).
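As a rough illustration of the timestamp format described above, this Python sketch converts calendar dates to epoch milliseconds and back. The helper names are hypothetical, and the "start-end" layout of a range value is an assumption; check the existing rows of covid19_prob.csv before relying on it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(year, month, day):
    """Milliseconds since 1970-01-01T00:00:00Z for midnight UTC on the given date."""
    moment = datetime(year, month, day, tzinfo=timezone.utc)
    return (moment - EPOCH) // timedelta(milliseconds=1)

def from_epoch_millis(millis):
    """Inverse of to_epoch_millis: a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)

def time_range(start, end):
    """Build a range value from two (year, month, day) tuples.

    ASSUMPTION: the range is written as "start-end"; this layout is not
    confirmed by the thread.
    """
    return f"{to_epoch_millis(*start)}-{to_epoch_millis(*end)}"

if __name__ == "__main__":
    print(time_range((2020, 1, 1), (2020, 3, 1)))
    print(from_epoch_millis(1577836800000).isoformat())
```

Working in UTC avoids off-by-hours errors when the notebook host and the intended time zone differ.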

yaniv256 commented 4 years ago

I'm running Synthea on Google Colab (upstream of some deep learning), so I can't really rebuild it every time. Any chance to fix the jar file to produce something more realistic?

jawalonoski commented 4 years ago

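I don't know anything about Google Colab... it looks similar to a hosted version of Jupyter notebooks... but even so, I have no idea of how these suggestions will work or if they are possible in your environment:

  • If you have access to the JAR, you can just replace the covid19_prob.csv file inside the Synthea JAR, either locally or potentially using the jar uf command (see https://docs.oracle.com/javase/tutorial/deployment/jar/update.html).
  • You could also try using a local set of lookup_tables using the --generate.lookup_tables command-line switch. You'll need to provide ALL the lookup tables though or you'll see a lot of exceptions.
    # Lookup Table Folder location
    generate.lookup_tables = modules/lookup_tables/
  • If you are using a hosted version of Synthea that you do not have access to, then there is nothing you can really do.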
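The generate.lookup_tables setting above points Synthea at a folder of lookup tables, and per the thread a partial folder leads to exceptions. One way to get a complete local copy from a notebook is to pull every table out of the JAR, since a JAR is a ZIP archive. This is a minimal Python sketch; the function name is hypothetical, and the modules/lookup_tables/ prefix inside the JAR is an assumption inferred from the property value and the repository path.

```python
import zipfile
from pathlib import Path

def extract_lookup_tables(jar_path, dest_dir, prefix="modules/lookup_tables/"):
    """Copy every file stored under `prefix` in the JAR into `dest_dir`.

    Returns the sorted list of extracted paths, relative to `prefix`.
    ASSUMPTION: the lookup tables live under `prefix` inside the JAR.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(jar_path) as jar:
        for info in jar.infolist():
            if info.is_dir() or not info.filename.startswith(prefix):
                continue
            relative = info.filename[len(prefix):]
            if not relative:
                continue
            target = dest / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(jar.read(info))
            extracted.append(relative)
    return sorted(extracted)
```

After extraction, covid19_prob.csv can be edited in the local folder and that folder passed to Synthea through the generate.lookup_tables setting; the exact command-line syntax should be checked against the Synthea documentation.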

Any chance to fix the jar file to produce something more realistic?

Define realistic. If you'd like to provide different infection statistics, with the proper peer-reviewed citations, we're happy to take a pull request and update the table.

roberthoyt commented 3 years ago

I assume that the COVID-19 dataset consists of 10K patients who tested positive using nasal swab testing, etc. I see that roughly 6000 had antibody testing (SARS-CoV-2 RNA) with many converting from positive to negative. On further thought, I can't use the dataset to predict who developed COVID versus those who did not. Great dataset nevertheless and could be used for descriptive statistics.

milandpalmer commented 3 years ago

I don't know anything about Google Colab... it looks similar to a hosted version of Jupyter notebooks... but even so, I have no idea of how these suggestions will work or if they are possible in your environment:

  • If you have access to the JAR, you can just replace the covid19_prob.csv file inside the Synthea JAR, either locally or potentially using the jar uf command (see https://docs.oracle.com/javase/tutorial/deployment/jar/update.html).
  • You could also try using a local set of lookup_tables using the --generate.lookup_tables command-line switch. You'll need to provide ALL the lookup tables though or you'll see a lot of exceptions.
    # Lookup Table Folder location
    generate.lookup_tables = modules/lookup_tables/
  • If you are using a hosted version of Synthea that you do not have access to, then there is nothing you can really do.
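For the first suggestion in the list above, an environment without the jar tool (a notebook, for example) could swap the file using Python's standard zipfile module, because a JAR is a ZIP archive. The sketch below is one possible approach, not the project's recommended procedure. The function name is hypothetical, and the entry path modules/lookup_tables/covid19_prob.csv used in the usage note is an assumption based on the repository layout.

```python
import os
import tempfile
import zipfile

def replace_jar_entry(jar_path, entry_name, new_bytes):
    """Rewrite the archive at `jar_path` so `entry_name` holds `new_bytes`.

    ZIP entries cannot be overwritten in place, so every other entry is
    copied into a temporary archive that then replaces the original.
    Raises KeyError if `entry_name` is not present.
    """
    folder = os.path.dirname(os.path.abspath(jar_path))
    fd, tmp_path = tempfile.mkstemp(suffix=".jar", dir=folder)
    os.close(fd)
    replaced = False
    try:
        with zipfile.ZipFile(jar_path) as src, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename == entry_name:
                    dst.writestr(entry_name, new_bytes)
                    replaced = True
                else:
                    dst.writestr(info, src.read(info.filename))
        if not replaced:
            raise KeyError(entry_name)
        os.replace(tmp_path, jar_path)
    finally:
        # The temp file only remains if something failed before os.replace.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A call might look like replace_jar_entry("synthea-with-dependencies.jar", "modules/lookup_tables/covid19_prob.csv", open("covid19_prob.csv", "rb").read()). Keep a backup of the original JAR, since the rewrite replaces it.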

Any chance to fix the jar file to produce something more realistic?

Define realistic. If you'd like to provide different infection statistics, with the proper peer-reviewed citations, we're happy to take a pull request and update the table.

Exactly what kind of statistics are you looking for? I would be happy to do some research, I'm sure this data is available now. It would be very helpful if the output from this module was more accurate.

jawalonoski commented 3 years ago

See https://github.com/synthetichealth/synthea/blob/master/src/main/resources/modules/lookup_tables/covid19_prob.csv

This table models the probability of infection over time. We could also add another column, such as State, to represent different geographies.
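To make the idea of an adjustable infection rate concrete, here is a minimal Python sketch that scales one probability column of such a table by a constant factor. The column names in the example are hypothetical placeholders, not the real headers of covid19_prob.csv. The sketch also assumes that each row holds exactly two complementary probabilities that sum to 1; verify both points against the actual file.

```python
import csv
import io

def scale_infection_probability(csv_text, factor, infect_col, other_col):
    """Return CSV text with `infect_col` multiplied by `factor`, clamped to [0, 1].

    ASSUMPTION: `other_col` is the complementary probability, so it is reset
    to 1 minus the new infection probability to keep each row summing to 1.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        p = min(1.0, max(0.0, float(row[infect_col]) * factor))
        row[infect_col] = format(p, ".6g")
        row[other_col] = format(1.0 - p, ".6g")
        writer.writerow(row)
    return out.getvalue()

if __name__ == "__main__":
    # Hypothetical headers and values, for illustration only.
    sample = "time,infect,no_infect\n0-100,0.02,0.98\n100-200,0.5,0.5\n"
    print(scale_infection_probability(sample, 0.5, "infect", "no_infect"))
```

A factor below 1 moves the output toward a population average, while a large factor (clamped at probability 1) yields a mostly positive cohort.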

awatson1978 commented 3 years ago

Also, it's difficult to use the module for developing apps if Covid infection rates are hard-coded at a low number. Gotta find the needles in the haystack. But having an adjustable input at the beginning of the pipeline would be nice. Sometimes we want to generate a population average, sometimes we want to generate a positive cohort, sometimes we want to generate a single patient and model disease progression.

jawalonoski commented 3 years ago

It depends on what your definition of "hard-coded" is. It isn't compiled into the code, it is listed in a configuration file. See the link in the previous comment on October 9th.