synthetichealth / synthea

Synthetic Patient Population Simulator
https://synthetichealth.github.io/synthea
Apache License 2.0

CSV file for patients.csv has extra column (between #1358

Closed thondeboer closed 1 year ago

thondeboer commented 1 year ago

What happened?

I ran run_synthea with default options to create a test patient, and the patient CSV file contained this row with an extra value:

d12cae9-54b6-a5b1-4481-b49ae3b31e1a,1984-12-30,,999-56-7435,S99992971,X66900476X,Mr.,César846,Bernal586,,,M,white,hispanic,M,Ponce  Puerto Rico  PR,936 Cummings Burg Apt 64,Haverhill,Massachusetts,Essex County,25009,01835,42.769551061336124,-71.04918584824105,9201.74,732602.04,9388

There is an extra value, "25009", between the county and the ZIP code. This code does not appear in either the FHIR or the CCDA data.

On a different run with a different seed, the field was empty but still present, so the extra column seems reproducible.
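To make the position of the extra value easy to see, here is a minimal Python sketch that parses the row quoted above and prints the fields around the county. It is only an illustration, not part of Synthea: the `split_row` helper is a hypothetical name, and the row is copied verbatim from this report.

```python
import csv
import io

# Row copied verbatim from the report above (one line of patients.csv).
ROW = (
    "d12cae9-54b6-a5b1-4481-b49ae3b31e1a,1984-12-30,,999-56-7435,S99992971,"
    "X66900476X,Mr.,César846,Bernal586,,,M,white,hispanic,M,"
    "Ponce  Puerto Rico  PR,936 Cummings Burg Apt 64,Haverhill,Massachusetts,"
    "Essex County,25009,01835,42.769551061336124,-71.04918584824105,"
    "9201.74,732602.04,9388"
)


def split_row(line):
    """Parse a single CSV line into a list of fields (hypothetical helper)."""
    return next(csv.reader(io.StringIO(line)))


fields = split_row(ROW)
county_index = fields.index("Essex County")

# Print the county and the two fields that follow it, with 0-based positions.
for i in range(county_index, county_index + 3):
    print(i, repr(fields[i]))
# 19 'Essex County'
# 20 '25009'
# 21 '01835'
```

The row has 27 fields in total, and the unexpected value sits at 0-based position 20, directly between the county and the ZIP code.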

Environment

- OS: Linux 22.04
- Java: openjdk version "11.0.20" 2023-07-18

Relevant log output

I made some small changes to the config file and am using the dev version, built locally:

# Starting with a properties file because it requires no additional dependencies

exporter.baseDirectory = ./output/
exporter.use_uuid_filenames = false
exporter.subfolders_by_id_substring = false
# exporters that use XML or JSON can enable or disable 'pretty printing'
exporter.pretty_print = true
# number of years of history to keep in exported records, anything older than this may be filtered out
# set years_of_history = 0 to skip filtering altogether and keep the entire history
exporter.years_of_history = 10
# split records allows patients to have one record per provider organization
exporter.split_records = false
exporter.split_records.duplicate_data = false
exporter.metadata.export = true
exporter.ccda.export = true
exporter.fhir.export = true
exporter.fhir_stu3.export = false
exporter.fhir_dstu2.export = false
exporter.fhir.use_shr_extensions = false
exporter.fhir.use_us_core_ig = true
exporter.fhir.us_core_version = 5.0.1
exporter.fhir.transaction_bundle = true
# using bulk_data=true will ignore exporter.pretty_print
exporter.fhir.bulk_data = false
# included_ and excluded_resources list out the resource types to include/exclude in the csv exporters.
# only one of these may be set at a time, if both are set then both will be ignored.
# if neither is set, then all resource types will be included.
# note the Patient and Encounter resources will always be included, even if specifically listed as excluded here
exporter.fhir.included_resources =
exporter.fhir.excluded_resources =
exporter.groups.fhir.export = false
exporter.hospital.fhir.export = true
exporter.hospital.fhir_stu3.export = false
exporter.hospital.fhir_dstu2.export = false
exporter.practitioner.fhir.export = true
exporter.practitioner.fhir_stu3.export = false
exporter.practitioner.fhir_dstu2.export = false
exporter.encoding = UTF-8
exporter.json.export = false
exporter.json.include_module_history = false
exporter.csv.export = true
# if exporter.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.csv.append_mode = true
# if exporter.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.csv.folder_per_run = false
# included_files and excluded_files list out the files to include/exclude in the csv exporter
# only one of these may be set at a time, if both are set then both will be ignored
# if neither is set, then all files will be included
# see list of files at: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary
# include filenames separated with a comma, ex: patients.csv,procedures.csv,medications.csv
# NOTE: the csv exporter does not actively delete files, so if Run 1 you included a file, then Run 2 you exclude that file, the version from Run 1 will still be present
exporter.csv.included_files =
exporter.csv.excluded_files = patient_expenses.csv

exporter.cpcds.export = false
exporter.cpcds.append_mode = false
exporter.cpcds.folder_per_run = false
exporter.cpcds.single_payer = false

exporter.bfd.export = false
exporter.bfd.require_code_maps = true
exporter.bfd.export_missing_codes = true
exporter.bfd.bene_id_start = -1000000
exporter.bfd.clm_id_start = -100000000
exporter.bfd.clm_grp_id_start = -100000000
exporter.bfd.pde_id_start = -100000000
exporter.bfd.fi_doc_cntl_num_start = -100000000
exporter.bfd.carr_clm_cntl_num_start = -100000000
exporter.bfd.mbi_start = 1S00-E00-AA00
exporter.bfd.hicn_start = T01000000A
exporter.bfd.partc_contract_start = Y0001
exporter.bfd.partc_contract_count = 10
exporter.bfd.plan_benefit_package_start = 800
exporter.bfd.plan_benefit_package_count = 5
exporter.bfd.partd_contract_start = Z0001
exporter.bfd.partd_contract_count = 10
exporter.bfd.clia_labs_start = 00A0000000
exporter.bfd.clia_labs_count = 10
exporter.bfd.cutoff_date=20140529

exporter.cdw.export = false
exporter.text.export = false
exporter.text.per_encounter_export = false
exporter.clinical_note.export = false

# parameters for symptoms export
exporter.symptoms.csv.export = false
# selection mode of conditions or symptom export: 0 = conditions according to  exporter.years_of_history. other values = all conditions (entire history)
exporter.symptoms.mode = 0
# if exporter.symptoms.csv.append_mode = true, then each run will add new data to any existing CSVs. if false, each run will clear out the files and start fresh
exporter.symptoms.csv.append_mode = false
# if exporter.symptoms.csv.folder_per_run = true, then each run will have CSVs placed into a unique subfolder. if false, each run will only use the top-level csv folder
exporter.symptoms.csv.folder_per_run = false
exporter.symptoms.text.export = false

# enable searching for custom exporter implementations
exporter.enable_custom_exporters = true

# the number of patients to generate, by default
# this can be overridden by passing a different value to the Generator constructor
generate.default_population = 1

# the number of threads to use for the generator, set the value to -1 to match the number of
# available processors (as per Runtime.getRuntime().availableProcessors())
# defaults to -1 if not specified
generate.thread_pool_size = -1

generate.log_patients.detail = simple
# options are "none", "simple", or "detailed" (without quotes). defaults to simple if another value is used
# none = print nothing to the console during generation
# simple = print patient names once they are generated.
# detailed = print patient names, attributes, vital signs, etc. May slow down processing

generate.timestep = 604800000
# time is in ms
# 1000 * 60 * 60 * 24 * 7 = 604800000

# default demographics is every city in the US
generate.demographics.default_file = geography/demographics.csv
generate.geography.zipcodes.default_file = geography/zipcodes.csv
generate.geography.country_code = US
generate.geography.timezones.default_file = geography/timezones.csv
generate.geography.foreign.birthplace.default_file = geography/foreign_birthplace.json
generate.geography.sdoh.default_file = geography/sdoh.csv

# Lookup Table Folder location
generate.lookup_tables = modules/lookup_tables/

# Set to true if you want every patient to be dead.
generate.only_dead_patients = false
# Set to true if you want every patient to be alive.
generate.only_alive_patients = false
# If both only_dead_patients and only_alive_patients are set to true,
# they will both default back to false

# if criteria are provided, (for example, only_dead_patients, only_alive_patients, or a "patient keep module" with -k flag)
# this is the maximum number of times synthea will loop over a single slot attempting to produce a matching patient.
# after this many failed attempts, it will throw an exception.
# set this to 0 to allow for unlimited attempts (but be aware of the possibility that it will never complete!)
generate.max_attempts_to_keep_patient = 1000

# if true, tracks and prints out details of transition tables for each module upon completion
# note that this may significantly slow down processing, and is intended primarily for debugging
generate.track_detailed_transition_metrics = false

# If true, person names have numbers appended to them to make them more obviously fake
generate.append_numbers_to_person_names = true

# Probability of each person having a middle name. 0 is zero, 1.0 is 100% chance.
generate.middle_names = 0.80

# if true, the entire population will use veteran prevalence data
generate.veteran_population_override = false

# these should add up to 1.0
# weighting and categories are inspired by the following but there are no specific hard numbers to point to
# http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1694190/pdf/amjph00543-0042.pdf
# http://www.ncbi.nlm.nih.gov/pubmed/8122813
generate.demographics.socioeconomic.weights.income = 0.2
generate.demographics.socioeconomic.weights.education = 0.7
generate.demographics.socioeconomic.weights.occupation = 0.1

generate.demographics.socioeconomic.score.low = 0.0
generate.demographics.socioeconomic.score.middle = 0.25
generate.demographics.socioeconomic.score.high = 0.66

generate.demographics.socioeconomic.education.less_than_hs.min = 0.0
generate.demographics.socioeconomic.education.less_than_hs.max = 0.5
generate.demographics.socioeconomic.education.hs_degree.min = 0.1
generate.demographics.socioeconomic.education.hs_degree.max = 0.75
generate.demographics.socioeconomic.education.some_college.min = 0.3
generate.demographics.socioeconomic.education.some_college.max = 0.85
generate.demographics.socioeconomic.education.bs_degree.min = 0.5
generate.demographics.socioeconomic.education.bs_degree.max = 1.0

# The average family size in the US is 3.13. The 2010 FPL for a 3-person household is $18310. Tuned it to $17550 for realistic medicaid/ACA enrollments.
generate.demographics.socioeconomic.income.poverty = 17550
generate.demographics.socioeconomic.income.high = 75000

generate.birthweights.default_file = birthweights.csv
generate.birthweights.logging = false

# in Massachusetts, the individual insurance mandate became law in 2006
# in the US, the Affordable Care Act became law in 2010,
# and individual and employer mandates took effect in 2014.
# mandate.year will determine when individuals with an occupation score above mandate.occupation
# receive employer mandated insurance (aka "private" insurance).
# prior to mandate.year, anyone with income greater than the annual cost of an insurance plan
# will purchase the insurance.
generate.insurance.mandate.year = 2006
generate.insurance.mandate.occupation = 0.2

# Defines what percent of insurance premiums are covered by employers, when employer-covered.
# According to [https://www.kff.org/report-section/ehbs-2021-summary-of-findings/],
# the average employee premium contribution is 0.17 and employers pay 0.83.
generate.insurance.employer_coverage = 0.83

# Default Costs, to be used for pricing something that we don't have a specific price for
# -- $500 for procedures is completely invented
generate.costs.default_procedure_cost = 500.00
# -- $255 for medications - also invented
generate.costs.default_medication_cost = 255.00
# -- Encounters billed using avg prices from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096340/
# -- Adjustments for initial or subsequent hospital visit and level/complexity/time of encounter
# -- not included. Assume initial, low complexity encounter (Tables 4 & 6)
generate.costs.default_encounter_cost = 125.00
# -- https://www.nytimes.com/2014/07/03/health/Vaccine-Costs-Soaring-Paying-Till-It-Hurts.html
# -- currently all vaccines cost $136.
generate.costs.default_immunization_cost = 136.00
generate.costs.default_lab_cost = 100.00
# -- assumes device costs are included in procedure cost, if not add to costs/devices.csv
generate.costs.default_device_cost = 0.00
# -- assumes supply costs are included in procedure cost, if not add to costs/supplies.csv
generate.costs.default_supply_cost = 0.00

# Providers
generate.providers.hospitals.default_file = providers/hospitals.csv
generate.providers.longterm.default_file = providers/longterm.csv
generate.providers.nursing.default_file = providers/nursing.csv
generate.providers.rehab.default_file = providers/rehab.csv
generate.providers.hospice.default_file = providers/hospice.csv
generate.providers.dialysis.default_file = providers/dialysis.csv
generate.providers.homehealth.default_file = providers/home_health_agencies.csv
generate.providers.veterans.default_file = providers/va_facilities.csv
generate.providers.urgentcare.default_file = providers/urgent_care_facilities.csv
generate.providers.primarycare.default_file = providers/primary_care_facilities.csv
generate.providers.ihs.hospitals.default_file = providers/ihs_facilities.csv
generate.providers.ihs.primarycare.default_file = providers/ihs_centers.csv

# Provider selection behavior
# How patients select a provider organization:
#  nearest - select the closest provider. See generate.providers.maximum_search_distance
#  random  - select randomly.
#  network - select a random provider in your insurance network. same as random except it changes every time the patient switches insurance provider.
#  medicare - select the nearest provider that can bill Medicare. If no Medicare provider is found, it defaults back to "nearest".
generate.providers.selection_behavior = nearest

# if a provider cannot be found for a certain type of service,
# this will default to the nearest hospital.
generate.providers.default_to_hospital_on_failure = true

# minimum number of providers linked per patient
# if this number is not met it re-runs the simulation
generate.providers.minimum = 1

# maximum distance to look for a provider for a given patient, in km
# set to 10 degrees lat/lon to support the model that veterans only seek care at VA facilities
generate.providers.maximum_search_distance = 1000

# Payers
generate.payers.insurance_companies.default_file = payers/insurance_companies.csv
generate.payers.insurance_plans.default_file = payers/insurance_plans.csv
generate.payers.insurance_plans.eligibilities_file = payers/insurance_eligibilities.csv
generate.payers.insurance_companies.medicare = Medicare
generate.payers.insurance_companies.medicaid = Medicaid
generate.payers.insurance_companies.dual_eligible = Dual Eligible
# The percentage of a person's income that they are willing to spend on health insurance premiums.
generate.payers.insurance_plans.income_premium_ratio = 0.034
# The chance of rejection
# Plan selection behavior
# How patients select a plan:
#  best_rates - select plans with best rates for person's existing conditions and medical needs
#  random  - select plans randomly.
#  priority  - select plans based on the priority level defined in the insurance plans file.
generate.payers.selection_behavior = priority

# Payer adjustment behavior
# How payers adjust claims:
#  none - the payer reimburses each claim by the full amount.
#  fixed - the payer adjusts each claim by a fixed rate (set by adjustment_rate)
#  random  - the payer adjusts each claim by a random rate (between zero and adjustment_rate).
generate.payers.adjustment_behavior = none
# Payer adjustment rate should be between zero and one (0.00 - 1.00), where 0.05 is 5%.
generate.payers.adjustment_rate = 0.10

# Experimental feature. Patients will miss care if true, but side-effects of missing that care
# are not handled. Additionally, the path the disease module might take may no longer make sense.
# It might assume things occurred that haven't actually happened. Use with care.
generate.payers.loss_of_care = false

# Add a FHIR terminology service URL to enable the use of ValueSet URIs within code definitions.
# generate.terminology_service_url = https://r4.ontoserver.csiro.au/fhir

# Quit Smoking
lifecycle.quit_smoking.baseline = 0.01
lifecycle.quit_smoking.timestep_delta = -0.01
lifecycle.quit_smoking.smoking_duration_factor_per_year = 1.0

# Quit Alcoholism
lifecycle.quit_alcoholism.baseline = 0.001
lifecycle.quit_alcoholism.timestep_delta = -0.001
lifecycle.quit_alcoholism.alcoholism_duration_factor_per_year = 1.0

# Adherence
lifecycle.adherence.baseline = 0.05

# set this to true to enable randomized "death by natural causes"
# highly recommended if "only_dead_patients" is true
lifecycle.death_by_natural_causes = false

# set this to enable "death by loss of care" or missed care,
# e.g. not covered by insurance or otherwise unaffordable.
# only functional if "generate.payers.loss_of_care" is also true.
lifecycle.death_by_loss_of_care = false

# Use physiology simulations to generate some VitalSigns
physiology.generators.enabled = false

# Allow physiology module states to be executed
# If false, all Physiology state objects will immediately redirect to the state defined in
# the alt_direct_transition field
physiology.state.enabled = false

# set to true to introduce errors in height, weight and BMI observations for people
# under 20 years old
growtherrors = false

thondeboer commented 1 year ago

I found a similar extra column in PAYERS (column 5 seems extra) and PROVIDERS (the last column seems extra).

jawalonoski commented 1 year ago

Thank you for the report... we'll look into it.

Do you still have the console output?

I'm interested in the end of that output... For example,

Running with options:
Population: 5
Seed: 1693320322058
Provider Seed:1693320322058
Reference Time: 1693320322058
Location: Massachusetts
Min Age: 0
Max Age: 140
3 -- Ezra452 Ferry570 (18 y/o M) Concord, Massachusetts  (25625)
1 -- Johanne551 Gislason620 (24 y/o F) Randolph, Massachusetts  (34253)
2 -- Marta91 Villareal516 (25 y/o F) Springfield, Massachusetts  (36679)
4 -- Stefany238 Connelly992 (27 y/o F) Boston, Massachusetts  (38650)
5 -- Dolores502 Alejandra902 Alanis890 (62 y/o F) Westborough, Massachusetts DECEASED (120079)
5 -- Natalia964 Esperanza675 Padilla483 (68 y/o F) Westborough, Massachusetts  (96113)
Records: total=6, alive=5, dead=1
RNG=5
Clinician RNG=5643

jawalonoski commented 1 year ago

Or, the relevant ./output/metadata/*.json file.

thondeboer commented 1 year ago

I have this one for the retry:

Running with options:
Population: 1
Seed: 101
Provider Seed:1693332487663
Reference Time: 1693332487663
Location: Massachusetts
Min Age: 0
Max Age: 140
1 -- Garrett899 VonRueden376 (57 y/o M) Norfolk, Massachusetts  (84735)
Records: total=1, alive=1, dead=0
RNG=1
Clinician RNG=5644

with the CSV file being this:

45746e91-d759-c73c-b033-56587d814cca,1966-05-04,,999-60-4143,S99962688,X73514809X,Mr.,Garrett899,VonRueden376,,,M,white,nonhispanic,M,Franklin  Massachusetts  US,419 Rogahn Alley,Norfolk,Massachusetts,Norfolk County,,00000,42.1328679377805,-71.29048599718199,55929.45,451474.93,749961

In that case, the ZIP code is missing completely, and it is not the value shown in the console output (84735).

thondeboer commented 1 year ago

And this is the metadata file

{
  "runID": "984705fd-dc7f-4ee8-99da-65a9a95da183",
  "seed": 101,
  "clinicianSeed": 1693332487663,
  "referenceTime": "20230829",
  "endTime": "20230829",
  "version": "master-branch-latest\n",
  "patientCount": 1,
  "providerCount": 1327,
  "payerCount": 9,
  "javaVersion": "11.0.20",
  "generate.thread_pool_size": -1,
  "generatorThreads": 20,
  "runStartTime": "2023-08-29T18:08:07Z",
  "runTimeInSeconds": 9,
  "exporter.years_of_history": "10",
  "state": "Massachusetts",
  "modules": "*"
}

jawalonoski commented 1 year ago

This is another case of the data dictionary being wrong.

The extra field is the FIPS county code, e.g., https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt

I updated the data dictionary to reflect the field.
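For readers unfamiliar with the format: a five-digit county FIPS code is a two-digit state code followed by a three-digit county code, so the "25009" in the first row reads as state 25 (Massachusetts) and county 009. A minimal Python sketch of that split follows; the function name `split_county_fips` is hypothetical and is not part of Synthea.

```python
def split_county_fips(code):
    """Split a 5-digit county FIPS code into (state_part, county_part).

    Hypothetical helper for illustration only.
    """
    if len(code) != 5 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 5-digit county FIPS code, got {code!r}")
    return code[:2], code[2:]


state_part, county_part = split_county_fips("25009")
print(state_part, county_part)  # 25 009
```

The code is kept as a string so that leading zeros survive. An empty value, like the one in the second row reported above, is rejected with a `ValueError` rather than silently accepted.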