Runs with same seed are not producing identical results

KevinCranmer commented 1 year ago

What happened?

Reading the wiki it looks like "Populations generated with the same seed and the same version of Synthea should be identical".

But I've noticed that this is not the case. I'm consistently seeing the same 3 types of differences across multiple runs on the same Synthea version.

Array ordering: https://www.diffchecker.com/7pX3yd1w/
- not a big issue for my use case
DocumentReference data: https://www.diffchecker.com/i3UikbSa/
- more of an issue. The key difference, when decoded, is one line saying "Patient currently has Aetna." vs "Patient currently has Humana."
Actual IDs changing: https://www.diffchecker.com/OFqRg6FW/
- big issue for my use case as I can no longer trust IDs are the same across multiple runs.

Command I'm using: ./run_synthea -s 1668660039331 -cs 1668660039331 -r 20230101 -p 83 Florida

Environment

- OS: macOS 12.6
- Java: openjdk version 16.0.2

Relevant log output

No response

jawalonoski commented 1 year ago

This is a bug that we'll look at as a high priority.

The DocumentReference and IDs will likely be fixable.

The Array ordering is probably not something we can fix -- because the FHIR JSON export is done by an underlying dependency -- but we'll take a look.

dehall commented 1 year ago

@KevinCranmer How frequently are you seeing resource IDs change between runs? I can consistently reproduce the first two issues (array ordering and insurance plan) and I'm working on fixes, but across a dozen tests I haven't seen any resource IDs differ across runs. (I have found a separate issue, though: ImagingStudy DICOM UIDs occasionally differ.)

I'm wondering if I'm missing something or if this is just an exceptionally rare case we need to chase down.

KevinCranmer commented 1 year ago

@dehall It seems like the IDs changing only happens if I re-clone Synthea and run the same command. I'm cloning the same commit each time, so the Synthea versions would be the same.

But this would mean that if two machines both cloned and ran Synthea, they would see different IDs.

Do you see this ID difference if you re-clone?

dehall commented 1 year ago

Hmm, I'm still not seeing it. The sequence of commands I ran is below. Are you running two fresh copies now and seeing different IDs, or are you running one fresh copy and comparing the output to an existing dataset you have? If anything changed in between, like any modifications to the code, that would produce different results. I'm very confused, though, because the clinical content of all the records you sent is the same (minus the insurance bit), and if the code had changed then the IDs should mismatch on all of your test records, not just sometimes. Your first diff has expired, but your diff 2 shows an instance where the IDs do match.

rm -rf synthea_fresh_*

for s in 1 2 3
do
  echo running $s
  git clone https://github.com/synthetichealth/synthea synthea_fresh_${s}
  cd synthea_fresh_${s}
  git checkout f26777c4
  ./run_synthea -s 1668660039331 -cs 1668660039331 -r 20230101 -p 83 Florida > log_${s}.txt
  cd ..
done

for path in synthea_fresh_1/output/fhir/*; do file=$(basename "$path"); echo "$file"; diff "$path" "synthea_fresh_2/output/fhir/$file"; done | less
for path in synthea_fresh_1/output/fhir/*; do file=$(basename "$path"); echo "$file"; diff "$path" "synthea_fresh_3/output/fhir/$file"; done | less

KevinCranmer commented 1 year ago

I haven't been able to reproduce it today either. My hunch is that it becomes more likely the longer the gap between runs. Compared with yesterday's runs the IDs have changed, but all of today's runs have matching IDs. I'm pretty confident I haven't changed anything.

This is what I've been running:

runSynthea() {
  chmod +x bin/clone-repo
  bin/clone-repo synthetichealth synthea "$WORKSPACE"
  cd "$WORKSPACE/synthea"
  git checkout f26777c403bdc6400049f10a76e6979ec3244ac2
  sed -i '' 's/exporter\.fhir\.included_resources =$/exporter.fhir.included_resources = AllergyIntolerance,Condition,Device,DiagnosticReport,DocumentReference,Encounter,Immunization,Location,Medication,MedicationRequest,Observation,Organization,Patient,Practitioner,PractitionerRole,Procedure/' src/main/resources/synthea.properties
  ./run_synthea -s 1668660039331 -cs 1668660039331 -r 20230101 -p 83 Florida
}

dehall commented 1 year ago

Thanks that's helpful info. I also noticed this morning in your example of mismatching IDs https://www.diffchecker.com/OFqRg6FW/ , the IDs are not completely different, they are offset by 13: the first Encounter in the file on the right has the same ID as the Observation 13 resources down in the left file "e1239afc-e199-39f6-1bb1-7488c675e51c". Then as you go down resource-by-resource the IDs do line up (they are on different resources obviously but the sequence of IDs as you work through the resources is the same).

That makes sense given how the random number generator works to pick IDs, but it means somehow the patient on the right had their random number generator called 13 times that the patient on the left didn't, in a spot that didn't affect any of the clinical content on the record.
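The offset described above can be reproduced in isolation. The following is a minimal sketch, not Synthea's actual ID code: the class SeedOffsetSketch and the idea of building each ID from two nextLong() calls are assumptions made for illustration. It shows that when one run consumes 13 IDs' worth of extra random numbers, its ID sequence is the same sequence shifted by 13.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

final class SeedOffsetSketch {
    // Hypothetical ID scheme: each ID consumes two longs from a seeded generator.
    static UUID nextId(Random rng) {
        return new UUID(rng.nextLong(), rng.nextLong());
    }

    // Draw `count` IDs after first consuming `extraIds` IDs' worth of random numbers.
    static List<UUID> ids(long seed, int extraIds, int count) {
        Random rng = new Random(seed);
        for (int i = 0; i < extraIds; i++) {
            nextId(rng); // stands in for extra simulation work that uses the generator
        }
        List<UUID> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(nextId(rng));
        }
        return out;
    }

    public static void main(String[] args) {
        List<UUID> left = ids(1668660039331L, 0, 20);
        List<UUID> right = ids(1668660039331L, 13, 7);
        // The right run's first ID is the left run's 14th ID (index 13),
        // and the two sequences line up from there on.
        System.out.println(right.get(0).equals(left.get(13)));
        System.out.println(right.equals(left.subList(13, 20)));
    }
}
```

Both checks print true, which matches the pattern in the diff: the IDs are not unrelated, they are the same stream attached to different resources.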

Is there any postprocessing done to these records? For example is it possible you previously filtered out certain resource types in a separate step and have switched to using the exporter.fhir.included_resources config setting?

KevinCranmer commented 1 year ago

We do post-processing to convert the resources into our database versions of the resources, filtering out some records. However, that's all done after Synthea has run, so we aren't tweaking Synthea's output. We have always used the exporter.fhir.included_resources config setting.

KevinCranmer commented 1 year ago

I ran Synthea twice, changed my computer's date and time to 2 days from now (Sat Jan 14th), and ran Synthea a third time. I'm seeing the ID difference only on the third run.

My diff from your above command is now mostly the ID difference. Here's a chunk:

...
Carlos172_Champlin946_750c53de-9827-2ffc-9fba-19f928cafb90.json
165c165
<     "fullUrl": "urn:uuid:b1411685-1aaa-b499-b4fd-addaea86fb9f",
---
>     "fullUrl": "urn:uuid:6346d93f-83da-de2f-8085-022b77a56a1f",
168c168
<       "id": "b1411685-1aaa-b499-b4fd-addaea86fb9f",
---
>       "id": "6346d93f-83da-de2f-8085-022b77a56a1f",
175c175
<         "value": "b1411685-1aaa-b499-b4fd-addaea86fb9f"
---
>         "value": "6346d93f-83da-de2f-8085-022b77a56a1f"
232c232
<     "fullUrl": "urn:uuid:417bb0a7-2fbc-05c9-2ce9-47305fd9798c",
---
>     "fullUrl": "urn:uuid:5c6b5c8e-dcaf-29ff-e0fa-c63865f924b1",
235c235
<       "id": "417bb0a7-2fbc-05c9-2ce9-47305fd9798c",
---
>       "id": "5c6b5c8e-dcaf-29ff-e0fa-c63865f924b1",
270c270
<         "reference": "urn:uuid:b1411685-1aaa-b499-b4fd-addaea86fb9f"
---
>         "reference": "urn:uuid:6346d93f-83da-de2f-8085-022b77a56a1f"
...

dehall commented 1 year ago

Thanks for the extra testing -- that's really interesting that the date has an effect on it, but that could explain why I haven't been able to replicate it by running multiple times in short succession. Maybe our "reference date" logic isn't as consistent as we expected and something from the current date/time sneaks in. I'll give that a shot as well

KevinCranmer commented 1 year ago

You had mentioned that the ID differences mean the randomNumberGenerator was being called more in one run than another.

I had read this response about how Synthea re-creates deceased patients so that the population size matches the value passed to -p (83 in my case).

I'm wondering if my third run created one or two patients who were deceased on Jan 14th but alive on Jan 12th, so Synthea had to re-create these (now) deceased patients, whereas before the patients were alive and only created once. This could explain the extra randomNumberGenerator calls.

Edit: I took a look at the Synthea output from the three runs and they all only had 14 people listed as DECEASED and I would've expected 15/16 on the third run. So perhaps this is not the issue.

dehall commented 1 year ago

Ok, that last clue of the date being relevant solved it. I had thought the Reference Date config setting was also when the simulation ends, but no, those are separate settings. So even though you specify the reference date, the end date is set to "today" by default, and if you run the same set of patients a week later the simulation runs a week longer. (I'm a little surprised you didn't get any records that have additional data as a result.) The simulation running a little longer means the random number generator gets called a few more times for certain patients, resulting in the IDs being different as we saw. The "quick fix" for now is to explicitly specify an end date in your script with the -e flag, e.g.:

./run_synthea -s 1668660039331 -cs 1668660039331 -r 20230101 -e 20230112 -p 83 Florida

Feel free to change 20230112 to another YYYYMMDD of your choice. (You can also confirm the issue as you saw it by leaving your system date as-is and changing the Synthea end date.)
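To illustrate why a floating end date changes IDs even when the seed is fixed, here is a rough sketch under stated assumptions. The class EndDateSketch, the daily time step, and the single draw per step are all hypothetical and are not taken from Synthea's code. The point is only that a later end date means more draws before export, so the value used for the first ID changes.

```java
import java.time.LocalDate;
import java.util.Random;

final class EndDateSketch {
    // Hypothetical simulation: one random draw per daily time step,
    // followed by one draw that stands in for the first ID assigned at export.
    static long exportDraw(long seed, LocalDate start, LocalDate end) {
        Random rng = new Random(seed);
        for (LocalDate t = start; t.isBefore(end); t = t.plusDays(1)) {
            rng.nextDouble(); // per-step decision, e.g. "does an event happen today?"
        }
        return rng.nextLong();
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2023, 1, 1);
        long pinned1 = exportDraw(1668660039331L, start, LocalDate.of(2023, 1, 12));
        long pinned2 = exportDraw(1668660039331L, start, LocalDate.of(2023, 1, 12));
        long later = exportDraw(1668660039331L, start, LocalDate.of(2023, 1, 14));
        System.out.println(pinned1 == pinned2); // same end date: same draw
        System.out.println(pinned1 == later);   // later end date: a different draw
    }
}
```

Pinning the end date, as the -e flag does in the command above, keeps the number of steps constant no matter which day the run happens.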

More broadly I want to do the following:

- Document all of this on a wiki page specifically about recreating a consistent population, what flags need to be set and what the expectations are.
- Change the random number generation to ensure consistent UUIDs in the exporter. If someone regenerates the same population with a later end date, you want the same resources to have the same UUIDs, and also you want the simulation to continue and create additional new data. This may be tougher than it sounds so I'll need to dig in a little more before I can give a timeframe for that.

KevinCranmer commented 1 year ago

Awesome! Thank you so much for looking into this quickly. It is much appreciated!

dehall commented 1 year ago

@KevinCranmer just confirming we haven't forgotten about this - I have a PR up on branch repro_fixes that I'm hoping we can get reviewed and merged this week. If you're able to test it out and confirm it fixes your issues early that would be helpful, but no worries if not possible.
Note that the fix changes a lot of random number rolls, so a set of patients generated with the new version will look completely different to any set of patients generated with your previous version. But going forward all runs generated with the new version and the same settings should be identical as expected.

KevinCranmer commented 1 year ago

@dehall Thanks for working on this so quickly. I just tested and saw that all my original reproducibility issues are no longer appearing, even without the -e flag.

Something I've recently noticed is that I'm getting different results on different machines. For example, running my original command on your new branch on my work machine, I get a file Josephine273_Steuber698_ebf4548f-6837-e078-ff78-f1e829cff265.json but on my personal machine (both Macs), instead, I get Josephine273_Justine412_Kerluke267_ebf4548f-6837-e078-ff78-f1e829cff265.json

The files have the same UUID, but a different name and vastly different contents: https://www.diffchecker.com/MA6TcaPh/

Many files have differences like this. Is the randomness machine dependent?

dehall commented 1 year ago

The short answer is it looks like you're running on 2 different versions here - you can see the commit hash that each was run on in the Patient.text field

The longer answer is that we don't expect there to be differences across machines. We use a seeded version of the java.util.Random class, whose docs include the note below. So in general the assumption should be: same command and same version of Synthea means same results, even on a different system.

If two instances of Random are created with the same seed, and the same sequence of method calls is made for each, they will generate and return identical sequences of numbers. In order to guarantee this property, particular algorithms are specified for the class Random. Java implementations must use all the algorithms shown here for the class Random, for the sake of absolute portability of Java code.
https://docs.oracle.com/en/java/javase/18/docs/api/java.base/java/util/Random.html
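The guarantee quoted above is easy to check directly. This is a small sketch (the class name SameSeedSketch and the helper draws are made up for illustration) showing that two separately constructed generators with the same seed and the same sequence of calls return the same numbers.

```java
import java.util.Arrays;
import java.util.Random;

final class SameSeedSketch {
    // Create a fresh generator from `seed` and make the same n calls on it.
    static long[] draws(long seed, int n) {
        Random rng = new Random(seed);
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextLong();
        }
        return out;
    }

    public static void main(String[] args) {
        long seed = 1668660039331L;
        // Two independent instances, same seed, same calls: identical sequences.
        System.out.println(Arrays.equals(draws(seed, 5), draws(seed, 5)));
    }
}
```

Note the condition in the docs: the sequence of method calls must also be the same. Any code path that makes an extra call on one system but not the other will break the match even though the generator itself is portable.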

If you do find something that's different across systems and not attributable to anything else, definitely let us know and we can try to figure out what's going on and fix it.

KevinCranmer commented 1 year ago

Ah yeah looks like I forgot to fetch on my personal machine.

Unfortunately, I'm still seeing differences in a Linux Docker container, which was what originally caught my eye. https://www.diffchecker.com/Yf9h84fT/

Perhaps this could be due to errors that appear on Linux for me but not on my Mac:

13:00:34  ./synthea/src/main/java/org/mitre/synthea/export/rif/SNFExporter.java:333: error: unmappable character (0xE2) for encoding US-ASCII
13:00:34       * Instrument 3.0 User???s Manual", see
13:00:34                            ^
13:00:34  ./synthea/src/main/java/org/mitre/synthea/export/rif/SNFExporter.java:333: error: unmappable character (0x80) for encoding US-ASCII
13:00:34       * Instrument 3.0 User???s Manual", see
13:00:34                             ^
13:00:34  ./synthea/src/main/java/org/mitre/synthea/export/rif/SNFExporter.java:333: error: unmappable character (0x99) for encoding US-ASCII
13:00:34       * Instrument 3.0 User???s Manual", see

dehall commented 1 year ago

Ok yeah, I can confirm the same thing. Runs on bare metal on my Mac are always consistent, runs with the exact same code copied into a Docker image are always consistent, but between the two it's not identical. At a glance my Mac output matches the left in your diff and my Docker output matches the right, so to me that suggests a difference by OS rather than by individual machine. I'll keep digging.

dehall commented 1 year ago

Ok, I've pushed one more update to the branch that should hopefully fix everything. It turns out HashMap iteration order tends to be consistent on a given OS but different across OSes. I've updated the spots I found in testing, but it's very possible I missed some that might get hit via different patient trajectories.
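As a sketch of the kind of change involved (this is not the actual patch on the branch; the class StableMapPickSketch and its methods are hypothetical), the example below picks an entry from a map using a seeded generator. If the pick depends on the map's iteration order, two platforms that iterate differently get different results from the same draw. Sorting the keys first removes that dependency. LinkedHashMap is used here only to imitate two different iteration orders on one machine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

final class StableMapPickSketch {
    // Order-sensitive: the result depends on whatever order the map iterates in.
    static String pickByIterationOrder(Map<String, Integer> options, Random rng) {
        List<String> keys = new ArrayList<>(options.keySet());
        return keys.get(rng.nextInt(keys.size()));
    }

    // Order-independent: sort the keys so the same draw maps to the same key everywhere.
    static String pickSorted(Map<String, Integer> options, Random rng) {
        List<String> keys = new ArrayList<>(options.keySet());
        Collections.sort(keys);
        return keys.get(rng.nextInt(keys.size()));
    }

    // LinkedHashMap keeps insertion order, which lets us imitate two platforms
    // that iterate over the same entries in different orders.
    static Map<String, Integer> inOrder(String... keys) {
        Map<String, Integer> m = new LinkedHashMap<>();
        for (String k : keys) {
            m.put(k, k.length());
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = inOrder("alpha", "beta", "gamma");
        Map<String, Integer> b = inOrder("beta", "gamma", "alpha");
        System.out.println(pickByIterationOrder(a, new Random(1))
            + " vs " + pickByIterationOrder(b, new Random(1)));
        System.out.println(pickSorted(a, new Random(1))
            + " vs " + pickSorted(b, new Random(1)));
    }
}
```

Using a TreeMap, or a LinkedHashMap populated in a fixed order, would be alternative ways to get the same stability.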

One other thing to make sure of is that your two instances are running with the same time zone. The internal simulation uses UTC but all dates are exported in the system time zone, so that can result in records that are conceptually equivalent but textually different.
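A minimal sketch of the time zone effect, assuming a simple formatter rather than Synthea's actual exporter (the class ExportZoneSketch and the method export are hypothetical): the same instant is written out with two different offsets, giving different text that still parses back to the same moment.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class ExportZoneSketch {
    // Format one instant as an ISO-8601 timestamp in the given offset.
    static String export(Instant t, ZoneOffset zone) {
        return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(t.atOffset(zone));
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2023-01-12T02:30:00Z");
        String utc = export(t, ZoneOffset.UTC);
        String minusFive = export(t, ZoneOffset.ofHours(-5));
        System.out.println(utc);
        System.out.println(minusFive);
        // Textually different, but both parse back to the same instant.
        System.out.println(utc.equals(minusFive));
        System.out.println(OffsetDateTime.parse(utc).toInstant()
            .equals(OffsetDateTime.parse(minusFive).toInstant()));
    }
}
```

A plain text diff flags these as changes, so running both environments with the same time zone setting avoids that noise.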

dehall commented 1 year ago

@KevinCranmer just wanted to make sure you saw this latest update above -- hopefully everything should be consistent across OSes as well now

KevinCranmer commented 1 year ago

@dehall Hey I've unfortunately found some more differences... https://www.diffchecker.com/UJJMXQ6K/

The right side is on Linux, the left on Mac. Both ran today. Looks like there are some differences in dates (expected, not an issue), differences in Observation values, and the right side has some additional resources that seem to throw off the IDs afterwards.

The synthea command I'm running: ./run_synthea -s 1668660039331 -cs 1668660039331 -r 20230101 -p 79 Florida

dehall commented 1 year ago

Ahh ok thanks for the update. I'll be able to look into this early next week

dehall commented 1 year ago

@KevinCranmer I just pushed up a new branch for the latest fixes -- more_repro_fixes. I wasn't able to replicate your exact instance, but I did see a few similar instances. I'm pretty sure the issue was with age calculation being based on the local system time zone, so for young children it was possible for their age in months to be different and their paths would diverge. (Technically this is possible for anyone of any age, but it is a lot more likely both to occur and to be meaningful for young children.) Can you give that branch a shot and confirm it fixes the latest issue?
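The age effect can be shown with a small sketch. This is an illustration under assumptions, not the code that was changed on the branch: the class AgeZoneSketch and the method ageInMonths are hypothetical. The same birth instant and the same "now" instant give a different whole-month age depending on which offset is used to turn them into calendar dates.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

final class AgeZoneSketch {
    // Whole months between two instants, using calendar dates in the given offset.
    static long ageInMonths(Instant birth, Instant now, ZoneOffset zone) {
        LocalDate birthDate = birth.atOffset(zone).toLocalDate();
        LocalDate today = now.atOffset(zone).toLocalDate();
        return ChronoUnit.MONTHS.between(birthDate, today);
    }

    public static void main(String[] args) {
        Instant birth = Instant.parse("2022-10-15T12:00:00Z");
        Instant now = Instant.parse("2023-01-15T02:00:00Z");
        // In UTC "now" is already Jan 15; at UTC-5 it is still Jan 14.
        System.out.println(ageInMonths(birth, now, ZoneOffset.UTC));         // 3
        System.out.println(ageInMonths(birth, now, ZoneOffset.ofHours(-5))); // 2
    }
}
```

If a module branches on age in months, a one-month difference like this is enough to send the same patient down a different path, which then changes every later random draw.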
