hapifhir / hapi-fhir

🔥 HAPI FHIR - Java API for HL7 FHIR Clients and Servers
http://hapifhir.io
Apache License 2.0

Throughput difference between fresh schema and heavily populated schema #3783

Open dyoung-work opened 2 years ago

dyoung-work commented 2 years ago

Describe the bug

Hi maintainers! I'm having some throughput issues with HAPI after upgrading from 5.7.0 to 6.0.1, and I was hoping someone might be able to shed some light on what's going on.

We're trying to ingest 100,000 bundles, each of which has only 4-5 resources. With the configuration changes we've made to improve ingest throughput, the run was taking ~40 minutes. After updating to 6.0.1, however, it's taking closer to 70-90 minutes. The interesting thing is that this performance hit only applies to a Postgres schema that was also used under 5.7.0 (and therefore had the migrations applied to it). I did a test with a fresh schema on 6.0.1 and the throughput was back to our norm. The two schemas are running on the same Postgres host, and the FHIR server configuration isn't changed at all, except for the name of the schema it's pointing to. I should also note that these tests were against a single FHIR server (no clustering).
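For context, this kind of ingest is just repeated FHIR transaction POSTs. A minimal sketch using the HAPI FHIR Java client against an R4 server (the endpoint URL is a placeholder, and the real bundles carry 4-5 resources each):

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Bundle;
import org.hl7.fhir.r4.model.Patient;

public class IngestSketch {
    public static void main(String[] args) {
        FhirContext ctx = FhirContext.forR4();
        // Placeholder endpoint; with URL-based partitioning the tenant ID
        // would typically appear in the base URL path.
        IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/fhir");

        Bundle bundle = new Bundle();
        bundle.setType(Bundle.BundleType.TRANSACTION);

        Patient patient = new Patient();
        patient.addName().setFamily("Example");
        bundle.addEntry()
                .setResource(patient)
                .getRequest()
                .setMethod(Bundle.HTTPVerb.POST)
                .setUrl("Patient");
        // ...the real bundles add 3-4 more resource entries here...

        // One round trip per bundle; 100,000 of these make up the test run.
        Bundle response = client.transaction().withBundle(bundle).execute();
        System.out.println(response.getEntry().size() + " entries processed");
    }
}
```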

I'm attaching a couple of screenshots of the CPU consumption graph from Grafana during the ingest. The relatively level CPU consumption is from the fresh schema; the sawtooth pattern is from the migrated schema. Blue = load generator, green = FHIR server. There aren't any obvious errors in the FHIR logs during the ingest with the migrated schema, and we get the correct number of results on the other end.

Fresh schema: [screenshot: level CPU consumption]

Migrated schema: [screenshot: sawtooth CPU consumption]

In case it's relevant, here are the changes we've made to hit our current throughput numbers (but feel free to suggest more; we'd love to learn a new tweak):
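For illustration only, a sketch of the sort of DaoConfig write-path knobs people commonly reach for when tuning bulk ingest on HAPI FHIR 6.0.x; these are hedged examples, not necessarily the changes referred to above:

```java
import ca.uhn.fhir.jpa.api.config.DaoConfig;

public class IngestTuningSketch {
    // Illustrative assumptions, not the actual change list from this report.
    public static DaoConfig tunedDaoConfig() {
        DaoConfig daoConfig = new DaoConfig();
        // Enables write-path optimizations intended for bulk loading.
        daoConfig.setMassIngestionMode(true);
        // Skips delete-related reference checking on writes (only safe if
        // resource deletion is genuinely not needed).
        daoConfig.setDeleteEnabled(false);
        return daoConfig;
    }
}
```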

Is there a chance that there's some odd interaction happening for migrated schemas? I redid the migrations manually as a sanity check, but the behaviour described above didn't change. I'm hoping there's a way to keep the existing data without giving up the better throughput.

To Reproduce

  1. Use a Postgres schema created under 5.7.0
  2. Ingest data under tenant A and see normal performance
  3. Upgrade to 6.0.1, retaining original data
  4. Ingest data under tenant B and see impacted performance
  5. With no other configuration changes, create a fresh schema and point the FHIR server at it
  6. Ingest data under tenant C and see normal performance, as in step 2

Expected behavior

Using a migrated schema instead of a fresh one shouldn't impact ingest performance.

Environment:

Additional Info

I've done multiple runs, so I know it's not a bad run. The DB host and the hosts everything else runs on are dedicated to my testing (and can easily handle the current workload).

jamesagnew commented 2 years ago

Are you able to export the schemas in each to see if there are any obvious differences?
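One way to do that comparison without full dumps, as a sketch: pull each schema's index definitions from Postgres's pg_indexes view and diff the output. The JDBC URL, credentials, and schema names here are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SchemaIndexDiff {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; requires the Postgres JDBC driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/fhir", "fhir", "secret")) {
            // Placeholder schema names for the two schemas being compared.
            for (String schema : new String[] {"fresh_schema", "migrated_schema"}) {
                System.out.println("Indexes in " + schema + ":");
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT tablename, indexname, indexdef FROM pg_indexes "
                                + "WHERE schemaname = ? ORDER BY tablename, indexname")) {
                    ps.setString(1, schema);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("%s  %s%n  %s%n",
                                    rs.getString("tablename"),
                                    rs.getString("indexname"),
                                    rs.getString("indexdef"));
                        }
                    }
                }
            }
        }
    }
}
```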

dyoung-work commented 2 years ago

EDIT: Removed the exported schema files, as this turned out to be a coincidence and I don't want people to waste time digging through them. Also updated the title.

dyoung-work commented 2 years ago

It's starting to look like the actual culprit (as confirmed by wiping my test DB and starting fresh on 6.0.1) is the number of resources in the DB. I started to see the sawtooth CPU consumption again, and its intensity has been increasing from run to run over the last several days. The row count in hfj_resource climbing into the millions correlates very strongly with this behaviour.

Note, however, that this resource count is across multiple tenants, and each test is only interacting with a single new tenant. Is it possible that there's a missing index or missing filter when resources are added?
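A quick way to pin down that correlation, as a sketch: count hfj_resource rows grouped by partition (tenant) and resource type directly in Postgres. Connection details are placeholders, and the partition_id column assumes partitioning is enabled:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResourceCountSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/fhir", "fhir", "secret");
             Statement st = conn.createStatement();
             // partition_id maps to the tenant when partitioning is enabled.
             ResultSet rs = st.executeQuery(
                     "SELECT partition_id, res_type, COUNT(*) AS cnt "
                             + "FROM hfj_resource "
                             + "GROUP BY partition_id, res_type "
                             + "ORDER BY cnt DESC")) {
            while (rs.next()) {
                System.out.printf("partition=%s type=%s count=%d%n",
                        rs.getString("partition_id"),
                        rs.getString("res_type"),
                        rs.getLong("cnt"));
            }
        }
    }
}
```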