hapifhir / hapi-fhir

🔥 HAPI FHIR - Java API for HL7 FHIR Clients and Servers
http://hapifhir.io
Apache License 2.0

Throughput difference between fresh schema and heavily populated schema #3783

Open dyoung-work opened 2 years ago

dyoung-work commented 2 years ago

Describe the bug

Hi maintainers! I'm having some throughput issues with HAPI after upgrading from 5.7.0 to 6.0.1, and I was hoping someone might be able to shed some light on what's going on.

We're trying to ingest 100,000 bundles, each of which has only 4-5 resources. With the configuration changes we've made to improve ingest throughput, the run was taking ~40 minutes. After updating to 6.0.1, however, it's taking closer to 70-90 minutes. The interesting thing is that this performance hit only applies to a Postgres schema that was also used under 5.7.0 (and therefore had the migrations applied to it). I did a test with a fresh schema on 6.0.1 and the throughput was back to our norm. The two schemas are running on the same Postgres host, and the FHIR server configuration isn't changed at all, except for the name of the schema it's pointing to. I should also note that these tests were against a single FHIR server (no clustering).
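For context, this kind of ingest is just repeated FHIR transaction POSTs. A minimal sketch using the HAPI FHIR Java client against an R4 server (the endpoint URL is a placeholder, and the real bundles carry 4-5 resources each):

```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.r4.model.Bundle;
import org.hl7.fhir.r4.model.Patient;

public class IngestSketch {
    public static void main(String[] args) {
        FhirContext ctx = FhirContext.forR4();
        // Placeholder endpoint; with URL-based partitioning the tenant ID
        // would typically appear in the base URL path.
        IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/fhir");

        Bundle bundle = new Bundle();
        bundle.setType(Bundle.BundleType.TRANSACTION);

        Patient patient = new Patient();
        patient.addName().setFamily("Example");
        bundle.addEntry()
                .setResource(patient)
                .getRequest()
                .setMethod(Bundle.HTTPVerb.POST)
                .setUrl("Patient");
        // ...the real bundles add 3-4 more resource entries here...

        // One round trip per bundle; 100,000 of these make up the test run.
        Bundle response = client.transaction().withBundle(bundle).execute();
        System.out.println(response.getEntry().size() + " entries processed");
    }
}
```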

I'm attaching a couple of screenshots of the CPU consumption graph from Grafana during the ingest. The relatively level CPU consumption is from the fresh schema; the sawtooth pattern is from the migrated schema. Blue = load generator, green = FHIR server. There aren't any obvious errors in the FHIR logs during the ingest with the migrated schema, and we get the correct number of results on the other end.

Fresh schema: [screenshot: level CPU consumption]

Migrated schema: [screenshot: sawtooth CPU consumption]

In case it's relevant, here are the changes we've made to hit our current throughput numbers (but feel free to suggest more; we'd love to learn a new tweak):
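For illustration only, a sketch of the sort of DaoConfig write-path knobs people commonly reach for when tuning bulk ingest on HAPI FHIR 6.0.x; these are hedged examples, not necessarily the changes referred to above:

```java
import ca.uhn.fhir.jpa.api.config.DaoConfig;

public class IngestTuningSketch {
    // Illustrative assumptions, not the actual change list from this report.
    public static DaoConfig tunedDaoConfig() {
        DaoConfig daoConfig = new DaoConfig();
        // Enables write-path optimizations intended for bulk loading.
        daoConfig.setMassIngestionMode(true);
        // Skips delete-related reference checking on writes (only safe if
        // resource deletion is genuinely not needed).
        daoConfig.setDeleteEnabled(false);
        return daoConfig;
    }
}
```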

Is there a chance that there's some odd interaction happening for migrated schemas? I redid the migrations manually as a sanity check, but the behaviour described above didn't change. I'm hoping there's a way to keep the existing data without giving up the better throughput.

To Reproduce

  1. Use a Postgres schema created under 5.7.0
  2. Ingest data under tenant A and see normal performance
  3. Upgrade to 6.0.1, retaining original data
  4. Ingest data under tenant B and see impacted performance
  5. With no other configuration changes, create a fresh schema and point the FHIR server at it
  6. Ingest data under tenant C and see normal performance, as in step 2

Expected behavior

Using a migrated schema instead of a fresh one shouldn't impact ingest performance.

Environment:

Additional Info

I've done multiple runs, so I know it's not a bad run. The DB host and the hosts everything else runs on are dedicated to my testing (and can easily handle the current workload).

jamesagnew commented 2 years ago

Are you able to export the schemas in each to see if there are any obvious differences?
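One way to do that comparison without full dumps, as a sketch: pull each schema's index definitions from Postgres's pg_indexes view and diff the output. The JDBC URL, credentials, and schema names here are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SchemaIndexDiff {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; requires the Postgres JDBC driver.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/fhir", "fhir", "secret")) {
            // Placeholder schema names for the two schemas being compared.
            for (String schema : new String[] {"fresh_schema", "migrated_schema"}) {
                System.out.println("Indexes in " + schema + ":");
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT tablename, indexname, indexdef FROM pg_indexes "
                                + "WHERE schemaname = ? ORDER BY tablename, indexname")) {
                    ps.setString(1, schema);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("%s  %s%n  %s%n",
                                    rs.getString("tablename"),
                                    rs.getString("indexname"),
                                    rs.getString("indexdef"));
                        }
                    }
                }
            }
        }
    }
}
```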

dyoung-work commented 2 years ago

EDIT: Removed the exported schema files, as this turned out to be a coincidence and I don't want people to waste time digging through them. Also updated the title.

dyoung-work commented 2 years ago

It's starting to look like the actual culprit (as confirmed by wiping my test DB and starting fresh on 6.0.1) is the number of resources in the DB. I started to see the sawtooth CPU consumption again, and its intensity has been increasing from run to run over the last several days. The row count in hfj_resource climbing into the millions correlates very strongly with this behaviour.

Note, however, that this resource count is across multiple tenants, and each test is only interacting with a single new tenant. Is it possible that there's a missing index or missing filter when resources are added?
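A quick way to pin down that correlation, as a sketch: count hfj_resource rows grouped by partition (tenant) and resource type directly in Postgres. Connection details are placeholders, and the partition_id column assumes partitioning is enabled:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResourceCountSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/fhir", "fhir", "secret");
             Statement st = conn.createStatement();
             // partition_id maps to the tenant when partitioning is enabled.
             ResultSet rs = st.executeQuery(
                     "SELECT partition_id, res_type, COUNT(*) AS cnt "
                             + "FROM hfj_resource "
                             + "GROUP BY partition_id, res_type "
                             + "ORDER BY cnt DESC")) {
            while (rs.next()) {
                System.out.printf("partition=%s type=%s count=%d%n",
                        rs.getString("partition_id"),
                        rs.getString("res_type"),
                        rs.getLong("cnt"));
            }
        }
    }
}
```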