JPA server stops responding under load when caching is disabled or configured to low timeout

jarimayenburg commented 3 years ago

Describe the bug When configuring the JPA server to not use the search cache (reuse_cached_search_results_millis: 0 in the starter project) or a search cache with relatively small timeout (reuse_cached_search_results_millis: 10000), the server stops responding with a load of a certain amount of parallel requests.

To Reproduce Steps to reproduce the behavior:

Run the stock JPA server starter project with a MySQL database
Perform 10,000 or 100,000 parallel requests, observe that while the responses take long, the server never stops processing the requests.
Now update the config reuse_cached_search_results_millis to a value of 10000 or below (I used 0).
Perform the same amount of requests, observe that the server stops processing the requests after a little while. With reuse_cached_search_results_millis set to 0, my instance already stopped responding after less than 100 simultaneous requests.

Expected behavior I'd expect the server to respond slower, but never stop processing the incoming requests.

Environment:

HAPI FHIR Version: 5.4.2

Additional context Re-enabling the cache to the default of 60 seconds is no option for us, because that causes newly created resources to not be found in searches within 60 seconds after creation, which is a necessity for us.

jarimayenburg commented 3 years ago

Closing for now because the cause and reproduction steps aren't quite correct yet

jarimayenburg commented 3 years ago

Did some more testing today, above causes are correct after all. The combination of MySQL and a low reuse_cached_search_results_millis does the trick.

jamesagnew commented 3 years ago

Can you attach a thread dump taken when the system is hung?

On Tue, Aug 17, 2021 at 5:50 AM Jari Maijenburg @.***> wrote:

Reopened #2896 https://github.com/hapifhir/hapi-fhir/issues/2896.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hapifhir/hapi-fhir/issues/2896#event-5167863723, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2N7HNLT7WKAOPO7TSEFT3T5IWH7ANCNFSM5CIESI4Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

jarimayenburg commented 3 years ago

Here you go

dump.txt

jarimayenburg commented 3 years ago

After waiting for a while, Hikari starts to give timeouts

hapi-fhir-jpaserver-start    | 2021-08-17 10:26:05.784 [http-nio-8080-exec-31] INFO  fhirtest.access [LoggingInterceptor.java:160] ERROR - GET http://localhost:8080/fhir/Patient
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.652 [search_coord_9] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.652 [search_coord_9] ERROR o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:142] HikariPool-1 - Connection is not available, request timed out after 30000ms.
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.654 [search_coord_14] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.654 [search_coord_14] ERROR o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:142] HikariPool-1 - Connection is not available, request timed out after 30000ms.
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.655 [search_coord_12] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.655 [search_coord_12] ERROR o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:142] HikariPool-1 - Connection is not available, request timed out after 30000ms.
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.656 [search_coord_10] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.656 [search_coord_10] ERROR o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:142] HikariPool-1 - Connection is not available, request timed out after 30000ms.
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.655 [search_coord_19] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.657 [search_coord_19] ERROR o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:142] HikariPool-1 - Connection is not available, request timed out after 30000ms.
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.655 [search_coord_15] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.656 [search_coord_7] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.656 [search_coord_13] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null
hapi-fhir-jpaserver-start    | 2021-08-17 10:26:06.657 [search_coord_3] WARN  o.h.e.jdbc.spi.SqlExceptionHelper [SqlExceptionHelper.java:137] SQL Error: 0, SQLState: null

jarimayenburg commented 3 years ago

For ease of reproduction, here's a minimal patch on the starter project.

diff --git a/docker-compose.yml b/docker-compose.yml
index fcf1f2b..4ea3d1b 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -4,10 +4,15 @@ services:
     build: .
     container_name: hapi-fhir-jpaserver-start
     restart: on-failure
+    environment:
+      SPRING_DATASOURCE_URL: jdbc:mysql://hapi-fhir-mysql:3306/hapi?serverTimezone=Europe/Amsterdam
+      SPRING_DATASOURCE_USERNAME: admin
+      SPRING_DATASOURCE_PASSWORD: admin
+      SPRING_DATASOURCE_DRIVERCLASSNAME: com.mysql.cj.jdbc.Driver
     ports:
       - "8080:8080"
   hapi-fhir-mysql:
-    image: mysql:latest
+    image: mysql:5.7
     container_name: hapi-fhir-mysql
     #https://dev.mysql.com/doc/refman/8.0/en/identifier-case-sensitivity.html
     command: --lower_case_table_names=1
diff --git a/src/main/resources/application.yaml b/src/main/resources/application.yaml
index 1ff811a..82df865 100644
--- a/src/main/resources/application.yaml
+++ b/src/main/resources/application.yaml
@@ -108,7 +108,7 @@ hapi:
 #    max_binary_size: 104857600
 #    max_page_size: 200
 #    retain_cached_searches_mins: 60
-#    reuse_cached_search_results_millis: 60000
+    reuse_cached_search_results_millis: 0
     tester:
         home:
           name: Local Tester

jamesagnew commented 3 years ago

That thread dump looks to me like threads are blocking waiting for connections from the pool. Can you try increasing your connection pool size?

jarimayenburg commented 3 years ago

Increasing the Hikari connection pool fixes this issue in the starter project, but not in our FHIR server, which is a fork of the starter project with mainly some config changes. In that project, even with a connection pool with a size of 100 the server stops responding with +/- 70 simultaneous requests. Let me figure out if I can reproduce that in the starter project with a greater pool size.

jamesagnew commented 3 years ago

FWIW the issue at least in the starter project is that Hibernate has to compensate for the lack of native database sequences in MySQL by using a table to achieve the same thing. This causes the occasional request to require 2 connections because of the way hibernate accomplishes this, so the problem is that if your DB pool isn't bigger than your server thread pool you get what is effectively a deadlock (it won't show up as a real deadlock in any management tools, but that's basically what it is).

If your real system shows lots of threads blocked waiting on PooledOptimizer, you're hitting the same issue.

To be honest, I think we should probably put stronger guidance against using MySQL on the starter project - its performance is lousy compared to Postgres for all kinds of reasons including this one.

jarimayenburg commented 3 years ago

Thanks for the info. Our real system does indeed show lots of threads waiting on PooledOptimizer, so it's the same issue.

I just tried Postgres for our project and that's performing like a champ. I'll look into migrating our system to Postgres using something like pgloader.

I agree with the stronger guidance against MySQL and in favor of something like PostgreSQL, especially for production scenarios.

jamesagnew commented 3 years ago

One thing to note - If you migrate your data to PG it will work, but the one thing you'll need to do manually is to set the DB sequences to the same values from the MySQL ID generation table.

jarimayenburg commented 3 years ago

Ok thanks for the heads-up. Do you mean the values in the seq_* tables? Aren't those values copied to the Postgres DB in the migration itself?

jamesagnew commented 3 years ago

Yup, those are the ones. As far as I know those won't auto migrate, only because they are a different kind of thing across the platforms (tables in MySQL vs sequences in Postgres)

On Tue, Aug 17, 2021 at 11:57 AM Jari Maijenburg @.***> wrote:

Ok thanks for the heads-up. Do you mean the values in the seq_* tables? Aren't those values copied to the Postgres DB in the migration itself?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hapifhir/hapi-fhir/issues/2896#issuecomment-900421392, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2N7HOIVQGWFOMKGMX52CDT5KBH5ANCNFSM5CIESI4Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

smilenannan commented 2 years ago

@jamesagnew

Thank you for this discussion. I decided to use PostgreSQL instead of MySQL. Is there any versions of PostgreSQL which are currently supported by JPA server?

hapifhir / hapi-fhir

JPA server stops responding under load when caching is disabled or configured to low timeout #2896