tolgee / tolgee-platform

Developer & translator friendly web-based localization platform
https://tolgee.io
Other
2.04k stars 203 forks source link

Tolgee runs out of heap space when importing a larger number of translations #2019

Closed siepkes closed 10 months ago

siepkes commented 11 months ago

Tolgee runs out of heap space when importing a couple of thousand translation keys. I think this is the same issue as described in #1930.

I think this is caused by Tolgee loading it's entire job queue in to memory. This includes entries it is not currently processing (i.e. the ones which are queued). I haven't investigated further yet but for some reason these job queue entries are fairly large in memory, 1MB+. This means when importing a translation file with for example 3000 entries Tolgee will run out of heap space even when running with a 2GB heap. Once the job queue is saturated Tulgee will never recover from it without manual intervention because on every reboot it will load the job queue and go out of heap space.

During testing we did not have these issues with the Docker image. I have a hunch this is related to the machine translation (which is not enabled in the Docker image). But I haven't investigated this yet.

After making a heap dump and importing it in the Eclipse Memory analyzer one can see the heap consists almost solely out of PostgreSQL tuples:

Screenshot from 2023-12-11 13-31-54

This is the stacktrace associated with these objects:

MessageBroker-2
  at java.io.InputStream.skip(J)J (InputStream.java:545)
  at org.postgresql.core.VisibleBufferedInputStream.skip(J)J (VisibleBufferedInputStream.java:278)
  at org.postgresql.core.PGStream.skip(I)V (PGStream.java:657)
  at org.postgresql.core.PGStream.receiveTupleV3()Lorg/postgresql/core/Tuple; (PGStream.java:609)
  at org.postgresql.core.v3.QueryExecutorImpl.processResults(Lorg/postgresql/core/ResultHandler;IZ)V (QueryExecutorImpl.java:2332)
  at org.postgresql.core.v3.QueryExecutorImpl.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;Lorg/postgresql/core/ResultHandler;IIIZ)V (QueryExecutorImpl.java:355)
  at org.postgresql.jdbc.PgStatement.executeInternal(Lorg/postgresql/core/CachedQuery;Lorg/postgresql/core/ParameterList;I)V (PgStatement.java:490)
  at org.postgresql.jdbc.PgStatement.execute(Lorg/postgresql/core/CachedQuery;Lorg/postgresql/core/ParameterList;I)V (PgStatement.java:408)
  at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(I)Z (PgPreparedStatement.java:167)
  at org.postgresql.jdbc.PgPreparedStatement.executeQuery()Ljava/sql/ResultSet; (PgPreparedStatement.java:119)
  at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery()Ljava/sql/ResultSet; (ProxyPreparedStatement.java:52)
  at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery()Ljava/sql/ResultSet; (HikariProxyPreparedStatement.java)
  at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(Ljava/sql/PreparedStatement;)Ljava/sql/ResultSet; (ResultSetReturnImpl.java:57)
  at org.hibernate.loader.Loader.getResultSet(Ljava/sql/PreparedStatement;Lorg/hibernate/engine/spi/RowSelection;Lorg/hibernate/dialect/pagination/LimitHandler;ZLorg/hibernate/engine/spi/SharedSessionContractImplementor;)Ljava/sql/ResultSet; (Loader.java:2322)
  at org.hibernate.loader.Loader.executeQueryStatement(Ljava/lang/String;Lorg/hibernate/engine/spi/QueryParameters;ZLjava/util/List;Lorg/hibernate/engine/spi/SharedSessionContractImplementor;)Lorg/hibernate/loader/Loader$SqlStatementWrapper; (Loader.java:2075)
  at org.hibernate.loader.Loader.executeQueryStatement(Lorg/hibernate/engine/spi/QueryParameters;ZLjava/util/List;Lorg/hibernate/engine/spi/SharedSessionContractImplementor;)Lorg/hibernate/loader/Loader$SqlStatementWrapper; (Loader.java:2037)
  at org.hibernate.loader.Loader.doQuery(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;ZLorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:956)
  at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;ZLorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:357)
  at org.hibernate.loader.Loader.doList(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;Lorg/hibernate/transform/ResultTransformer;)Ljava/util/List; (Loader.java:2868)
  at org.hibernate.loader.Loader.doList(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;)Ljava/util/List; (Loader.java:2850)
  at org.hibernate.loader.Loader.listIgnoreQueryCache(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;)Ljava/util/List; (Loader.java:2682)
  at org.hibernate.loader.Loader.list(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;Ljava/util/Set;[Lorg/hibernate/type/Type;)Ljava/util/List; (Loader.java:2677)
  at org.hibernate.loader.hql.QueryLoader.list(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;)Ljava/util/List; (QueryLoader.java:540)
  at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/engine/spi/QueryParameters;)Ljava/util/List; (QueryTranslatorImpl.java:400)
  at org.hibernate.engine.query.spi.HQLQueryPlan.performList(Lorg/hibernate/engine/spi/QueryParameters;Lorg/hibernate/engine/spi/SharedSessionContractImplementor;)Ljava/util/List; (HQLQueryPlan.java:218)
  at org.hibernate.internal.SessionImpl.list(Ljava/lang/String;Lorg/hibernate/engine/spi/QueryParameters;)Ljava/util/List; (SessionImpl.java:1459)
  at org.hibernate.query.internal.AbstractProducedQuery.doList()Ljava/util/List; (AbstractProducedQuery.java:1649)
  at org.hibernate.query.internal.AbstractProducedQuery.list()Ljava/util/List; (AbstractProducedQuery.java:1617)
  at org.hibernate.query.Query.getResultList()Ljava/util/List; (Query.java:165)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (NativeMethodAccessorImpl.java(Native Method))
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:568)
  at org.springframework.orm.jpa.SharedEntityManagerCreator$DeferredQueryInvocationHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (SharedEntityManagerCreator.java:410)
  at jdk.proxy2.$Proxy228.getResultList()Ljava/util/List; ()
  at io.tolgee.batch.BatchJobChunkExecutionQueue.populateQueue()V (BatchJobChunkExecutionQueue.kt:63)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (NativeMethodAccessorImpl.java(Native Method))
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:568)
  at org.springframework.scheduling.support.ScheduledMethodRunnable.run()V (ScheduledMethodRunnable.java:84)
  at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run()V (DelegatingErrorHandlingRunnable.java:54)
  at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:539)
  at java.util.concurrent.FutureTask.runAndReset()Z (FutureTask.java:305)
  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:305)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1136)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:635)
  at java.lang.Thread.run()V (Thread.java:833)

This is what the job queue looks like:

postgres=# select count(*) from tolgee.tolgee_batch_job_chunk_execution;
 count 
-------
  5543
(1 row)

This originates from the Tolgee job queue code:

https://github.com/tolgee/tolgee-platform/blob/fa9d12497a0cba57b70cc781ee68cfd6f70cc9f1/backend/data/src/main/kotlin/io/tolgee/batch/BatchJobChunkExecutionQueue.kt#L48-L68

A possible fix would be to have Tolgee retrieve batches (of for example 10 or 20 entries) instead of retrieving the entire queue.

siepkes commented 11 months ago

We (@GeertZondervan and I) did some additional troubleshooting and thought it might be helpful to share our findings here.

Overall I'm guessing by the use of LockOptions.SKIP_LOCKED the intention is to create a simple queue implementation in PostgreSQL (something similar as described here)?

The first thing I noticed is that the queue implementation will always retrieve all jobs. When implementing a simple queue with PostgreSQL's SKIP LOCKED usually you have a worker retrieve (and lock) a single job (i.e. row). Using SKIP LOCKED in your SQL query ensures that a worker will always get the next available job which isn't being worked on. However the current implementation retrieves all jobs and puts them in a local queue in memory and then starts working on that.

The main reason why a high volume of jobs crashed the system (by running out of heap space) was because Hibernate was retrieving far too much fields (i.e. not just a job ID and job status) from the database and then tried to stuff everything into a resultset. This is the point where the application goes out of heap space.

We thought we could easily fix this by having Hibernate detach only the things it needed. While this did fix the issue with going out of heap space it brought to light another issue. The application now locks-up after boot. I think this is because the queue execution is manually triggered here on application startup:

https://github.com/tolgee/tolgee-platform/blob/fa9d12497a0cba57b70cc781ee68cfd6f70cc9f1/backend/data/src/main/kotlin/io/tolgee/batch/BatchJobActionService.kt#L56-L61

This blocks the thread in Spring boots which fires events (I don't think this thread is suitable for doing such long running work). This also has another side-effect: The job scheduler doesn't know about it. Meaning in 60 seconds the @Scheduled(fixedRate = 60000) of BatchJobChunkExecutionQueue.kt will kick in even if the manual invocation is still running. Using something like @Scheduled(fixedRate = 60000, initialDelay = 0) might be more appropriate.

JanCizmar commented 11 months ago

Hey! Thanks a lot for the investigation!

So if I understand it correctly, the issue is in the populateQueue() method, right? So what about if we didn't return an entity, but some kind of view containing only the fields we need? That might save some memory...

@Scheduled(fixedRate = 60000, initialDelay = 0)

Yep!

siepkes commented 11 months ago

So if I understand it correctly, the issue is in the populateQueue() method, right?

Yes.

So what about if we didn't return an entity, but some kind of view containing only the fields we need? That might save some memory...

Yes, that would probably prevent Tolgee from running out of heap space. The issue is that currently Hibernate apparently joins in the table with the JSONB parameter field for every entry in the tolgee_batch_job_chunk_execution table (which in our case had 5500+ entries). Which leads to the large number of PostgreSQL JDBC tuple instances of about 1MB a piece in our JVM heap.

Combined with not blocking the Spring Boot event thread by using@scheduled(fixedRate = 60000, initialDelay = 0) and not calling batchJobChunkExecutionQueue.populateQueue() I think Tolgee should run normally even with a large queue.

cyyynthia commented 11 months ago

Wouldn't it be better to revise the task population in a way the Tolgee server would both avoid retrieving unnecessary data (easy doable with a new map in the Hibernate query or a special view, and never fetch all the tasks but only parts of it?

The server would for instance pick N jobs that are pending, mark them as "taken" (so other instances in a distributed context would not pick them too), and run these N jobs. Whenever the job queue reaches M jobs with M < N, the server would try to fetch more jobs to execute and start running them (M can be non-zero so the server can start fetching jobs before it runs out and ensure there's never a job exhaustion, so it keeps crunching through jobs).

Making sure a the "taken" status does automatically "expire" after a certain time (e.g. could not be executed because the node crashed or something), so it can be picked back up later on again. Unless there's something I'm missing, this should let the jobs be nicely distributed without requiring explicit node communication and/or awareness of their existence (except when the amount of jobs is < N, which should be rare and can be worked around by having servers who queue the job immediately mark them as taken if their job queue is not full, relying on the load distribution to act as a distribution layer). This would also make the memory footprint of the job queue mostly inexistent for a virtually infinitely large queue.

siepkes commented 11 months ago

The server would for instance pick N jobs that are pending, mark them as "taken" (so other instances in a distributed context would not pick them too), and run these N jobs. Whenever the job queue reaches M jobs with M < N, the server would try to fetch more jobs to execute and start running them (M can be non-zero so the server can start fetching jobs before it runs out and ensure there's never a job exhaustion, so it keeps crunching through jobs).

Sounds like a solid idea. It's what a lot of (Postgre)SQL job queue implementations do and what the SKIP LOCKED SQL instruction is mostly used for.

This article: How to implement a database job queue using SKIP LOCKED offers an example how to implement such a thing with Java and Hibernate using SKIP LOCKED.

JanCizmar commented 11 months ago

When implementing the feature, i was inspired by the skip locked approach, but I also combined it with fast state cache in redis and pub-sub in redis. So it's not very probable instances would try to work on the same chunks, since when something is removed from local queue, it's propagated to other instances via pub/sub.

That said, I think fetching only the required data and removing the manual trigger on app start should be enough. Right?

siepkes commented 11 months ago

That said, I think fetching only the required data and removing the manual trigger on app start should be enough. Right?

Yes, I think so!

JanCizmar commented 11 months ago

Are there some volunteers, who would like to fix it? ๐Ÿ˜…

siepkes commented 11 months ago

We already have a prototype fix so @GeertZondervan will create a PR!

siepkes commented 11 months ago

@JanCizmar We created a prototype which prevents the loading of the entire object and ensures the job is only started from the scheduler (and not manually). While this fixes the initial problems it uncovers more problems.

For example, while the initial job retrieval (from the DB to local cache) is now fast (takes milliseconds to complete) when a second retrieval is started it takes about 6 minutes to complete. This is being caused by Hibernate somehow, not by the DB. At this point the entire application freezes (the app works normal before this second retrieval). So it seems something else is also kicking-off this code and then suddenly it's long running work and blocking some main thread (some Kotlin co-routine executor thread perhaps?).

but I also combined it with fast state cache in redis and pub-sub in redis.

Could you explain what the benefit of the local cache is? It complicates the queue implementation a lot. The way I see it no query is actually prevented because when running multiple Tolgee nodes each node will still need to check if no other node is actually executing the task. Meaning that it needs to run a query at the start of the task? So if that's the case then why not only simply just SKIP LOCKED?

What makes the issue(s) hard to troubleshoot for us is that there is almost no documentation (not on it's conceptual design, nor on the implementation of the design). Without documentation or the on-hand knowledge of the queue implementation this rabbit hole seems a bit to deep for us.

JanCizmar commented 11 months ago

Hey!

Thanks for the feedback. I'll try to provide some docs. To fix this fast, can you provide some the prototype with some examaple data so I can reproduce it fast and possibly fix it?

The benefit of the implementation is to save the database resources. With pure skip locked it would either unbalanced (so single node would handle a lot of chunks while other nodes would be sleeping) or it would execute too many queries to database. While syncing the queues using Redis pub/sub and using Redis as a storage (on multiple node setup) seems like optimal compromise. in the same time I understand it's hard to read and complicated.

GeertZondervan commented 11 months ago

Hello @JanCizmar

The fix I've been working on can be found here

We have created a project with multiple languages and the following language settings; image

Note the Auto Translate Imported Items, which causes the creation of the jobs.

Now when we import one of our XLIFF files with about 5000 trans-units. it will create about 5000 rows in the tolgee_batch_job_chunk_execution table and a row in the tolgee_batch_job table with an enormous value in the target (jsonb) column. I suspect this enormous target column is the reason for the original Heap space issue. Which is solved by the changes I made in my branch. Unfortunately we ran into the problems described by @siepkes in his previous post.

JanCizmar commented 11 months ago

OK! Thanks a lot! I'll try to create some sample large xliff and will provide a fix soon!

JanCizmar commented 11 months ago

Created this PR including your changes. I've generated large XLIFF and not it works correctly for me, can you please check? I'll try to go a bit further and optimize the import & batch operations a bit more.

https://github.com/tolgee/tolgee-platform/pull/2025

JanCizmar commented 11 months ago

Hey! I've started to optimize the import process even more, which apparently led to a lot of issues. However, at this commit it's stable. So if you would like to test it with your data, use this one:

https://github.com/tolgee/tolgee-platform/pull/2025/commits/bb5a04773cba1da4bddfaea0fa45ea505ac2e5b2

JanCizmar commented 11 months ago

So I enjoyed pretty nice weekend Hibernate adventure. I went from trying to optimize Hibernate operations by keepeing the context smal, moving to StatelessSession and ignoring the context at all, but what I found the fastest was finally executing the queries via the sql connection prepareStatement and executeBatch directly.

I will continue by providing some reliable loading feedback in the UI and optimizing the UI so it's not so stressful for users to import data.

GeertZondervan commented 11 months ago

Hey! I've started to optimize the import process even more, which apparently led to a lot of issues. However, at this commit it's stable. So if you would like to test it with your data, use this one:

bb5a047

Hello! I tried testing your PR, using the 9a1017d as the latest commit. Unfortunately this does not seem to fix the issue for us. Just like with my original PR, the Heap Size problem is fixed, but then it gets stuck on hibernate doing a lot of work. The actual query in the populateQueue() method returns fast, but then hibernate does some loading of entity/values which is very slow, and blocks the entire application.

2023-12-18T14:51:38.295+01:00 DEBUG 774581 --- [MessageBroker-2] i.t.batch.BatchJobChunkExecutionQueue    : Running scheduled populate queue
2023-12-18T14:51:39.580+01:00 DEBUG 774581 --- [MessageBroker-2] tor$SharedEntityManagerInvocationHandler : Creating new EntityManager for shared EntityManager invocation
2023-12-18T14:51:49.853+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.sql.exec               : Skipping reading Query result cache data: cache-enabled = false, cache-mode = NORMAL
2023-12-18T14:51:51.135+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Initializer list
2023-12-18T14:51:52.404+01:00 DEBUG 774581 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : Start expire sessions StandardManager at 1702907511135 sessioncount 0
2023-12-18T14:51:53.620+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.SQL                        : select bjce1_0.id,bjce1_0.batch_job_id,bjce1_0.execute_after,bj1_0.job_character from tolgee_batch_job_chunk_execution bjce1_0 join tolgee_batch_job bj1_0 on bj1_0.id=bjce1_0.batch_job_id where bjce1_0.status=? order by bjce1_0.created_at asc nulls first,bjce1_0.execute_after asc nulls first,bjce1_0.id asc nulls first
2023-12-18T14:51:53.620+01:00 DEBUG 774581 --- [alina-utility-1] o.apache.catalina.session.ManagerBase    : End expire sessions StandardManager processingTime 1270 expired sessions: 0
2023-12-18T14:51:55.633+01:00 DEBUG 774581 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Pool stats (total=10, active=3, idle=7, waiting=0)
2023-12-18T14:51:55.634+01:00 DEBUG 774581 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Fill pool skipped, pool has sufficient level or currently being filled (queueDepth=0).
2023-12-18T14:52:01.507+01:00 DEBUG 774581 --- [nio-80-Acceptor] o.apache.tomcat.util.threads.LimitLatch  : Counting up[http-nio-80-Acceptor] latch=1
2023-12-18T14:52:02.778+01:00 DEBUG 774581 --- [nio-80-Acceptor] o.apache.tomcat.util.threads.LimitLatch  : Counting up[http-nio-80-Acceptor] latch=2
2023-12-18T14:52:07.225+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results.loading        : Calling top-level assembler (0 / 1) : org.hibernate.sql.results.graph.instantiation.internal.DynamicInstantiationAssemblerConstructorImpl@7b0071c1
2023-12-18T14:52:07.225+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [0] - [1000419001]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [1] - [1000418001]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [2] - [null]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [3] - [SLOW]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results.loading        : Calling top-level assembler (0 / 1) : org.hibernate.sql.results.graph.instantiation.internal.DynamicInstantiationAssemblerConstructorImpl@7b0071c1
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [0] - [1000419002]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [1] - [1000418001]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [2] - [null]
2023-12-18T14:52:07.226+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [3] - [SLOW]
...
...
...
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results.loading        : Calling top-level assembler (0 / 1) : org.hibernate.sql.results.graph.instantiation.internal.DynamicInstantiationAssemblerConstructorImpl@7b0071c1
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [0] - [1000419327]
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [1] - [1000418001]
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [2] - [null]
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [3] - [SLOW]
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results.loading        : Calling top-level assembler (0 / 1) : org.hibernate.sql.results.graph.instantiation.internal.DynamicInstantiationAssemblerConstructorImpl@7b0071c1
2023-12-18T14:54:48.508+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [0] - [1000419328]
2023-12-18T14:54:50.514+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [1] - [1000418001]
2023-12-18T14:54:51.777+01:00 DEBUG 774581 --- [MessageBroker-2] org.hibernate.orm.results                : Extracted JDBC value [2] - [null]
...
...

For some reason the initial scheduled run does not cause any problems, the queue is filled quickly, but the next calls to populateQueue() are very slow.

So I enjoyed pretty nice weekend Hibernate adventure. I went from trying to optimize Hibernate operations by keepeing the context smal, moving to StatelessSession and ignoring the context at all, but what I found the fastest was finally executing the queries via the sql connection prepareStatement and executeBatch directly.

  • Activity is not stored using the batch statements and it doesn't go through hibernate at all
  • The same happens for new batch job creation
  • I optimized the way auto translation job is created, so it doesn't create targets for base language and not configured languages
  • Generally, the import is much faster and lightweight now
  • I was testing it using the profier included in IntelliJ IDEA and I haven't encountered any suspicious memory allocation behaviour
  • I was testing it appended this generated large XLIFF file translations.xliff.zip

I will continue by providing some reliable loading feedback in the UI and optimizing the UI so it's not so stressful for users to import data.

Sounds great! Looking forward to it.

JanCizmar commented 11 months ago

Does it also happen with my xliff? Can you also share your language settings? Can you also share your configuration?

EDIT: Sorry, lot of questions, but I really need to know ๐Ÿ˜…

siepkes commented 11 months ago

I don't think there is much going on in our settings. This is our applications.properties:

logging.level.root=INFO

server.port=80

tolgee.postgres-autostart.enabled=false
spring.datasource.url=jdbc:postgresql://postgresql-13-misc-1.inst.sp-prod.nl1.cns.serviceplanet.nl,postgresql-13-misc-2.inst.sp-prod.nl1.cns.serviceplanet.nl,postgresql-13-misc-3.inst.sp-prod.nl1.cns.serviceplanet.nl/postgres?targetServerType=primary&hostRecheckSeconds=5
spring.datasource.username=[REDACTED]
spring.datasource.password=[REDACTED]

tolgee.front-end-url=https://tolgee.serviceplanet.nl
tolgee.content-delivery.public-url-prefix=https://tolgee.serviceplanet.nl

tolgee.cache.enabled=true
tolgee.cache.caffeine-max-size=50000
tolgee.cache.use-redis=false

tolgee.authentication.enabled=true
# Do not allow self-registration.
tolgee.authentication.native-enabled=false
tolgee.authentication.registrations-allowed=false
tolgee.authentication.secured-image-retrieval=true
tolgee.authentication.initial-username=[REDACTED]
tolgee.authentication.initial-password=[REDACTED]
tolgee.authentication.jwt-secret=[REDACTED]
tolgee.user-can-create-organizations=false

tolgee.machine-translation.free-credits-amount=-1
tolgee.machine-translation.google.api-key=[REDACTED]
tolgee.machine-translation.google.default-primary=true
tolgee.machine-translation.google.default-enabled=true

# Google login
tolgee.authentication.google.client-id=[REDACTED]
tolgee.authentication.google.client-secret=[REDACTED]
tolgee.authentication.google.workspace-domain=serviceplanet.nl

tolgee.telemetry.enabled=false

tolgee.smtp.auth=false
tolgee.smtp.from=Tolgee <tolgee@serviceplanet.nl>
tolgee.smtp.host=[REDACTED]
tolgee.smtp.port=25
tolgee.smtp.ssl-enabled=false
#tolgee.smtp.username=user@company.com
#tolgee.smtp.password="omg/my/password"

Notice that we don't use Redis.

This is how we start Tolgee:

/opt/local/java/openjdk17/bin/java \
  -Dnetworkaddress.cache.ttl=5 \
  -Dnetworkaddress.cache.negative.ttl=5 \
  -XX:ActiveProcessorCount=3 \
  -XX:+UseShenandoahGC \
  -Xms2G \
  -Xmx2G \
  -jar /opt/tolgee/tolgee.jar \
  --spring.config.additional-location=/etc/tolgee/application.properties

Both your example XLIFF and our larger XLIFF still trigger the issue that the web interface becomes unresponsive and Tolgee takes about 7 minutes to load the the data in to Hibernate. We tested with Tolgee 6005b8c4671673864369f440de3a0f953d13c5be.

JanCizmar commented 11 months ago

Thanks for the information. Anyway, I cannot reproduce it locally, so I really need some minimal steps to reproduce it locally on a fresh Tolgee run.

siepkes commented 11 months ago

Alright let me get my ducks in a row first, I'll get back to you!

siepkes commented 11 months ago

@JanCizmar Quick question; Did you test with real machine translation and multiple languages (9+) in a project? Or are you testing with the mock machine translation?

JanCizmar commented 11 months ago

I am testing it with mock. Base and 2 languages.

siepkes commented 11 months ago

Allright here we go, bear with me here because there are quite some steps ;-)

Assuming you already have a running PostgreSQL instance. This test was done with PostgreSQL 13. Create a clean test database :

DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO public;
REVOKE ALL PRIVILEGES ON DATABASE postgres FROM tolgee;
DROP ROLE tolgee;
CREATE ROLE tolgee WITH LOGIN ENCRYPTED PASSWORD 'ZaKxwAXKNxd7CAYF7AoNhVr0WusIGUZuPw40';
GRANT ALL PRIVILEGES ON DATABASE postgres TO tolgee;
CREATE EXTENSION IF NOT EXISTS pg_trgm SCHEMA public;
CREATE EXTENSION IF NOT EXISTS unaccent WITH SCHEMA public;

Next we will build and run Tolgee. We will make a small modification to the Google Machine translation provider. The provider no longer actually calls the Google API, it just sleeps for a second and returns the original string. This makes it a more realistic test then the mock one, since responses over network aren't instant:

$ git clone https://github.com/tolgee/tolgee-platform.git tolgee-queue-test
$ cd tolgee-queue-test
$ git checkout v3.41.3
$ cat > ./mock-google.patch << EOF
---
 .../providers/GoogleTranslationProvider.kt                | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/backend/data/src/main/kotlin/io/tolgee/component/machineTranslation/providers/GoogleTranslationProvider.kt b/backend/data/src/main/kotlin/io/tolgee/component/machineTranslation/providers/GoogleTranslationProvider.kt
index 27be746c..f2df77c8 100644
--- a/backend/data/src/main/kotlin/io/tolgee/component/machineTranslation/providers/GoogleTranslationProvider.kt
+++ b/backend/data/src/main/kotlin/io/tolgee/component/machineTranslation/providers/GoogleTranslationProvider.kt
@@ -17,12 +17,8 @@ class GoogleTranslationProvider(
     get() = !googleMachineTranslationProperties.apiKey.isNullOrEmpty()

   override fun translateViaProvider(params: ProviderTranslateParams): MtValueProvider.MtResult {
-    val result = translateService.translate(
-      params.text,
-      Translate.TranslateOption.sourceLanguage(params.sourceLanguageTag),
-      Translate.TranslateOption.targetLanguage(params.targetLanguageTag),
-      Translate.TranslateOption.format("text")
-    ).translatedText
+    Thread.sleep(1000)
+    val result = params.text + " - Google (not) translated..."
     return MtValueProvider.MtResult(
       result,
       params.text.length * 100
-- 
EOF
$ patch -p1 < ./mock-google.patch
$ cat > application.yaml <<EOF
spring:
  jpa:
    show-sql: true
  datasource.url: "jdbc:postgresql://127.0.0.1:5432/postgres"
  datasource.username: "tolgee"
  datasource.password: "ZaKxwAXKNxd7CAYF7AoNhVr0WusIGUZuPw40"
tolgee:
  postgres-autostart:
    enabled: false
  billing:
    enabled: false
  authentication:
    enabled: true
    needs-email-verification: true
    registrations-allowed: true
    initial-password: "admin"
  cache:
    enabled: true
    use-redis: false
    default-ttl: 120000
    caffeine-max-size: 1000000
  machine-translation:
    free-credits-amount: -1
    google:
      api-key: mock
  telemetry:
    enabled: false
server:
  port: 8080
  error:
    include-exception: true
    include-stacktrace: always
EOF
$ rm -Rf build
$ ./gradlew packResources
$ java -Xmx2G -jar ./build/libs/tolgee.jar --spring.config.additional-location=./application.yaml

Tolgee now runs. Navigate to http://127.0.0.1:8080 and login with admin / admin.

Create a project called "Sandbox" and add the following languages:

Remove English (en), set the "Base Language" to Dutch (Netherlands) and click "Save".

Goto "Languages". Open the settings for Machine translations and ensure the following options are enabled:

Goto "Import" and select the large XLIFF test file (translations.xliff.zip) for import:

Import the file.

Wait a bit (couple of minutes).

The application will become unresponsive. For example logging in doesn't work anymore.

siepkes commented 11 months ago

Interestingly enough with the above reproducer instructions in the end Tolgee 3.41.3 also runs out of heap space (even when running with -Xmx2G):

Hibernate: select bj1_0.id,bj1_0.author_id,bj1_0.chunk_size,bj1_0.created_at,bj1_0.debounce_duration_in_ms,bj1_0.debounce_max_wait_time_in_ms,bj1_0.debouncing_key,bj1_0.hidden,bj1_0.job_character,bj1_0.last_debouncing_event,bj1_0.max_per_job_concurrency,bj1_0.params,bj1_0.project_id,bj1_0.status,bj1_0.target,bj1_0.total_chunks,bj1_0.total_items,bj1_0.type,bj1_0.updated_at from tolgee_batch_job bj1_0 where bj1_0.id in (?) and bj1_0.status in (?,?,?,?) and bj1_0.updated_at<?
Hibernate: select bj1_0.id,bj1_0.author_id,bj1_0.chunk_size,bj1_0.created_at,bj1_0.debounce_duration_in_ms,bj1_0.debounce_max_wait_time_in_ms,bj1_0.debouncing_key,bj1_0.hidden,bj1_0.job_character,bj1_0.last_debouncing_event,bj1_0.max_per_job_concurrency,bj1_0.params,bj1_0.project_id,bj1_0.status,bj1_0.target,bj1_0.total_chunks,bj1_0.total_items,bj1_0.type,bj1_0.updated_at from tolgee_batch_job bj1_0 join tolgee_batch_job_chunk_execution bjce1_0 on bjce1_0.batch_job_id=bj1_0.id where bj1_0.id in (?) group by bj1_0.id having max(bjce1_0.updated_at)<?
2023-12-20T12:28:42.489+01:00  WARN 46977 --- [essageBroker-28] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 0, SQLState: 53200
2023-12-20T12:28:42.489+01:00 ERROR 46977 --- [essageBroker-28] o.h.engine.jdbc.spi.SqlExceptionHelper   : Ran out of memory retrieving query results.
2023-12-20T12:28:42.669+01:00 ERROR 46977 --- [essageBroker-28] o.s.s.s.TaskUtils$LoggingErrorHandler    : Unexpected error occurred in scheduled task

org.springframework.orm.jpa.JpaSystemException: JDBC exception executing SQL [select bj1_0.id,bj1_0.author_id,bj1_0.chunk_size,bj1_0.created_at,bj1_0.debounce_duration_in_ms,bj1_0.debounce_max_wait_time_in_ms,bj1_0.debouncing_key,bj1_0.hidden,bj1_0.job_character,bj1_0.last_debouncing_event,bj1_0.max_per_job_concurrency,bj1_0.params,bj1_0.project_id,bj1_0.status,bj1_0.target,bj1_0.total_chunks,bj1_0.total_items,bj1_0.type,bj1_0.updated_at from tolgee_batch_job bj1_0 join tolgee_batch_job_chunk_execution bjce1_0 on bjce1_0.batch_job_id=bj1_0.id where bj1_0.id in (?) group by bj1_0.id having max(bjce1_0.updated_at)<?] [Ran out of memory retrieving query results.] [n/a]
    at org.springframework.orm.jpa.vendor.HibernateJpaDialect.convertHibernateAccessException(HibernateJpaDialect.java:320) ~[spring-orm-6.0.13.jar!/:6.0.13]
    at org.springframework.orm.jpa.vendor.HibernateJpaDialect.translateExceptionIfPossible(HibernateJpaDialect.java:229) ~[spring-orm-6.0.13.jar!/:6.0.13]
    at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.translateExceptionIfPossible(AbstractEntityManagerFactoryBean.java:550) ~[spring-orm-6.0.13.jar!/:6.0.13]
    at org.springframework.dao.support.ChainedPersistenceExceptionTranslator.translateExceptionIfPossible(ChainedPersistenceExceptionTranslator.java:61) ~[spring-tx-6.0.13.jar!/:6.0.13]
    at org.springframework.dao.support.DataAccessUtils.translateIfNecessary(DataAccessUtils.java:243) ~[spring-tx-6.0.13.jar!/:6.0.13]
    at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:152) ~[spring-tx-6.0.13.jar!/:6.0.13]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.data.jpa.repository.support.CrudMethodMetadataPostProcessor$CrudMethodMetadataPopulatingMethodInterceptor.invoke(CrudMethodMetadataPostProcessor.java:135) ~[spring-data-jpa-3.1.5.jar!/:3.1.5]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.data.repository.core.support.MethodInvocationValidator.invoke(MethodInvocationValidator.java:94) ~[spring-data-commons-3.1.5.jar!/:3.1.5]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:244) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at jdk.proxy2/jdk.proxy2.$Proxy293.getStuckJobs(Unknown Source) ~[na:na]
    at io.tolgee.batch.BatchJobService.getStuckJobs(BatchJobService.kt:401) ~[data-local.jar!/:na]
    at jdk.internal.reflect.GeneratedMethodAccessor221.invoke(Unknown Source) ~[na:na]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:568) ~[na:na]
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:343) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:699) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at io.tolgee.batch.BatchJobService$$SpringCGLIB$$1.getStuckJobs(<generated>) ~[data-local.jar!/:na]
    at io.tolgee.batch.ScheduledJobCleaner.handleStuckJobs(ScheduledJobCleaner.kt:33) ~[data-local.jar!/:na]
    at io.tolgee.batch.ScheduledJobCleaner.cleanup(ScheduledJobCleaner.kt:28) ~[data-local.jar!/:na]
    at jdk.internal.reflect.GeneratedMethodAccessor219.invoke(Unknown Source) ~[na:na]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:568) ~[na:na]
    at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-6.0.13.jar!/:6.0.13]
    at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-6.0.13.jar!/:6.0.13]
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[na:na]
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[na:na]
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
    at java.base/java.lang.Thread.run(Thread.java:840) ~[na:na]
Caused by: org.hibernate.exception.GenericJDBCException: JDBC exception executing SQL [select bj1_0.id,bj1_0.author_id,bj1_0.chunk_size,bj1_0.created_at,bj1_0.debounce_duration_in_ms,bj1_0.debounce_max_wait_time_in_ms,bj1_0.debouncing_key,bj1_0.hidden,bj1_0.job_character,bj1_0.last_debouncing_event,bj1_0.max_per_job_concurrency,bj1_0.params,bj1_0.project_id,bj1_0.status,bj1_0.target,bj1_0.total_chunks,bj1_0.total_items,bj1_0.type,bj1_0.updated_at from tolgee_batch_job bj1_0 join tolgee_batch_job_chunk_execution bjce1_0 on bjce1_0.batch_job_id=bj1_0.id where bj1_0.id in (?) group by bj1_0.id having max(bjce1_0.updated_at)<?] [Ran out of memory retrieving query results.] [n/a]
    at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:63) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:108) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:94) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.jdbc.internal.DeferredResultSetAccess.executeQuery(DeferredResultSetAccess.java:258) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.jdbc.internal.DeferredResultSetAccess.getResultSet(DeferredResultSetAccess.java:164) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.jdbc.internal.JdbcValuesResultSetImpl.advanceNext(JdbcValuesResultSetImpl.java:218) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.jdbc.internal.JdbcValuesResultSetImpl.processNext(JdbcValuesResultSetImpl.java:98) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.jdbc.internal.AbstractJdbcValues.next(AbstractJdbcValues.java:19) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.internal.RowProcessingStateStandardImpl.next(RowProcessingStateStandardImpl.java:66) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.spi.ListResultsConsumer.consume(ListResultsConsumer.java:180) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.results.spi.ListResultsConsumer.consume(ListResultsConsumer.java:33) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.exec.internal.JdbcSelectExecutorStandardImpl.doExecuteQuery(JdbcSelectExecutorStandardImpl.java:361) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.exec.internal.JdbcSelectExecutorStandardImpl.executeQuery(JdbcSelectExecutorStandardImpl.java:168) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.exec.internal.JdbcSelectExecutorStandardImpl.list(JdbcSelectExecutorStandardImpl.java:93) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.sql.exec.spi.JdbcSelectExecutor.list(JdbcSelectExecutor.java:31) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan.lambda$new$0(ConcreteSqmSelectQueryPlan.java:109) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan.withCacheableSqmInterpretation(ConcreteSqmSelectQueryPlan.java:305) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan.performList(ConcreteSqmSelectQueryPlan.java:246) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.query.sqm.internal.QuerySqmImpl.doList(QuerySqmImpl.java:509) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.query.spi.AbstractSelectionQuery.list(AbstractSelectionQuery.java:427) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at org.hibernate.query.Query.getResultList(Query.java:120) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    at jdk.internal.reflect.GeneratedMethodAccessor166.invoke(Unknown Source) ~[na:na]
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
    at java.base/java.lang.reflect.Method.invoke(Method.java:568) ~[na:na]
    at org.springframework.orm.jpa.SharedEntityManagerCreator$DeferredQueryInvocationHandler.invoke(SharedEntityManagerCreator.java:405) ~[spring-orm-6.0.13.jar!/:6.0.13]
    at jdk.proxy2/jdk.proxy2.$Proxy332.getResultList(Unknown Source) ~[na:na]
    at org.springframework.data.jpa.repository.query.JpaQueryExecution$CollectionExecution.doExecute(JpaQueryExecution.java:129) ~[spring-data-jpa-3.1.5.jar!/:3.1.5]
    at org.springframework.data.jpa.repository.query.JpaQueryExecution.execute(JpaQueryExecution.java:92) ~[spring-data-jpa-3.1.5.jar!/:3.1.5]
    at org.springframework.data.jpa.repository.query.AbstractJpaQuery.doExecute(AbstractJpaQuery.java:149) ~[spring-data-jpa-3.1.5.jar!/:3.1.5]
    at org.springframework.data.jpa.repository.query.AbstractJpaQuery.execute(AbstractJpaQuery.java:137) ~[spring-data-jpa-3.1.5.jar!/:3.1.5]
    at org.springframework.data.repository.core.support.RepositoryMethodInvoker.doInvoke(RepositoryMethodInvoker.java:136) ~[spring-data-commons-3.1.5.jar!/:3.1.5]
    at org.springframework.data.repository.core.support.RepositoryMethodInvoker.invoke(RepositoryMethodInvoker.java:120) ~[spring-data-commons-3.1.5.jar!/:3.1.5]
    at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.doInvoke(QueryExecutorMethodInterceptor.java:164) ~[spring-data-commons-3.1.5.jar!/:3.1.5]
    at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.invoke(QueryExecutorMethodInterceptor.java:143) ~[spring-data-commons-3.1.5.jar!/:3.1.5]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.data.projection.DefaultMethodInvokingMethodInterceptor.invoke(DefaultMethodInvokingMethodInterceptor.java:72) ~[spring-data-commons-3.1.5.jar!/:3.1.5]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) ~[spring-tx-6.0.13.jar!/:6.0.13]
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:391) ~[spring-tx-6.0.13.jar!/:6.0.13]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) ~[spring-tx-6.0.13.jar!/:6.0.13]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184) ~[spring-aop-6.0.13.jar!/:6.0.13]
    at org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:137) ~[spring-tx-6.0.13.jar!/:6.0.13]
    ... 29 common frames omitted
Caused by: org.postgresql.util.PSQLException: Ran out of memory retrieving query results.
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2372) ~[postgresql-42.6.0.jar!/:42.6.0]
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368) ~[postgresql-42.6.0.jar!/:42.6.0]
    at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:498) ~[postgresql-42.6.0.jar!/:42.6.0]
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:415) ~[postgresql-42.6.0.jar!/:42.6.0]
    at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:190) ~[postgresql-42.6.0.jar!/:42.6.0]
    at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:134) ~[postgresql-42.6.0.jar!/:42.6.0]
    at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52) ~[HikariCP-5.0.1.jar!/:na]
    at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java) ~[HikariCP-5.0.1.jar!/:na]
    at org.hibernate.sql.results.jdbc.internal.DeferredResultSetAccess.executeQuery(DeferredResultSetAccess.java:240) ~[hibernate-core-6.3.1.Final.jar!/:6.3.1.Final]
    ... 67 common frames omitted
Caused by: java.lang.OutOfMemoryError: Java heap space

This is what it looks like in Eclipse MAT:

Screenshot from 2023-12-20 12-33-06

This is the stack trace from Eclipse MAT:

DefaultDispatcher-worker-1
  at sun.nio.ch.SocketDispatcher.read0(Ljava/io/FileDescriptor;JI)I (SocketDispatcher.java(Native Method))
  at sun.nio.ch.SocketDispatcher.read(Ljava/io/FileDescriptor;JI)I (SocketDispatcher.java:47)
  at sun.nio.ch.NioSocketImpl.tryRead(Ljava/io/FileDescriptor;[BII)I (NioSocketImpl.java:266)
  at sun.nio.ch.NioSocketImpl.implRead([BII)I (NioSocketImpl.java:317)
  at sun.nio.ch.NioSocketImpl.read([BII)I (NioSocketImpl.java:355)
  at sun.nio.ch.NioSocketImpl$1.read([BII)I (NioSocketImpl.java:808)
  at java.net.Socket$SocketInputStream.read([BII)I (Socket.java:966)
  at java.io.InputStream.skip(J)J (InputStream.java:547)
  at org.postgresql.core.VisibleBufferedInputStream.skip(J)J (VisibleBufferedInputStream.java:278)
  at org.postgresql.core.PGStream.skip(I)V (PGStream.java:669)
  at org.postgresql.core.PGStream.receiveTupleV3()Lorg/postgresql/core/Tuple; (PGStream.java:621)
  at org.postgresql.core.v3.QueryExecutorImpl.processResults(Lorg/postgresql/core/ResultHandler;IZ)V (QueryExecutorImpl.java:2368)
  at org.postgresql.core.v3.QueryExecutorImpl.execute(Lorg/postgresql/core/Query;Lorg/postgresql/core/ParameterList;Lorg/postgresql/core/ResultHandler;IIIZ)V (QueryExecutorImpl.java:368)
  at org.postgresql.jdbc.PgStatement.executeInternal(Lorg/postgresql/core/CachedQuery;Lorg/postgresql/core/ParameterList;I)V (PgStatement.java:498)
  at org.postgresql.jdbc.PgStatement.execute(Lorg/postgresql/core/CachedQuery;Lorg/postgresql/core/ParameterList;I)V (PgStatement.java:415)
  at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(I)Z (PgPreparedStatement.java:190)
  at org.postgresql.jdbc.PgPreparedStatement.executeQuery()Ljava/sql/ResultSet; (PgPreparedStatement.java:134)
  at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery()Ljava/sql/ResultSet; (ProxyPreparedStatement.java:52)
  at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery()Ljava/sql/ResultSet; (HikariProxyPreparedStatement.java)
  at org.hibernate.sql.results.jdbc.internal.DeferredResultSetAccess.executeQuery()V (DeferredResultSetAccess.java:240)
  at org.hibernate.sql.results.jdbc.internal.DeferredResultSetAccess.getResultSet()Ljava/sql/ResultSet; (DeferredResultSetAccess.java:164)
  at org.hibernate.sql.results.jdbc.internal.JdbcValuesResultSetImpl.advanceNext()Z (JdbcValuesResultSetImpl.java:218)
  at org.hibernate.sql.results.jdbc.internal.JdbcValuesResultSetImpl.processNext(Lorg/hibernate/sql/results/jdbc/spi/RowProcessingState;)Z (JdbcValuesResultSetImpl.java:98)
  at org.hibernate.sql.results.jdbc.internal.AbstractJdbcValues.next(Lorg/hibernate/sql/results/jdbc/spi/RowProcessingState;)Z (AbstractJdbcValues.java:19)
  at org.hibernate.sql.results.internal.RowProcessingStateStandardImpl.next()Z (RowProcessingStateStandardImpl.java:66)
  at org.hibernate.sql.results.spi.ListResultsConsumer.consume(Lorg/hibernate/sql/results/jdbc/spi/JdbcValues;Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/sql/results/jdbc/spi/JdbcValuesSourceProcessingOptions;Lorg/hibernate/sql/results/jdbc/internal/JdbcValuesSourceProcessingStateStandardImpl;Lorg/hibernate/sql/results/internal/RowProcessingStateStandardImpl;Lorg/hibernate/sql/results/spi/RowReader;)Ljava/util/List; (ListResultsConsumer.java:200)
  at org.hibernate.sql.results.spi.ListResultsConsumer.consume(Lorg/hibernate/sql/results/jdbc/spi/JdbcValues;Lorg/hibernate/engine/spi/SharedSessionContractImplementor;Lorg/hibernate/sql/results/jdbc/spi/JdbcValuesSourceProcessingOptions;Lorg/hibernate/sql/results/jdbc/internal/JdbcValuesSourceProcessingStateStandardImpl;Lorg/hibernate/sql/results/internal/RowProcessingStateStandardImpl;Lorg/hibernate/sql/results/spi/RowReader;)Ljava/lang/Object; (ListResultsConsumer.java:33)
  at org.hibernate.sql.exec.internal.JdbcSelectExecutorStandardImpl.doExecuteQuery(Lorg/hibernate/sql/exec/spi/JdbcOperationQuerySelect;Lorg/hibernate/sql/exec/spi/JdbcParameterBindings;Lorg/hibernate/sql/exec/spi/ExecutionContext;Lorg/hibernate/sql/results/spi/RowTransformer;Ljava/lang/Class;Ljava/util/function/Function;Lorg/hibernate/sql/results/spi/ResultsConsumer;)Ljava/lang/Object; (JdbcSelectExecutorStandardImpl.java:361)
  at org.hibernate.sql.exec.internal.JdbcSelectExecutorStandardImpl.executeQuery(Lorg/hibernate/sql/exec/spi/JdbcOperationQuerySelect;Lorg/hibernate/sql/exec/spi/JdbcParameterBindings;Lorg/hibernate/sql/exec/spi/ExecutionContext;Lorg/hibernate/sql/results/spi/RowTransformer;Ljava/lang/Class;Ljava/util/function/Function;Lorg/hibernate/sql/results/spi/ResultsConsumer;)Ljava/lang/Object; (JdbcSelectExecutorStandardImpl.java:168)
  at org.hibernate.sql.exec.internal.JdbcSelectExecutorStandardImpl.list(Lorg/hibernate/sql/exec/spi/JdbcOperationQuerySelect;Lorg/hibernate/sql/exec/spi/JdbcParameterBindings;Lorg/hibernate/sql/exec/spi/ExecutionContext;Lorg/hibernate/sql/results/spi/RowTransformer;Ljava/lang/Class;Lorg/hibernate/sql/results/spi/ListResultsConsumer$UniqueSemantic;)Ljava/util/List; (JdbcSelectExecutorStandardImpl.java:93)
  at org.hibernate.sql.exec.spi.JdbcSelectExecutor.list(Lorg/hibernate/sql/exec/spi/JdbcOperationQuerySelect;Lorg/hibernate/sql/exec/spi/JdbcParameterBindings;Lorg/hibernate/sql/exec/spi/ExecutionContext;Lorg/hibernate/sql/results/spi/RowTransformer;Lorg/hibernate/sql/results/spi/ListResultsConsumer$UniqueSemantic;)Ljava/util/List; (JdbcSelectExecutor.java:31)
  at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan.lambda$new$0(Ljava/lang/String;Lorg/hibernate/sql/results/spi/ListResultsConsumer$UniqueSemantic;Lorg/hibernate/query/sqm/internal/DomainParameterXref;Ljava/lang/Void;Lorg/hibernate/query/spi/DomainQueryExecutionContext;Lorg/hibernate/query/sqm/internal/ConcreteSqmSelectQueryPlan$CacheableSqmInterpretation;Lorg/hibernate/sql/exec/spi/JdbcParameterBindings;)Ljava/util/List; (ConcreteSqmSelectQueryPlan.java:109)
  at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan$$Lambda$2279+0x00007f9cd91c66a0.interpret(Ljava/lang/Object;Lorg/hibernate/query/spi/DomainQueryExecutionContext;Lorg/hibernate/query/sqm/internal/ConcreteSqmSelectQueryPlan$CacheableSqmInterpretation;Lorg/hibernate/sql/exec/spi/JdbcParameterBindings;)Ljava/lang/Object; ()
  at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan.withCacheableSqmInterpretation(Lorg/hibernate/query/spi/DomainQueryExecutionContext;Ljava/lang/Object;Lorg/hibernate/query/sqm/internal/ConcreteSqmSelectQueryPlan$SqmInterpreter;)Ljava/lang/Object; (ConcreteSqmSelectQueryPlan.java:305)
  at org.hibernate.query.sqm.internal.ConcreteSqmSelectQueryPlan.performList(Lorg/hibernate/query/spi/DomainQueryExecutionContext;)Ljava/util/List; (ConcreteSqmSelectQueryPlan.java:246)
  at org.hibernate.query.sqm.internal.QuerySqmImpl.doList()Ljava/util/List; (QuerySqmImpl.java:509)
  at org.hibernate.query.spi.AbstractSelectionQuery.list()Ljava/util/List; (AbstractSelectionQuery.java:427)
  at org.hibernate.query.Query.getResultList()Ljava/util/List; (Query.java:120)
  at jdk.internal.reflect.GeneratedMethodAccessor166.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; ()
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:568)
  at org.springframework.orm.jpa.SharedEntityManagerCreator$DeferredQueryInvocationHandler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (SharedEntityManagerCreator.java:405)
  at jdk.proxy2.$Proxy332.getResultList()Ljava/util/List; ()
  at io.tolgee.batch.BatchJobService.getAllUnlockedChunksForJobs(Ljava/lang/Iterable;)Ljava/util/List; (BatchJobService.kt:308)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (NativeMethodAccessorImpl.java(Native Method))
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:568)
  at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (AopUtils.java:343)
  at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;Lorg/springframework/cglib/proxy/MethodProxy;)Ljava/lang/Object; (CglibAopProxy.java:699)
  at io.tolgee.batch.BatchJobService$$SpringCGLIB$$1.getAllUnlockedChunksForJobs(Ljava/lang/Iterable;)Ljava/util/List; (<generated>)
  at io.tolgee.batch.BatchJobProjectLockingManager.getInitialJobId(J)Ljava/lang/Long; (BatchJobProjectLockingManager.kt:118)
  at io.tolgee.batch.BatchJobProjectLockingManager.computeFnBody(Lio/tolgee/batch/data/BatchJobDto;Ljava/lang/Long;)J (BatchJobProjectLockingManager.kt:99)
  at io.tolgee.batch.BatchJobProjectLockingManager.access$computeFnBody(Lio/tolgee/batch/BatchJobProjectLockingManager;Lio/tolgee/batch/data/BatchJobDto;Ljava/lang/Long;)J (BatchJobProjectLockingManager.kt:19)
  at io.tolgee.batch.BatchJobProjectLockingManager$tryLockLocal$computed$1.invoke(Ljava/lang/Long;Ljava/lang/Long;)Ljava/lang/Long; (BatchJobProjectLockingManager.kt:81)
  at io.tolgee.batch.BatchJobProjectLockingManager$tryLockLocal$computed$1.invoke(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (BatchJobProjectLockingManager.kt:80)
  at io.tolgee.batch.BatchJobProjectLockingManager.tryLockLocal$lambda$2(Lkotlin/jvm/functions/Function2;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Long; (BatchJobProjectLockingManager.kt:80)
  at io.tolgee.batch.BatchJobProjectLockingManager$$Lambda$2805+0x00007f9cd93af8e8.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; ()
  at java.util.concurrent.ConcurrentHashMap.compute(Ljava/lang/Object;Ljava/util/function/BiFunction;)Ljava/lang/Object; (ConcurrentHashMap.java:1916)
  at io.tolgee.batch.BatchJobProjectLockingManager.tryLockLocal(Lio/tolgee/batch/data/BatchJobDto;)Z (BatchJobProjectLockingManager.kt:80)
  at io.tolgee.batch.BatchJobProjectLockingManager.tryLockJobForProject(Lio/tolgee/batch/data/BatchJobDto;)Z (BatchJobProjectLockingManager.kt:42)
  at io.tolgee.batch.BatchJobProjectLockingManager.canRunBatchJobOfExecution(J)Z (BatchJobProjectLockingManager.kt:34)
  at io.tolgee.batch.BatchJobConcurrentLauncher.handleItem(Lkotlinx/coroutines/CoroutineScope;Lio/tolgee/batch/data/ExecutionQueueItem;Lkotlin/jvm/functions/Function2;)Z (BatchJobConcurrentLauncher.kt:186)
  at io.tolgee.batch.BatchJobConcurrentLauncher.access$handleItem(Lio/tolgee/batch/BatchJobConcurrentLauncher;Lkotlinx/coroutines/CoroutineScope;Lio/tolgee/batch/data/ExecutionQueueItem;Lkotlin/jvm/functions/Function2;)Z (BatchJobConcurrentLauncher.kt:25)
  at io.tolgee.batch.BatchJobConcurrentLauncher$run$1$1.invoke()Ljava/lang/Boolean; (BatchJobConcurrentLauncher.kt:122)
  at io.tolgee.batch.BatchJobConcurrentLauncher$run$1$1.invoke()Ljava/lang/Object; (BatchJobConcurrentLauncher.kt:104)
  at io.tolgee.batch.BatchJobConcurrentLauncher.repeatForever(Lkotlin/jvm/functions/Function0;)V (BatchJobConcurrentLauncher.kt:82)
  at io.tolgee.batch.BatchJobConcurrentLauncher$run$1.invokeSuspend(Ljava/lang/Object;)Ljava/lang/Object; (BatchJobConcurrentLauncher.kt:104)
  at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(Ljava/lang/Object;)V (ContinuationImpl.kt:33)
  at kotlinx.coroutines.DispatchedTask.run()V (DispatchedTask.kt:106)
  at kotlinx.coroutines.internal.LimitedDispatcher.run()V (LimitedDispatcher.kt:42)
  at kotlinx.coroutines.scheduling.TaskImpl.run()V (Tasks.kt:95)
  at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(Lkotlinx/coroutines/scheduling/Task;)V (CoroutineScheduler.kt:570)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(Lkotlinx/coroutines/scheduling/Task;)V (CoroutineScheduler.kt:750)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker()V (CoroutineScheduler.kt:677)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run()V (CoroutineScheduler.kt:664)
JanCizmar commented 10 months ago

OK. I've got the fix. There were multiple places where we were fetching chunk executions including the fetched batch job. Since the batch job has large target, fetching it 5000 times makes the response quite big. Thanks for reporting guys! ๐Ÿ™ I hope this really finally fixes that. ๐Ÿค˜

siepkes commented 10 months ago

@JanCizmar Thanks for your efforts in fixing this issue and happy holidays! :fireworks:

GeertZondervan commented 10 months ago

@JanCizmar Just wanted to report that we do not encounter any problems with the latest release. Thanks for fixing the issue.

JanCizmar commented 10 months ago

Great news!