Open ImTangYun opened 3 years ago
similar to https://github.com/prestodb/presto/issues/11952#issuecomment-734006863 restarting workers every 5 days seems to solve for me
Our clusters become slow after only 1~2 hours, and restarting every hour is not a good option. Do you have any ideas about the root cause of the slowdown?
@ImTangYun can you take some diagnostic dumps (jmap/jstack) 10 minutes after restart vs 4 hours after restart and compare them? My guess is a memory leak
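Not from the thread, but a minimal in-process sketch of the kind of snapshot worth comparing (10 minutes vs hours after restart), using the standard `java.lang.management` beans rather than external jmap/jstack; the class name and output format are my own:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;

public class WorkerSnapshot {
    public static void main(String[] args) {
        // Heap usage snapshot: a heap that keeps growing even after
        // full GCs between snapshots points toward a leak.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap used MB: " + heap.getUsed() / (1024 * 1024));

        // Thread dump snapshot: diff successive dumps to see where
        // worker threads accumulate over time.
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        System.out.println("live threads: " + threads.length);
    }
}
```

Run it (or the equivalent `jmap -histo:live <pid>` / `jstack -l <pid>`) at both points in time and diff the outputs.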
Sounds like a good way to find the problem. I'll try it in the next few days, thanks
Does anybody know what 'io.prestosql.operator.project.GeneratedPageProjection.project(GeneratedPageProjection.java)' is doing? Why does the stack always show threads hanging here when a worker becomes slow?
@yingsu00
(also reported as https://github.com/prestosql/presto/issues/6435)
Any updates? @findepi
@ImTangYun
- Could you set the JVM properties
PerMethodRecompilationCutoff=10000
and PerBytecodeRecompilationCutoff=10000
and report whether that helps with the regression issue?
Already set, see the config above. @sopel39
Our cluster had the same problem, we restart the cluster every two weeks. @ImTangYun do you have solved the problem?
No, we restart the cluster every 2 hours 😂
@ImTangYun Is it possible to isolate this issue to a particular query? Or does it degrade after you run a mix of queries? Did you try the newest Trino version?
Hi guys, are there any updates on this issue? It seems to happen intermittently. Some workers still become slow even with the JVM parameters below already set: -XX:PerMethodRecompilationCutoff=10000 -XX:PerBytecodeRecompilationCutoff=10000
has anyone tested after https://github.com/trinodb/trino/pull/13064 fix?
It seems this is an issue caused by JDK-8243615. You can see more details here.
The default Cutoff parameters are:
```
$ java -XX:+PrintFlagsFinal -version | grep Cutoff
intx LiveNodeCountInliningCutoff    = 40000   {C2 product} {default}
intx PerBytecodeRecompilationCutoff = 200     {product} {default}
intx PerMethodRecompilationCutoff   = 400     {product} {default}
```
So, IMO, tuning these parameters only delays the slowdown but does not solve it. Maybe the only way is to fix it in the JDK.
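For anyone wanting to try the workaround anyway, the flags discussed in this thread go one per line into the worker's etc/jvm.config (values are the ones quoted above; as noted, they only raise the recompilation limits, they do not remove them):

```
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
```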
From the web UI we see that many queries are running, but worker parallelism is almost 0 and CPU usage is very low
After we restart the cluster, it performs quite well, but slows down quickly after about 1 to 2 hours. Do you know why? Does Presto need to be restarted frequently at Facebook?
Some important info: our queries are quite big; many queries scan 5+ TB of physical data. We see that the cluster slows down more slowly when queries are small
the key configs:
we have clusters with about 41 worker nodes; the hardware is:
cpu with 96 cores
512GB memory
1 ssd
10 Gigabit Ethernet
Presto version: 332; Hive connector with data stored on HDFS
jvm.config:
```
-server
-Xms450G
-Xmx450G
-Xss8M
-XX:+UseG1GC
-XX:G1HeapWastePercent=5
-XX:+ParallelRefProcEnabled
-XX:ParallelGCThreads=48
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
-DHADOOP_USER_NAME=hdfs
-Dpresto-temporarily-allow-java8=true
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-XX:ReservedCodeCacheSize=2G
-XX:+UseCodeCacheFlushing
-XX:NativeMemoryTracking=detail
-XX:+PrintCompilation
-XX:+CITime
-XX:+PrintCodeCache
-Djdk.nio.maxCachedBufferSize=4000000
-Djdk.attach.allowAttachSelf=true
-XX:G1HeapRegionSize=32M
```
Presto config.properties:
```
query.max-memory=2500GB
query.max-history=3000
experimental.reserved-pool-disabled=true
http-server.log.max-size=100MB
http-server.http.port=8286
log.max-size=100MB
node-scheduler.include-coordinator=false
node-scheduler.max-splits-increment-for-caching=300
query.low-memory-killer.delay=1m
http-server.accept-queue-size=16000
distributed-sort=true
exchange.http-client.idle-timeout=1m
log.max-history=10
optimizer.enable-intermediate-aggregations=true
query.low-memory-killer.policy=total-reservation-on-blocked-nodes
task.concurrency=32
http-server.http.selector-threads=32
query.max-total-memory=8000GB
task.max-worker-threads=100
join-distribution-type=AUTOMATIC
node-scheduler.use-cacheable-white-list=true
query.client.timeout=5m
query.max-memory-per-node=200GB
optimizer.default-filter-factor-enabled=true
exchange.compression-enabled=true
task.max-leaf-splits-per-node=50
node-scheduler.max-splits-per-node=100
http-server.threads.max=500
query.max-total-memory-per-node=256GB
join-max-broadcast-table-size=2GB
http-server.http.acceptor-threads=32
writer-min-size=128MB
http-server.threads.min=50
discovery.uri=http://master:8000
optimizer.join-reordering-strategy=AUTOMATIC
http-server.log.max-history=10
memory.heap-headroom-per-node=48GB
optimizer.optimize-mixed-distinct-aggregations=true
optimizer.use-mark-distinct=true
query.max-length=600000
coordinator=false
```
many worker threads stuck at:
```
java.lang.Thread.State: RUNNABLE
    at jdk.internal.misc.Unsafe.defineAnonymousClass0(java.base@11.0.8/Native Method)
    at jdk.internal.misc.Unsafe.defineAnonymousClass(java.base@11.0.8/Unsafe.java:1225)
    at java.lang.invoke.InvokerBytecodeGenerator.loadAndInitializeInvokerClass(java.base@11.0.8/InvokerBytecodeGenerator.java:295)
    at java.lang.invoke.InvokerBytecodeGenerator.loadMethod(java.base@11.0.8/InvokerBytecodeGenerator.java:287)
    at java.lang.invoke.InvokerBytecodeGenerator.generateCustomizedCode(java.base@11.0.8/InvokerBytecodeGenerator.java:693)
    at java.lang.invoke.LambdaForm.compileToBytecode(java.base@11.0.8/LambdaForm.java:871)
    at java.lang.invoke.LambdaForm.customize(java.base@11.0.8/LambdaForm.java:506)
    at java.lang.invoke.MethodHandle.customize(java.base@11.0.8/MethodHandle.java:1675)
    at java.lang.invoke.Invokers.maybeCustomize(java.base@11.0.8/Invokers.java:582)
    at java.lang.invoke.Invokers.checkCustomized(java.base@11.0.8/Invokers.java:573)
    at java.lang.invoke.Invokers$Holder.invoke_MT(java.base@11.0.8/Invokers$Holder)
    at io.prestosql.operator.project.GeneratedPageProjection.project(GeneratedPageProjection.java:76)
    at io.prestosql.operator.project.PageProcessor$ProjectSelectedPositions.processBatch(PageProcessor.java:330)
    at io.prestosql.operator.project.PageProcessor$ProjectSelectedPositions.process(PageProcessor.java:205)
    at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
    at io.prestosql.operator.WorkProcessorUtils.lambda$flatten$6(WorkProcessorUtils.java:277)
    at io.prestosql.operator.WorkProcessorUtils$$Lambda$3039/0x00007ebdf82cb840.process(Unknown Source)
    at io.prestosql.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
    at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
    at io.prestosql.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
    at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
    at io.prestosql.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
    at io.prestosql.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
    at io.prestosql.operator.WorkProcessorUtils$$Lambda$3090/0x00007ebdf835d8b0.process(Unknown Source)
    at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
    at io.prestosql.operator.WorkProcessorUtils.lambda$flatten$6(WorkProcessorUtils.java:277)
    at io.prestosql.operator.WorkProcessorUtils$$Lambda$3039/0x00007ebdf82cb840.process(Unknown Source)
    at io.prestosql.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
    at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
    at io.prestosql.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
```
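For context on what that stack is doing: once a single MethodHandle instance is invoked more often than a threshold (the `java.lang.invoke.MethodHandle.CUSTOMIZE_THRESHOLD` property, default 127, to my knowledge), the JVM generates a customized LambdaForm via InvokerBytecodeGenerator and Unsafe.defineAnonymousClass, which is exactly the top of the stack above. A minimal sketch that exercises that code path (class name and iteration counts are mine, not from the thread):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CustomizeDemo {
    public static int square(int x) { return x * x; }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = MethodHandles.lookup().findStatic(
                CustomizeDemo.class, "square",
                MethodType.methodType(int.class, int.class));
        long sum = 0;
        // Well past the customization threshold: behind the scenes the JVM
        // may compile a customized LambdaForm for this handle, the
        // InvokerBytecodeGenerator work visible in the stack trace above.
        for (int i = 0; i < 1000; i++) {
            sum += (int) mh.invokeExact(i % 10);
        }
        System.out.println(sum); // 100 cycles of sum of squares 0..9 = 28500
    }
}
```

On a healthy JVM this customization happens roughly once per hot handle; the reports above suggest slowed workers end up stuck regenerating this bytecode repeatedly.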
Flame graph