trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Verify that long running JDBC connector queries are cancelled #7159

Open · electrum opened this issue 3 years ago

electrum commented 3 years ago

We should write a test that runs a query that produces many rows (maybe a cross join on a large table) and verify that the query is killed on the server afterwards (by using a database-specific method of listing the running queries). We can embed a UUID in the SQL query as a comment to identify it.
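
A rough sketch of what such a test could look like against a PostgreSQL-backed catalog (the URLs, credentials, catalog name, and the big_table relation are placeholders, and a real test would go through the connector's test harness rather than raw JDBC): run the marked query from a separate thread, cancel it, then poll PostgreSQL's pg_stat_activity until the marker is gone.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JdbcCancellationTestSketch
{
    // Placeholder endpoints for a local Trino coordinator and the backing PostgreSQL instance
    private static final String TRINO_URL = "jdbc:trino://localhost:8080/postgresql/public";
    private static final String POSTGRES_URL = "jdbc:postgresql://localhost:5432/test";

    public void verifyRemoteQueryIsCancelled()
            throws Exception
    {
        String marker = UUID.randomUUID().toString();
        // The UUID comment is meant to identify the remote query, assuming it can be matched
        // on the PostgreSQL side; matching on the scanned table name is an alternative.
        String sql = "SELECT /* " + marker + " */ count(*) FROM big_table a CROSS JOIN big_table b";

        ExecutorService executor = Executors.newSingleThreadExecutor();
        Properties properties = new Properties();
        properties.setProperty("user", "test");
        try (Connection trino = DriverManager.getConnection(TRINO_URL, properties);
                Statement statement = trino.createStatement()) {
            executor.submit(() -> {
                try {
                    statement.executeQuery(sql);
                }
                catch (Exception ignored) {
                    // expected once the statement is cancelled below
                }
            });

            TimeUnit.SECONDS.sleep(5); // give the scan time to reach the remote database
            statement.cancel();        // cancel the Trino query from the client side

            // The remote query should disappear from pg_stat_activity shortly afterwards.
            long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(1);
            while (remoteQueryRunning(marker)) {
                if (System.nanoTime() > deadline) {
                    throw new AssertionError("remote query was not cancelled: " + marker);
                }
                TimeUnit.SECONDS.sleep(1);
            }
        }
        finally {
            executor.shutdownNow();
        }
    }

    private static boolean remoteQueryRunning(String marker)
            throws Exception
    {
        try (Connection postgres = DriverManager.getConnection(POSTGRES_URL, "test", "test");
                Statement statement = postgres.createStatement();
                ResultSet resultSet = statement.executeQuery(
                        "SELECT count(*) FROM pg_stat_activity " +
                        "WHERE pid <> pg_backend_pid() AND query LIKE '%" + marker + "%'")) {
            resultSet.next();
            return resultSet.getLong(1) > 0;
        }
    }
}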

kokosing commented 3 years ago

Another way is to have a UDF that can sleep.

The problem is that this is very database-specific, so it would require writing dedicated code for every JDBC connector.

CC: @ssheikin

findepi commented 3 years ago

There is pg_sleep in PostgreSQL.
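
For PostgreSQL, one way to get such a "sleeping" source without any Trino-side UDF support is to hide pg_sleep behind a view on the database side and scan that view through the connector. A sketch with made-up object names, run directly against PostgreSQL rather than through Trino:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public final class SlowPostgresViewSetup
{
    private SlowPostgresViewSetup() {}

    // Executed directly against PostgreSQL. Each row of the view calls pg_sleep,
    // so a Trino query over it (for example SELECT * FROM postgresql.public.slow_rows)
    // spends most of its time waiting on the remote database, which is the situation
    // a cancellation test needs to provoke.
    public static void createSlowView(String postgresJdbcUrl, String user, String password)
            throws Exception
    {
        try (Connection connection = DriverManager.getConnection(postgresJdbcUrl, user, password);
                Statement statement = connection.createStatement()) {
            statement.execute(
                    "CREATE FUNCTION slow_identity(v bigint) RETURNS bigint AS " +
                    "$$ SELECT pg_sleep(1); SELECT v $$ LANGUAGE sql");
            statement.execute(
                    "CREATE VIEW slow_rows AS " +
                    "SELECT slow_identity(n) AS n FROM generate_series(1, 100000) AS t(n)");
        }
    }
}

Other databases would need their own equivalent of pg_sleep, which is the per-connector cost mentioned above.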

findepi commented 3 years ago

Per discussion offline with @kokosing @sopel39 @erichwang there is actually a bug: long-running JDBC queries are not getting cancelled if getting the next row (io.trino.plugin.jdbc.JdbcRecordCursor#advanceNextPosition) takes a very long time. While this is ultimately connector- (JDBC driver-) dependent, it is a side effect of the fact that we rely on thread interruption, which is known not to be a reliable cancellation mechanism in practice. io.trino.plugin.jdbc.JdbcRecordCursor#close does not get called when io.trino.plugin.jdbc.JdbcRecordCursor#advanceNextPosition blocks for a very long time.
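
In plain JDBC terms the distinction looks roughly like this (an illustration only, not the Trino code path): a thread blocked in Statement.executeQuery() or ResultSet.next() typically sits in a socket read that ignores interruption, whereas Statement.cancel() is the mechanism JDBC defines for aborting a running statement and may be called from another thread. A minimal sketch, assuming the driver honors cancel():

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class JdbcCancellationSketch
{
    private JdbcCancellationSketch() {}

    public static void scanWithTimeout(Connection connection, String sql, long timeoutSeconds)
            throws SQLException
    {
        ScheduledExecutorService canceller = Executors.newSingleThreadScheduledExecutor();
        try (Statement statement = connection.createStatement()) {
            // Interrupting the scanning thread while it is blocked in executeQuery() or next()
            // usually has no effect; the driver is stuck in a socket read. Statement.cancel()
            // is the JDBC-defined way to abort the statement, invoked here from another thread.
            canceller.schedule(() -> {
                try {
                    statement.cancel();
                }
                catch (SQLException ignored) {
                }
            }, timeoutSeconds, TimeUnit.SECONDS);

            try (ResultSet resultSet = statement.executeQuery(sql)) {
                while (resultSet.next()) {
                    // consume rows; a blocked next() is unblocked by cancel(), not by interrupt()
                }
            }
        }
        finally {
            canceller.shutdownNow();
        }
    }
}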

davseitsev commented 3 years ago

We have the same issue when running a query with a regexp_like filter that matches no rows. The underlying plugin returns rows quickly, but ScanFilterAndProjectOperator can't find any matching row and stays stuck after the query is cancelled. It also keeps consuming CPU and memory pool. Even one such query can make the whole cluster stuck.

Here is the stack trace of the hung thread:

"20210319_123048_22212_cbxjr.3.0-1-49204" #49204 prio=5 os_prio=0 tid=0x00007f3fb4013740 nid=0x4636 runnable [0x00007f3e195d2000]
   java.lang.Thread.State: RUNNABLE
        at io.airlift.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:203)
        at io.airlift.joni.Matcher.matchCheck(Matcher.java:304)
        at io.airlift.joni.Matcher.searchInterruptible(Matcher.java:469)
        at io.airlift.joni.Matcher.search(Matcher.java:318)
        at io.prestosql.operator.scalar.JoniRegexpFunctions.regexpLike(JoniRegexpFunctions.java:55)
        at java.lang.invoke.LambdaForm$DMH/453671855.invokeStatic_LL_I(LambdaForm$DMH)
        at java.lang.invoke.LambdaForm$MH/595542127.linkToTargetMethod(LambdaForm$MH)
        at io.prestosql.$gen.CursorProcessor_20210319_123054_20630.filter(Unknown Source)
        at io.prestosql.$gen.CursorProcessor_20210319_123054_20630.process(Unknown Source)
        at io.prestosql.operator.ScanFilterAndProjectOperator$RecordCursorToPages.process(ScanFilterAndProjectOperator.java:309)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
        at io.prestosql.operator.WorkProcessorUtils.access$000(WorkProcessorUtils.java:37)
        at io.prestosql.operator.WorkProcessorUtils$YieldingProcess.process(WorkProcessorUtils.java:181)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
        at io.prestosql.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
        at io.prestosql.operator.WorkProcessorUtils$$Lambda$5970/335637031.process(Unknown Source)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorUtils.lambda$flatten$6(WorkProcessorUtils.java:277)
        at io.prestosql.operator.WorkProcessorUtils$$Lambda$5966/513430371.process(Unknown Source)
        at io.prestosql.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
        at io.prestosql.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
        at io.prestosql.operator.WorkProcessorUtils$$Lambda$5970/335637031.process(Unknown Source)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
        at io.prestosql.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:215)
        at io.prestosql.operator.WorkProcessorUtils$$Lambda$5972/2142800982.process(Unknown Source)
        at io.prestosql.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
        at io.prestosql.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:148)
        at io.prestosql.operator.Driver.processInternal(Driver.java:379)
        at io.prestosql.operator.Driver.lambda$processFor$8(Driver.java:283)
        at io.prestosql.operator.Driver$$Lambda$6011/973436375.get(Unknown Source)
        at io.prestosql.operator.Driver.tryWithLock(Driver.java:675)
        at io.prestosql.operator.Driver.processFor(Driver.java:276)
        at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
        at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
        at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
        at io.prestosql.$gen.Presto_326____20210318_133749_2.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Sample query:

select * from prod.x.query_stats
where   executionstarttime >= to_unixtime(timestamp '2021-01-01 00:00:00') * 1000
    and executionstarttime <  to_unixtime(timestamp '2021-03-20 00:00:00') * 1000
    and ( regexp_like(query, '(?i).*rename\s+to\s+prod.a.t_backup.*') OR regexp_like(query, '(?i).*drop\s+table.+prod.a.t_backup.*') )
    order by createtime desc

The underlying catalog is Parquet-based, read with the native presto-parquet reader (v346).

sopel39 commented 3 years ago

@davseitsev you might want to try the config property regex-library=re2j, which should have more predictable performance in some cases.
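
For reference, the regex library can be switched cluster-wide via a config property (assuming the property name is unchanged in current versions; RE2J avoids backtracking, trading some typical-case speed for a bounded worst case):

# etc/config.properties on the coordinator and workers
regex-library=RE2J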

davseitsev commented 3 years ago

Just want to clarify that the issue is that a query gets stuck on the workers when it is cancelled on the coordinator and keeps consuming resources.

findepi commented 3 years ago

> Just want to clarify that the issue is that a query gets stuck on the workers when it is cancelled on the coordinator and keeps consuming resources.

Makes sense. This is because the code can spend a lot of time within a single projection. This is, however, a different problem from the one being addressed here, although I agree that from a user's perspective they look similar.

kokosing commented 3 years ago

@davseitsev Would you like to create a separate GitHub issue for your case? Note that your issue is not related to the JDBC connector at all, so solving your problem requires a dedicated effort separate from the issue described here.