Closed: zz22394 closed this issue 6 years ago.
@arhimondr can confirm, but I think this is fixed by https://github.com/prestodb/presto/pull/8950.
@cawallin Thanks for the info, it is very helpful!
This should be fixed in the next release (0.185).
We upgraded our development cluster to 0.185 this morning and we still suffer from this issue. Whenever a node is restarted, the next query fails with the "Filesystem closed" exception. Someone already reported that the fix was ineffective on https://github.com/prestodb/presto-hadoop-apache2/issues/20 .
On the CLI:
com.facebook.presto.spi.PrestoException: Filesystem closed
at com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:221)
at com.facebook.presto.hive.HiveRecordCursor.advanceNextPosition(HiveRecordCursor.java:179)
at com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:99)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:262)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:337)
at com.facebook.presto.operator.Driver.lambda$processFor$6(Driver.java:241)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:614)
at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:868)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:206)
... 14 more
Suppressed: java.lang.RuntimeException: java.io.IOException: Filesystem closed
at com.google.common.base.Throwables.propagate(Throwables.java:240)
at com.facebook.presto.hive.GenericHiveRecordCursor.close(GenericHiveRecordCursor.java:535)
at com.facebook.presto.hive.HiveUtil.closeWithSuppression(HiveUtil.java:827)
at com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:220)
... 14 more
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:710)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:284)
at com.facebook.presto.hive.GenericHiveRecordCursor.close(GenericHiveRecordCursor.java:532)
... 16 more
On the freshly restarted node (we restarted datanode-dc1r04n01.hostname.tld; the output is pruned since it's really long):
[...]
2017-09-26T12:23:06.019Z DEBUG 20170926_122256_00014_mcw2u.1.10-84-83 com.facebook.presto.execution.executor.TaskExecutor Split 20170926_122256_00014_mcw2u.1.10-84 {path=hdfs://mycluster/projects/HIDDEN.csv, start=335544320, length=67108864, fileSize=2295956258, hosts=[datanode-dc1r04n03.hostname.tld, datanode-dc1r05n05.hostname.tld, datanode-dc1r04n01.hostname.tld], database=HIDDEN, table=HIDDEN, forceLocalScheduling=false, partitionName=<UNPARTITIONED>} (start = 1.310761482810364E10, wall = 3 ms, cpu = 0 ms, wait = 0 ms, calls = 1) is finished
2017-09-26T12:23:06.017Z ERROR 20170926_122256_00014_mcw2u.1.10-8-100 com.facebook.presto.operator.Driver Error closing operator 0 for task 20170926_122256_00014_mcw2u.1.10
java.lang.RuntimeException: java.io.IOException: Filesystem closed
at com.google.common.base.Throwables.propagate(Throwables.java:240)
at com.facebook.presto.hive.GenericHiveRecordCursor.close(GenericHiveRecordCursor.java:535)
at com.facebook.presto.hive.HiveRecordCursor.close(HiveRecordCursor.java:245)
at com.facebook.presto.spi.RecordPageSource.close(RecordPageSource.java:80)
at com.facebook.presto.operator.TableScanOperator.finish(TableScanOperator.java:205)
at com.facebook.presto.operator.TableScanOperator.close(TableScanOperator.java:194)
at com.facebook.presto.operator.Driver.destroyIfNecessary(Driver.java:454)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:622)
at com.facebook.presto.operator.Driver.processFor(Driver.java:235)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:485)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:808)
at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:710)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:284)
at com.facebook.presto.hive.GenericHiveRecordCursor.close(GenericHiveRecordCursor.java:532)
... 14 more
2017-09-26T12:23:06.019Z DEBUG 20170926_122256_00014_mcw2u.1.10-88-85 com.facebook.presto.execution.executor.TaskExecutor Split 20170926_122256_00014_mcw2u.1.10-88 {path=hdfs://mycluster/HIDDEN.csv, start=1610612736, length=67108864, fileSize=2295956258, hosts=[datanode-dc1r04n02.hostname.tld, datanode-dc1r04n04.hostname.tld, datanode-dc1r04n03.hostname.tld], database=HIDDEN, table=HIDDEN, forceLocalScheduling=false, partitionName=<UNPARTITIONED>} (start = 1.3107614828379995E10, wall = 3 ms, cpu = 0 ms, wait = 0 ms, calls = 1) is finished
2017-09-26T12:23:06.019Z DEBUG 20170926_122256_00014_mcw2u.1.10-8-100 com.facebook.presto.execution.executor.TaskExecutor Split 20170926_122256_00014_mcw2u.1.10-8 {path=hdfs://mycluster/HIDDEN.csv, start=2415919104, length=64898816, fileSize=2480817920, hosts=[datanode-dc1r04n02.hostname.tld, datanode-dc1r05n04.hostname.tld, datanode-dc1r05n01.hostname.tld], database=HIDDEN, table=HIDDEN, forceLocalScheduling=false, partitionName=<UNPARTITIONED>} (start = 1.3107605343549324E10, wall = 9488 ms, cpu = 3177 ms, wait = 41 ms, calls = 9) is finished
2017-09-26T12:23:06.020Z DEBUG 20170926_122256_00014_mcw2u.1.10-95-66 com.facebook.presto.execution.executor.TaskExecutor Split 20170926_122256_00014_mcw2u.1.10-95 {path=hdfs://mycluster/HIDDEN.csv, start=134217728, length=67108864, fileSize=2536431376, hosts=[datanode-dc1r04n02.hostname.tld, datanode-dc1r04n03.hostname.tld, datanode-dc1r04n01.hostname.tld], database=HIDDEN, table=HIDDEN, forceLocalScheduling=false, partitionName=<UNPARTITIONED>} (start = 1.3107614829074875E10, wall = 3 ms, cpu = 0 ms, wait = 0 ms, calls = 1) is finished
2017-09-26T12:23:06.068Z DEBUG task-notification-2 com.facebook.presto.execution.TaskStateMachine Task 20170926_122256_00014_mcw2u.1.10 is FAILED
2017-09-26T12:23:06.093Z DEBUG task-notification-2 com.facebook.presto.execution.executor.TaskExecutor Task finished or failed 20170926_122256_00014_mcw2u.1.10
[...]
Is it possible that close() in one of the threads is closing the filesystem? close() is called in more than one place, for example https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/rcfile/RcFilePageSource.java#L135 and https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/orc/OrcPageSource.java#L137. Or could the closeable in HdfsOrcDataSource or HdfsRcFileDataSource cause a problem?
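(For context, Hadoop's FileSystem.get() returns a JVM-wide cached instance, which is why a close() in one reader can break every other reader. A minimal illustration, assuming fs.defaultFS points at an HDFS cluster; the path is a placeholder:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedFileSystemCloseDemo
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();

        // FileSystem.get() returns a cached instance keyed by
        // (scheme, authority, ugi), so fs1 and fs2 are the same object.
        FileSystem fs1 = FileSystem.get(conf);
        FileSystem fs2 = FileSystem.get(conf);

        FSDataInputStream in = fs2.open(new Path("/tmp/example.csv"));

        // Closing "one" filesystem closes the shared DFSClient...
        fs1.close();

        // ...so every other reader on the same instance now fails with
        // java.io.IOException: Filesystem closed
        in.read();
    }
}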
This should be fixed by https://github.com/prestodb/presto/pull/9079
We had to revert the fix for 0.185 since it didn't work. The new one in #9079 will be in 0.186 (assuming no problems).
@vloup I applied the code from https://github.com/prestodb/presto-hadoop-apache2/issues/20 to the master branch, and it seems to fix the "fsnf" problem on the first query after startup. But after some time (about one day) in my cluster, the exception "at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.hadoop.ipc.RemoteException: GSS initiate failed" happens again.
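(For reference, "GSS initiate failed" after roughly one day of uptime is typical of an expired or non-renewing Kerberos ticket. Kerberos for the Hive connector is configured per catalog; a sketch of the relevant hive.properties entries, with placeholder principal and keytab values:)

connector.name=hive-hadoop2
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/_HOST@EXAMPLE.COM
hive.metastore.client.principal=presto@EXAMPLE.COM
hive.metastore.client.keytab=/etc/presto/presto.keytab
hive.hdfs.authentication.type=KERBEROS
hive.hdfs.presto.principal=presto@EXAMPLE.COM
hive.hdfs.presto.keytab=/etc/presto/presto.keytab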
@electrum On my cluster, #9079 has a problem with the first query after the nodes start. Some threads get the filesystem and run well, but when another thread gets the filesystem, the check "if (!fileSystemHolder.getPrivateCredentials().equals(privateCredentials))" (I don't know exactly what it is meant to do) creates a new filesystem. Because of that, the old filesystem gets garbage-collected while some threads are still using it.
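(A minimal sketch of the cache pattern being described; the names mirror the snippet quoted above, but the classes here are stand-ins, not Presto's actual implementation. The hazard is that replacing the cached holder leaves in-flight readers holding a FileSystem that can be closed underneath them:)

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FileSystemCacheSketch
{
    // Stand-in for org.apache.hadoop.fs.FileSystem, kept local so the
    // sketch compiles without Hadoop on the classpath.
    static class FileSystem
    {
        boolean closed;
        void read() { if (closed) throw new IllegalStateException("Filesystem closed"); }
    }

    static class FileSystemHolder
    {
        final FileSystem fileSystem = new FileSystem();
        final Set<String> privateCredentials;

        FileSystemHolder(Set<String> privateCredentials)
        {
            this.privateCredentials = privateCredentials;
        }

        Set<String> getPrivateCredentials()
        {
            return privateCredentials;
        }
    }

    private final Map<String, FileSystemHolder> cache = new ConcurrentHashMap<>();

    FileSystem getFileSystem(String user, Set<String> privateCredentials)
    {
        FileSystemHolder fileSystemHolder = cache.get(user);
        if (fileSystemHolder == null
                || !fileSystemHolder.getPrivateCredentials().equals(privateCredentials)) {
            // Credentials changed (e.g. a Kerberos re-login produced a new
            // ticket): replace the cached entry with a fresh FileSystem.
            // Threads still reading through the old holder now race against
            // its cleanup, which is the failure mode described above.
            fileSystemHolder = new FileSystemHolder(privateCredentials);
            cache.put(user, fileSystemHolder);
        }
        return fileSystemHolder.fileSystem;
    }
}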
@electrum @qgxiaozhan Looking into it.
I upgraded our development cluster from 0.185 to 0.187 and, after a full day of uptime and after restarting some nodes both before and after the first ticket renewal lifetime, I got no "Filesystem closed". Thanks for the fix, and thanks to everybody involved in solving it!
I tried with 0.188, and "Filesystem closed" error did not happen again. Thanks!
It looks like this is fixed, so it could be closed. Closing then ;)
Is anyone noticing this issue on version 0.273.2?
Dear team
I got "Filesystem closed" error almost every time after restarting Presto cluster. I asked about this error message in Presto user group, and found some other users also have the same problem. https://groups.google.com/forum/#!topic/presto-users/62vuYPhcXDM
Enviroment: Presto 0.167t.0.2, or Presto 0.182 with Kerberized Hive 2.x/HDFS 2.7.3.
Error in the Presto CLI:
Error in worker server.log:
Coordinator error:
The SQL failed right after the Presto cluster restarted, but after several retries, the same SQL ran successfully.
My questions:
Why did this happen?
How can I prevent it from happening again?
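(Until the underlying cache issue is fixed, the retry behavior observed above can be automated on the client side. A minimal sketch using the Presto JDBC driver; host, catalog, schema, and table names are placeholders:)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RetryOnFilesystemClosed
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical coordinator URL; catalog and schema are placeholders.
        String url = "jdbc:presto://coordinator.example.com:8080/hive/default";
        String sql = "SELECT count(*) FROM my_table";

        SQLException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try (Connection conn = DriverManager.getConnection(url, "presto", null);
                    Statement stmt = conn.createStatement();
                    ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
                return; // success
            }
            catch (SQLException e) {
                // Only retry the transient failure described in this issue.
                if (!String.valueOf(e.getMessage()).contains("Filesystem closed")) {
                    throw e;
                }
                last = e;
            }
        }
        throw last;
    }
}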