thiningsun opened this issue 3 years ago (status: Open)
@thiningsun As per the stack trace, it looks like the Presto nodes are not able to communicate with your NameNode hosts. Can you try to telnet to your NameNode hosts from your Presto hosts to check that communication is working?
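If telnet is not installed on the workers, the same check can be approximated with a short script. The following is only a sketch: the helper names `can_resolve`, `can_connect`, and `check_worker` are made up for illustration, and port 8020 is taken from the error messages in this issue. It also checks whether the worker's own hostname resolves, since the traces report `UnknownHostException` for the local host.

```python
import socket
from typing import Dict, Iterable


def can_resolve(hostname: str) -> bool:
    """Return True if the name resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_worker(namenodes: Iterable[str], port: int = 8020) -> Dict[str, bool]:
    """Run on a Presto worker: check its own hostname and each NameNode."""
    local = socket.gethostname()
    results = {f"resolve {local}": can_resolve(local)}
    for nn in namenodes:
        results[f"connect {nn}:{port}"] = can_connect(nn, port)
    return results
```

On a worker, `check_worker(["zmbd-pm-server01", "zmbd-vm03"])` would return a dictionary of pass/fail flags. A `False` for the local hostname would point at `/etc/hosts` or DNS on that machine rather than at the NameNode.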
Background: We scaled the Presto cluster from 4 to 32 nodes. The newly added machines are co-located with HDFS (the first 4 were deployed independently). Since then, several errors have occurred when reading from HDFS. Presto version: presto-server-0.216
(1)Error opening Hive split hdfs://nameservice1/user/hive/warehouse/dw.db/dwd_seller_call_df_bp/pt=2021-03-30/000086_0 (offset=33554432, length=67108864): Failed on local exception: java.net.SocketException: Too many open files; Host Details : local host is: "java.net.UnknownHostException: zmbd-pm41: zmbd-pm41: System error"; destination host is: "zmbd-pm-server01":8020;
`com.facebook.presto.spi.PrestoException: Error reading from hdfs://nameservice1/user/hive/warehouse/dw.db/dwd_lesson_deduct_df/pt=2021-04-11/part-00048-61ba0c6b-dd90-419e-bee7-02041736f556-c000 at position 18083424
at com.facebook.presto.hive.orc.HdfsOrcDataSource.readInternal(HdfsOrcDataSource.java:77)
at com.facebook.presto.orc.AbstractOrcDataSource.readFully(AbstractOrcDataSource.java:105)
at com.facebook.presto.orc.AbstractOrcDataSource.readFully(AbstractOrcDataSource.java:96)
at com.facebook.presto.orc.OrcReader.<init>(OrcReader.java:120)
at com.facebook.presto.orc.OrcReader.<init>(OrcReader.java:80)
at com.facebook.presto.hive.orc.OrcPageSourceFactory.createOrcPageSource(OrcPageSourceFactory.java:192)
at com.facebook.presto.hive.orc.OrcPageSourceFactory.createPageSource(OrcPageSourceFactory.java:121)
at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:161)
at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:95)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:221)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
at com.facebook.presto.$gen.Presto_0_216____20210412_005931_1.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Too many open files; Host Details : local host is: "java.net.UnknownHostException: zmbd-pm69: zmbd-pm69: System error"; destination host is: "zmbd-vm03":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1480)
at org.apache.hadoop.ipc.Client.call(Client.java:1413)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy186.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
at sun.reflect.GeneratedMethodAccessor226.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy187.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1004)
at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1083)
at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1439)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1402)
at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
at com.facebook.presto.hive.orc.HdfsOrcDataSource.readInternal(HdfsOrcDataSource.java:64)
... 22 more
Caused by: java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:411)
at sun.nio.ch.Net.socket(Net.java:404)
at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
at java.nio.channels.SocketChannel.open(SocketChannel.java:145)
at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:62)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:590)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:713)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
at org.apache.hadoop.ipc.Client.call(Client.java:1452)
... 45 more`
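The root cause in this trace is `java.net.SocketException: Too many open files`, which means the process hit its file descriptor limit. One way to compare the limit with current usage for the Presto server process on a Linux worker is sketched below. This is an illustration only: the function names are hypothetical, the Presto PID must be supplied by the reader, and reading `/proc/<pid>/fd` for another user's process usually requires running as that user or as root.

```python
import os
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Extract (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because a limit may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', <soft>, <hard>, 'files']
            return fields[3], fields[4]
    return None


def open_fd_count(pid: int) -> int:
    """Count descriptors currently open in the process (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_report(pid: int) -> str:
    """Summarise open descriptors versus the limits for one process."""
    with open(f"/proc/{pid}/limits") as f:
        limits = parse_max_open_files(f.read())
    return f"pid {pid}: open={open_fd_count(pid)} limits(soft, hard)={limits}"
```

Checking the running process matters because its limit can differ from what `ulimit -n` shows in an interactive shell. If the open count on the failing workers sits close to the soft limit, raising the limit for the Presto service would be a reasonable next experiment.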
Along with other errors:
(2)Could not obtain block: BP-1864147273-172.20.3.102-1555051764064:blk_3191566124_2117944057 file=/user/hive/warehouse/tmp.db/bi_tmk_cc_achievement_team_all_hf/part-00104-7c7324e0-5c34-484e-95f6-ad701648ffa0-c000.deflate
`com.facebook.presto.spi.PrestoException: Could not obtain block: BP-1864147273-172.20.3.102-1555051764064:blk_3253715368_2180099334 file=/user/hive/warehouse/tmp.db/tmp_user_cr_transfer_detail_prod/part-00188-5a35191d-1732-4755-b73f-8086ec9298b9-c000.deflate
at com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:227)
at com.facebook.presto.hive.HiveRecordCursor.advanceNextPosition(HiveRecordCursor.java:175)
at com.facebook.presto.$gen.CursorProcessor_20210412_025803_7891.process(Unknown Source)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.processColumnSource(ScanFilterAndProjectOperator.java:242)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:234)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
at com.facebook.presto.$gen.Presto_0_216____20210412_005931_1.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1864147273-172.20.3.102-1555051764064:blk_3253715368_2180099334 file=/user/hive/warehouse/tmp.db/tmp_user_cr_transfer_detail_prod/part-00188-5a35191d-1732-4755-b73f-8086ec9298b9-c000.deflate
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:976)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:632)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:874)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:926)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:200)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:237)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:193)
at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at com.facebook.presto.hive.GenericHiveRecordCursor.advanceNextPosition(GenericHiveRecordCursor.java:209)
... 15 more`
(3)Error reading from hdfs://nameservice1/user/hive/warehouse/dw.db/dwd_teach_lesson_df/pt=2021-03-30/000015_0 at position 252643031
The strange thing is that when we had only 4 nodes, these errors were never reported. They all appeared after we expanded the Presto cluster to 32 machines.
There is another strange phenomenon. If the HDFS file were really damaged, the same SQL query would fail every time it is run. But when I re-execute the failed SQL, it succeeds, so the error seems random; it does not always fail. I also used the hdfs command to cat the supposedly damaged file (hdfs dfs -cat hdfs://nameservice1/user/hive/warehouse/dw.db/dwd_teach_lesson_df/pt=2021-03-30/000015_0), and it opens fine.
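Because the failures are intermittent and only started after the expansion, it may help to tally which workers actually report them. The sketch below counts the local host named in each `Host Details` message. The function name `failing_workers` is hypothetical, and the regex assumes the message layout seen in the errors above; the branch for a `hostname/ip` form, which is what Hadoop is expected to print when the local hostname does resolve, is an assumption.

```python
import re
from collections import Counter

# Matches both the failing form seen in this issue
#   local host is: "java.net.UnknownHostException: zmbd-pm41: zmbd-pm41: System error"
# and the assumed normal form
#   local host is: "somehost/10.0.0.7"
LOCAL_HOST_RE = re.compile(
    r'local host is: "(?:java\.net\.UnknownHostException: )?([^:/"\s]+)'
)


def failing_workers(log_text: str) -> Counter:
    """Count how often each local host appears in 'Host Details' errors."""
    return Counter(LOCAL_HOST_RE.findall(log_text))
```

Feeding the collected query error messages into `failing_workers` and calling `.most_common()` on the result would show whether the errors cluster on the newly added machines that share hardware with HDFS, or are spread across all 32 workers.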
Can anyone help me? Thanks.