prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
15.92k stars 5.33k forks source link

Can Presto insert 5+ billion rows of data from an external text table into a Presto table using Hive? #20780

Open minhancao opened 1 year ago

minhancao commented 1 year ago

Environment details

Using docker setup with MinIO storage, Hive, Presto coordinator, and Presto worker.

Usage question

Hello! I am trying to insert from an external table that has 5+ billion rows of pipe-delimited data into a Presto table using Hive. The following image below shows it was able to read the 5.82B rows of data but had error committing into Hive. If anyone has any insight on this error or any other ways to tackle inserting 5.82B rows of data from an external table into a Presto table in PARQUET format, please let me know, thanks!

image
tdcmeehan commented 1 year ago

This needs more information--what is the stack trace of the error? You'll find it from your query logs and in the UI.

minhancao commented 1 year ago

@tdcmeehan Here is the stack trace, please let me know if you need anything else:

com.facebook.presto.spi.PrestoException: Error committing write to Hive
    at com.facebook.presto.hive.RecordFileWriter.commit(RecordFileWriter.java:186)
    at com.facebook.presto.hive.HiveWriter.commit(HiveWriter.java:102)
    at com.facebook.presto.hive.HivePageSink.doFinish(HivePageSink.java:237)
    at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
    at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:81)
    at com.facebook.presto.hive.HivePageSink.finish(HivePageSink.java:228)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSink.finish(ClassLoaderSafeConnectorPageSink.java:74)
    at com.facebook.presto.operator.TableWriterOperator.finish(TableWriterOperator.java:292)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:451)
    at com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:311)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:732)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:304)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:603)
    at com.facebook.presto.$gen.Presto_0_282_fadfd95____20230906_173105_1.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
    at com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3OutputStream.uploadObject(PrestoS3FileSystem.java:1296)
    at com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3OutputStream.close(PrestoS3FileSystem.java:1244)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.close(HadoopPositionOutputStream.java:64)
    at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1106)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:132)
    at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:164)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:127)
    at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:144)
    at com.facebook.presto.hive.ParquetRecordWriterUtil$1.close(ParquetRecordWriterUtil.java:95)
    at com.facebook.presto.hive.RecordFileWriter.commit(RecordFileWriter.java:181)
    ... 18 more
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1219)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1165)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
    at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3630)
    at com.amazonaws.services.s3.transfer.internal.CompleteMultipartUpload.call(CompleteMultipartUpload.java:94)
    at com.amazonaws.services.s3.transfer.internal.CompleteMultipartUpload.call(CompleteMultipartUpload.java:37)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    ... 3 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.base/java.net.SocketInputStream.socketRead0(Native Method)
    at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
    at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:82)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    ... 16 more