I am trying to load around 60 million records from S3 into ClickHouse using PySpark, and I am getting the error below.
Environment
OS version: AWS EMR 6.3.0
ClickHouse Server version: 21.8.10
ClickHouse Native JDBC version: 2.6.5
(Optional) Spark version: 3.1.1-amzn-0
Error logs
Caused by: com.github.housepower.exception.ClickHouseSQLException: DB::ExceptionDB::Exception: Unexpected packet Data received from client. Stack trace:
0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0x8fd5d9a in /usr/bin/clickhouse
1. DB::TCPHandler::receiveUnexpectedData() @ 0x1103b729 in /usr/bin/clickhouse
2. DB::TCPHandler::receivePacket() @ 0x11031232 in /usr/bin/clickhouse
3. DB::TCPHandler::runImpl() @ 0x110298ea in /usr/bin/clickhouse
4. DB::TCPHandler::run() @ 0x1103cd39 in /usr/bin/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x13bb428f in /usr/bin/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x13bb5d1a in /usr/bin/clickhouse
7. Poco::PooledThread::run() @ 0x13ce7f99 in /usr/bin/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x13ce422a in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
at com.github.housepower.protocol.ExceptionResponse.readExceptionFrom(ExceptionResponse.java:36)
at com.github.housepower.protocol.Response.readFrom(Response.java:35)
at com.github.housepower.client.NativeClient.receiveResponse(NativeClient.java:183)
at com.github.housepower.client.NativeClient.receiveEndOfStream(NativeClient.java:134)
at com.github.housepower.jdbc.ClickHouseConnection.sendInsertRequest(ClickHouseConnection.java:294)
at com.github.housepower.jdbc.statement.ClickHousePreparedInsertStatement.close(ClickHousePreparedInsertStatement.java:167)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:695)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:856)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:854)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1020)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2278)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
What I find strange is that the error does not occur when the table's PARTITION BY uses a single column; as soon as I partition by multiple columns, the insert fails with this error (see the snippet below).
Code Snippet
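The actual job is larger, but the write path boils down to something like this (simplified sketch; the bucket, table, and column names here are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-to-clickhouse")
    .getOrCreate()
)

# ~60 million rows read from S3 (Parquet shown here; the real format may differ)
df = spark.read.parquet("s3://my-bucket/path/to/data/")

# Write through the ClickHouse Native JDBC driver.
# The target table uses a multi-column PARTITION BY, e.g.
#   ENGINE = MergeTree()
#   PARTITION BY (event_date, region)
#   ORDER BY (event_date, region, id)
# With a single-column PARTITION BY the same write succeeds.
(
    df.write
    .format("jdbc")
    .option("driver", "com.github.housepower.jdbc.ClickHouseDriver")
    .option("url", "jdbc:clickhouse://clickhouse-host:9000/default")
    .option("dbtable", "events")
    .option("user", "default")
    .option("password", "")
    .option("batchsize", 100000)
    .option("isolationLevel", "NONE")
    .mode("append")
    .save()
)
```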
Can anyone help me with this?