numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hi, I am using Spark to write to a Hudi table but get an error: 23/12/28 18:51:39 INFO DAGScheduler: Job 8 finished: collectAsMap at HoodieSparkEngineContext #176

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago

Tips before filing an issue

**Describe the problem you faced**

Hi, I am using Spark to write to a Hudi table but get this error:

```
23/12/28 18:51:39 INFO DAGScheduler: Job 8 finished: collectAsMap at HoodieSparkEngineContext.java:164, took 0.089813 s
23/12/28 18:51:39 WARN WriteMarkers: Error deleting marker directory for instant 00000000000000010
org.apache.hudi.exception.HoodieIOException: `s3a://raw-bucket/bronze/.hoodie/metadata/.hoodie/.temp/00000000000000010': Directory is not empty
    at org.apache.hudi.common.fs.FSUtils.deleteDir(FSUtils.java:720)
    at org.apache.hudi.table.marker.DirectWriteMarkers.deleteMarkerDir(DirectWriteMarkers.java:82)
    at org.apache.hudi.table.marker.WriteMarkers.quietDeleteMarkerDir(WriteMarkers.java:147)
    at org.apache.hudi.client.BaseHoodieWriteClient.postCommit(BaseHoodieWriteClient.java:567)
    at org.apache.hudi.client.BaseHoodieWriteClient.postWrite(BaseHoodieWriteClient.java:545)
    at org.apache.hudi.client.SparkRDDWriteClient.bulkInsertPreppedRecords(SparkRDDWriteClient.java:239)
    at org.apache.hudi.client.SparkRDDWriteClient.bulkInsertPreppedRecords(SparkRDDWriteClient.java:63)
    at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.commitInternal(HoodieBackedTableMetadataWriter.java:1129)
    at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.bulkCommit(SparkHoodieBackedTableMetadataWriter.java:130)
    at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:445)
    at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:278)
    at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:182)
    at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:95)
    at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:72)
    at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:287)
    at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:273)
    at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1256)
    at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1296)
    at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:139)
    at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:224)
    at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:431)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
    at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.QueryExecution$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$super$transformDownWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:133)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    at com.zdp.core.batch.service.impl.HudiDataManager.insert(HudiDataManager.scala:78)
    at StorageConnector$.main(StorageConnector.scala:23)
    at StorageConnector.main(StorageConnector.scala)
Caused by: org.apache.hadoop.fs.PathIsNotEmptyDirectoryException: `s3a://raw-bucket/bronze/.hoodie/metadata/.hoodie/.temp/00000000000000010': Directory is not empty
```

Below are my Hudi options; the save mode is Overwrite:

```scala
Map(
  "hoodie.table.name" -> configurationDTO.tableName,
  "hoodie.datasource.write.recordkey.field" -> "emp_id",
  "hoodie.datasource.write.table.name" -> configurationDTO.tableName,
  //"hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.insert.shuffle.parallelism" -> "2",
  RECORDKEY_FIELD.key() -> "emp_id",
  PARTITIONPATH_FIELD.key() -> "state,department"
)
```
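
For reference, here is a minimal sketch of how these options would be passed to a Hudi write in Overwrite mode. The `writeToHudi` method, the `df` DataFrame, and the `hudiOptions` parameter are hypothetical names used only for illustration, and the base path is inferred from the s3a path in the stack trace; this is not the actual application code.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper for illustration only: `df` is the DataFrame being written,
// `hudiOptions` is the option Map shown above, `basePath` is inferred from the stack trace.
def writeToHudi(df: DataFrame, hudiOptions: Map[String, String]): Unit = {
  val basePath = "s3a://raw-bucket/bronze"

  df.write
    .format("hudi")            // Hudi Spark datasource
    .options(hudiOptions)      // the option Map shown above
    .mode(SaveMode.Overwrite)  // mode is Overwrite, as described
    .save(basePath)
}
```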
Dependencies

```"org.apache.spark" %% "spark-core" % "3.2.0",
"org.apache.spark" %% "spark-sql" % "3.2.0",
"org.apache.spark" %% "spark-streaming" % "3.2.0",
"org.apache.hudi" %% "hudi-spark3.2-bundle" % "0.14.0",
"org.apache.hadoop" % "hadoop-common" % "3.3.1",
"org.apache.hadoop" % "hadoop-client" % "3.3.1",
"org.apache.avro" % "avro" % "1.10.2",
"org.apache.avro" % "avro-mapred" % "1.10.2" % "test",
"org.apache.avro" % "avro-tools" % "1.10.2" % "test",
"com.lihaoyi" %% "ujson" % "3.1.2",
"org.apache.hadoop" % "hadoop-aws" % "3.3.1",
"com.amazonaws" % "aws-java-sdk" % "1.12.622"```

I am using Hudi 0.14.
A clear and concise description of the problem.

**To Reproduce**

Steps to reproduce the behavior:

1.
2.
3.
4.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.14.0

* Spark version : 3.2.0

* Hive version :

* Hadoop version : 3.3.1

* Storage (HDFS/S3/GCS..) : S3 (s3a)

* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

See the stack trace included in the problem description above.