[SUPPORT] Using spark structured streaming, reading from kafka and writing to a MoR hudi table, I can't get async clustering to work

Describe the problem you faced

Using Spark Structured Streaming to read from Kafka and write to a MoR Hudi table, I can't get async clustering to work. When the clustering job runs after the nth commit, I get this error:

java.lang.IllegalArgumentException: For input string: "null" at scala.collection.immutable.StringLike.parseBoolean(StringLike.scala:336)

followed by:

java.util.concurrent.CompletionException: org.apache.spark.SparkException: Cannot find catalog plugin class for catalog 'spark_catalog': org.apache.spark.sql.hudi.catalog.HoodieCatalog

(complete stack attached). The clustering jobs fail; the batch in progress finishes, but the stream does not continue to the next batch.

Possibly related: I also see "Cannot find catalog plugin class for catalog 'spark_catalog'" errors when clustering inline, but in that case the microbatch is retried, apparently succeeds (no failed jobs appear in the Spark UI), and processing continues with the next batch.

Any ideas what could be wrong?
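For readers trying to reproduce this, here is a minimal sketch of the kind of writer options involved in async clustering on a MoR table. It is not the configuration from this report: the helper name `async_clustering_options`, the table name, the field names, and the commit threshold are all hypothetical placeholders. Only the `hoodie.*` keys themselves are standard Hudi write configs.

```python
def async_clustering_options(table_name, record_key, precombine_field, max_commits=4):
    """Build Hudi writer options for a MoR table with async clustering.

    Hypothetical helper for illustration; every argument value is a placeholder.
    """
    return {
        "hoodie.table.name": table_name,
        # MERGE_ON_READ is the MoR table type described in the report.
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Schedule and run clustering asynchronously, once every `max_commits` commits.
        "hoodie.clustering.async.enabled": "true",
        "hoodie.clustering.async.max.commits": str(max_commits),
    }


if __name__ == "__main__":
    # Placeholder values, assumed for the example only.
    opts = async_clustering_options("events", "event_id", "event_ts")
    for key, value in sorted(opts.items()):
        print(f"{key}={value}")
```

In a PySpark streaming job these options would typically be passed along the lines of `df.writeStream.format("hudi").options(**opts).option("checkpointLocation", checkpoint_path).start(table_path)`, where `checkpoint_path` and `table_path` are again placeholders.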

I was using 0.13.1 and had those settings configured as specified above. One thing I found is that we were using the hudi-spark3 bundle with Dataproc, which runs Spark 3.3, not 3.4, so we're trying the spark3.3 bundle instead. I'm also going to experiment with 0.14 to see how it behaves. I'm guessing our problem is a Spark cluster or configuration issue, since this is obviously working for most folks. I'd greatly appreciate any other suggestions as to what might be causing this.
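As a sketch of the bundle check described here, the following Python helper derives a Spark-version-specific Hudi bundle coordinate and pairs it with the session settings that reference `HoodieCatalog`. The function names are hypothetical, Scala 2.12 is an assumption about the cluster, and supplying the bundle through `spark.jars.packages` is only one way to get it onto the classpath. Whether this resolves the error above is not confirmed.

```python
def hudi_bundle_coordinate(spark_version, hudi_version, scala_version="2.12"):
    """Return the Maven coordinate of the Hudi bundle for a given Spark minor version.

    Hypothetical helper: "3.3.2" is reduced to "3.3" to pick the matching bundle.
    """
    major_minor = ".".join(spark_version.split(".")[:2])
    return f"org.apache.hudi:hudi-spark{major_minor}-bundle_{scala_version}:{hudi_version}"


def hudi_session_confs(spark_version, hudi_version):
    """Session-level settings commonly used with Hudi on Spark 3.2 and later."""
    return {
        # Assumption: the bundle is pulled in via spark.jars.packages.
        "spark.jars.packages": hudi_bundle_coordinate(spark_version, hudi_version),
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        # The catalog class named in the error; it is loaded from the bundle jar,
        # so the jar must be on the classpath for this setting to resolve.
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    }


if __name__ == "__main__":
    # Versions taken from the report: Hudi 0.13.1 on a Spark 3.3 cluster.
    for key, value in hudi_session_confs("3.3.2", "0.13.1").items():
        print(f"--conf {key}={value}")
```

Comparing the printed coordinate against the jar actually present on the cluster is a quick way to spot a Spark-version mismatch like the one suspected here.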

numberlabs-developers / hudi

[SUPPORT] Using spark structured streaming, reading from kafka and writing to a MoR hudi table, I can't get async clustering to work #177