
On deleting a column from a Hudi table, it's still present when querying from Presto using the Hive connector #22704

Open sutodi opened 4 months ago

sutodi commented 4 months ago

On deleting a column from a Hudi table, it is still present when querying from Presto using the Hive connector. All rows have a null value for the deleted column, but the column still appears in the output of SELECT *. DESCRIBE TABLE also still shows the column.

Your Environment

hudi-bundle: org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1 (used to create a Hudi table synced to the metastore)
spark: 3.3.2

Expected Behavior

After dropping a column from the Hudi table, I expected the column to no longer be present when querying from Presto.

Current Behavior

All rows have a null value for the deleted column, but the column is still present in the output of SELECT *. DESCRIBE TABLE also shows that the column is still present.
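
For reference, this is how the behavior can be observed from Presto. This is a minimal sketch using the presto-python-client package; the host, port, user, and catalog name are placeholders for your setup:

    import prestodb  # pip install presto-python-client

    # Connect through the Hive connector catalog (assumed to be named "hive").
    conn = prestodb.dbapi.connect(
        host="presto-coordinator", port=8080, user="test",
        catalog="hive", schema="default",
    )
    cur = conn.cursor()

    # The dropped column still shows up in the table metadata...
    cur.execute("DESCRIBE subham_test_metastore_13")
    print(cur.fetchall())

    # ...and in SELECT *, with NULL in every row for that column.
    cur.execute("SELECT * FROM subham_test_metastore_13")
    print(cur.fetchall())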

Possible Solution

TBD

Steps to Reproduce

1. Create a Dataproc cluster with the following changes.

spark-defaults.conf:

    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
    spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
    spark.sql.legacy.avro.datetimeRebaseModeInWrite=CORRECTED
    spark.sql.legacy.avro.datetimeRebaseModeInRead=CORRECTED
    spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
    spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar

hive-site.xml:

    hive.metastore.disallow.incompatible.col.type.changes = false

2. Create a table in Hudi from pyspark:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
    from datetime import datetime

    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("surname", StringType(), True),
        StructField("ts", TimestampType(), True)  # timestamp field
    ])
    data = [
        (1, "John", "Doe", datetime.now()),
        (2, "Jane", "Smith", datetime.now()),
        (3, "Michael", "Johnson", datetime.now()),
        (4, "Emily", "Williams", datetime.now())
    ]

    df = spark.createDataFrame(data, schema)
    df.write \
        .format("org.apache.hudi") \
        .option("hoodie.table.name", "hoodie_table") \
        .option("hoodie.datasource.write.recordkey.field", "id") \
        .option("hoodie.datasource.write.keyprefix", "ts") \
        .option("hoodie.schema.on.read.enable", "true") \
        .mode("overwrite") \
        .save("gs://xxxx/subham_test_metastore_13")

    spark.sql("CREATE TABLE default.subham_test_metastore_13 USING hudi LOCATION 'gs://xxxx/subham_test_metastore_13'")

3. Switch to the spark-sql shell (pyspark uses DataSource V1), enable schema on read, and drop the column:

    set hoodie.schema.on.read.enable=true;
    ALTER TABLE default.subham_test_metastore_13 DROP COLUMN surname;

On doing this we also get the error below, but the column does get dropped on the Hudi side (see the re-sync sketch after this list):

    Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :

4. Update a row in the table from pyspark:

    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("ts", TimestampType(), True)  # timestamp field
    ])
    data = [(1, "Johny", datetime.now())]
    df = spark.createDataFrame(data, schema)
    df.write \
        .format("org.apache.hudi") \
        .option("hoodie.table.name", "hoodie_table") \
        .option("hoodie.datasource.write.recordkey.field", "id") \
        .option("hoodie.datasource.write.keyprefix", "ts") \
        .option("hoodie.schema.on.read.enable", "true") \
        .mode("append") \
        .save("gs://xxxx/subham_test_metastore_13")

5. Now, from both spark-sql and pyspark, the dropped column no longer appears, but it still appears when querying from Presto.
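
Since the DROP COLUMN in step 3 failed to update the metastore, one thing worth trying is forcing a metastore re-sync from the current Hudi table schema. This is only a sketch, under the assumption that Hudi's hive-sync options push the post-drop schema to the metastore on the next write; the metastore URI is a placeholder:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
    from datetime import datetime

    # Rebuild the step-4 dataframe (post-drop schema, without "surname").
    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("name", StringType(), True),
        StructField("ts", TimestampType(), True),
    ])
    df = spark.createDataFrame([(1, "Johny", datetime.now())], schema)

    # Append with hive sync enabled so Hudi re-syncs the table schema to the
    # Hive metastore (option names are from Hudi 0.14; this also relies on
    # hive.metastore.disallow.incompatible.col.type.changes=false from step 1).
    df.write \
        .format("org.apache.hudi") \
        .option("hoodie.table.name", "hoodie_table") \
        .option("hoodie.datasource.write.recordkey.field", "id") \
        .option("hoodie.schema.on.read.enable", "true") \
        .option("hoodie.datasource.hive_sync.enable", "true") \
        .option("hoodie.datasource.hive_sync.mode", "hms") \
        .option("hoodie.datasource.hive_sync.metastore.uris", "thrift://<metastore-host>:9083") \
        .option("hoodie.datasource.hive_sync.database", "default") \
        .option("hoodie.datasource.hive_sync.table", "subham_test_metastore_13") \
        .mode("append") \
        .save("gs://xxxx/subham_test_metastore_13")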

Screenshots (if appropriate)

Context

We have a lakehouse on Hudi and use Presto with the Hive connector to query Hudi tables. We want to delete a column and are facing this problem.

sutodi commented 4 months ago

@codope

imjalpreet commented 3 months ago

@sutodi Accessing Hudi tables through Presto's Hive connector relies on the metadata present in the metastore. If the column still exists in the metastore, you will see it in the DESCRIBE TABLE output and in SELECT * queries. I can see there was an error when you tried to drop the column in Spark, so it looks like the metastore metadata still contains the dropped column. The reason you don't see the issue when querying from Spark is probably that Spark reads the schema from the Hudi metadata in the filesystem.

Can you verify whether the column still exists in the metastore?
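
One way to check is to run DESCRIBE against Hive directly, bypassing both Presto and Spark. A minimal sketch, assuming a HiveServer2 endpoint is reachable; the host and port are placeholders:

    from pyhive import hive  # pip install pyhive

    # DESCRIBE via HiveServer2 reflects what the metastore holds for the table.
    conn = hive.connect(host="hiveserver2-host", port=10000)
    cursor = conn.cursor()
    cursor.execute("DESCRIBE default.subham_test_metastore_13")
    for row in cursor.fetchall():
        print(row)  # (column name, type, comment)

    # If "surname" is still listed here, the metastore was never updated,
    # which would explain what Presto's Hive connector returns.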

imjalpreet commented 3 months ago

@sutodi You can also try out the Presto Hudi Connector and see if you face the same issue. Ideally, you should see similar behaviour as spark when querying via the Presto Hudi Connector since it also relies on the metadata present in the filesystem. There are different ways to access your Hudi data when using the Hudi Connector. You can have a look at the docs for getting started: https://prestodb.github.io/docs/current/connector/hudi.html
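
For reference, per the linked docs the Hudi connector is registered with a catalog properties file (e.g. etc/catalog/hudi.properties) containing connector.name=hudi and hive.metastore.uri=thrift://<metastore-host>:9083. A minimal sketch of querying the same table through that catalog (the catalog name and connection details are placeholders):

    import prestodb  # pip install presto-python-client

    # Same table, but through the Hudi connector catalog instead of the Hive one.
    conn = prestodb.dbapi.connect(
        host="presto-coordinator", port=8080, user="test",
        catalog="hudi", schema="default",
    )
    cur = conn.cursor()
    cur.execute("DESCRIBE subham_test_metastore_13")
    # If the comment above holds, the dropped column should not appear here.
    print(cur.fetchall())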