Open torvalds-dev-testbot[bot] opened 2 months ago
You are attempting row-level deletion in a Hudi table with the write operation `delete`, and rows are not deleted even though no errors are reported. Here is some guidance to help you debug this scenario:
In the context of write operations in Hudi, the Upsert flow for a CoW table involves preparing records into "update" and "insert" buckets before persisting them into storage. For updates, new FileSlices are created, while inserts lead to new FileGroups being generated. Source URL: https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-310
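The upsert flow described above can be sketched as a minimal PySpark write. This is only an illustrative sketch: `base_path` and the field names (`id`, `partition_path`, `time`) are placeholders taken from the reporter's configuration, not values from your environment.

```python
# Minimal upsert sketch for a Hudi CoW table.
# Assumptions: base_path and the 'id'/'partition_path'/'time' fields are placeholders.

def hudi_upsert_options(table_name: str) -> dict:
    """Build the option map for an upsert write."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "partition_path",
        "hoodie.datasource.write.precombine.field": "time",
        "hoodie.datasource.write.operation": "upsert",
    }

def upsert(df, base_path: str, table_name: str) -> None:
    # Updates rewrite existing FileSlices; inserts create new FileGroups.
    (df.write.format("hudi")
       .options(**hudi_upsert_options(table_name))
       .mode("append")
       .save(base_path))
```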
The Delete flow in Hudi is akin to a special case of the Upsert flow where input records are transformed into HoodieKeys to identify records for deletion. This results in a hard delete, effectively removing the target records from the table. Source URL: https://blog.datumagic.com/p/apache-hudi-from-zero-to-one-310
When using the write operation `delete` in Hudi, ensure that your DataFrame contains either Hudi's meta fields (e.g. `_hoodie_record_key`) or the configured record key and partition path fields, and that the save mode is set to `Append`. Source URL: https://datacouch.medium.com/getting-started-with-apache-hudi-711b89c107aa
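As a concrete illustration, here is a minimal hard-delete sketch in PySpark. Again, `base_path` and the field names mirror the reporter's configuration and are assumptions, not a definitive recipe.

```python
# Minimal hard-delete sketch for a Hudi CoW table.
# Assumptions: base_path and the 'id'/'partition_path'/'time' fields are placeholders.

def hudi_delete_options(table_name: str) -> dict:
    """Build the option map for a hard delete of rows identified by record key."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "partition_path",
        "hoodie.datasource.write.precombine.field": "time",
        "hoodie.datasource.write.operation": "delete",
    }

def delete_rows(df, base_path: str, table_name: str) -> None:
    """df must carry the record key and partition path columns
    (or Hudi's meta fields); the save mode must be append."""
    (df.write.format("hudi")
       .options(**hudi_delete_options(table_name))
       .mode("append")          # 'append' adds a delete commit to the timeline
       .save(base_path))
```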
Review your deletion process against these notes to identify where your approach diverges. Note in particular that a delete whose keys match no stored records is a silent no-op: Hudi matches rows by record key and partition path, and non-matching keys are simply ignored rather than raising an error.
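A quick sanity check along these lines (a minimal sketch; the helper name is hypothetical, and the column names mirror the reporter's configuration):

```python
# Hypothetical debugging helper: a delete is a silent no-op when the keys in
# the delete DataFrame do not match any stored record key / partition path.

HUDI_META_KEY = "_hoodie_record_key"            # meta field stored per row
HUDI_META_PARTITION = "_hoodie_partition_path"  # meta field stored per row

def missing_key_columns(df_columns, recordkey="id", partitionpath="partition_path"):
    """Return the configured key columns absent from the delete DataFrame.
    An empty list means the frame at least carries the right columns."""
    return [c for c in (recordkey, partitionpath) if c not in df_columns]
```

If the columns are present but rows still survive the delete, compare their values against a snapshot read of the table (`spark.read.format("hudi").load(base_path)`), looking at `_hoodie_record_key` and `_hoodie_partition_path`: a mismatch, such as a differently formatted partition path, means the delete targets keys that do not exist.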
Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/252
Describe the problem you faced

I am attempting row-level deletion using the `delete` write operation on my CoW table. When I run my PySpark job, no rows are deleted, even though printing the rows before deletion confirms the filter is correct. No errors are reported, so I am seeking help debugging this.
To Reproduce

Steps to reproduce the behavior:
Expected behavior

I expect the specified rows to be deleted based on the provided configuration.
Environment Description
Additional context
{'hoodie.datasource.write.precombine.field': 'time', 'hoodie.datasource.write.recordkey.field': 'id', 'hoodie.datasource.write.partitionpath.field': 'partition_path', 'hoodie.cleaner.commits.retained': 0, 'hoodie.table.name': 'table_name', 'hoodie.datasource.write.operation': 'delete'}
Stacktrace

N/A; the job completes without errors.