Closed rohitkum2506 closed 2 months ago
Can you verify that the query didn't result in a full scan, and update the Testing Done section?
Added query plan for new query. HDFS logs are not available yet for the test table I am working with. Will update testing section with HDFS actions associated with delete query when available.
General question: Do we align on letting users hold accountability of the columnPattern? Unwanted deletion can be easily avoided by looking at the partition table which is a metadata operation and doesn't hurt performance. Why don't we want to do the check?
@jiang95-dev Good question. It's more of an implementation gotcha. To avoid deleting a record because of an invalid pattern, we would still have to cross-validate every columnValue against the invalid partitions. Implementation-wise, something like:
substring(columnVal, 0, len(columnPattern)) IN (<list of invalid partitions>)
which again requires the use of a predicate function.
Getting all partitions is a metadata op, but checking every data record against them will need data file reads. Happy to iterate on it if you have other ideas.
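To make the gotcha concrete, here is a minimal Python sketch of the check above written as a per-row function. The function name and sample values are hypothetical and not taken from the codebase. The point is only that the check takes each record's column value as input, so an engine would need to read data files to evaluate it rather than relying on partition metadata alone.

```python
def matches_invalid_partition(column_val, column_pattern, invalid_partitions):
    """Per-row mirror of:
    substring(columnVal, 0, len(columnPattern)) IN (<list of invalid partitions>)
    """
    # Take the leading characters of the value, as many as the pattern is long.
    prefix = column_val[:len(column_pattern)]
    return prefix in invalid_partitions


# Hypothetical sample data, for illustration only.
invalid = {"2024-13-01"}
rows = ["2024-13-01-extra", "2024-01-15"]

# Every row value has to be inspected to decide whether it is flagged.
flagged = [v for v in rows if matches_invalid_partition(v, "yyyy-MM-dd", invalid)]
```

Note that this assumes the formatted value has the same length as the pattern string, which does not hold for every pattern.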
Summary
The deletion logic in the retention job used a complex predicate to filter rows based on partition columns for string-partitioned tables. This leads to problems with delete ops:
overwrite
type
The change:
Query Plan with new Query:
BatchScan shows that only partitions matching the specific value in the filter are scanned.
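As a rough illustration of the general idea in the Summary (comparing the raw partition column against a precomputed literal instead of wrapping the column in a function), here is a minimal Python sketch. It is not the query from this pull request: the function name, table name, column name, and the use of a `strftime` pattern in place of the job's columnPattern are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def build_delete_statement(table, column, strftime_pattern, retention_days, now):
    """Build a DELETE whose predicate compares the partition column
    to a literal cutoff computed up front (hypothetical helper)."""
    # Compute the cutoff once, outside the query, and format it like the column values.
    cutoff = (now - timedelta(days=retention_days)).strftime(strftime_pattern)
    # The column itself is left untouched, so the filter is a plain comparison.
    return f"DELETE FROM {table} WHERE {column} < '{cutoff}'"


# Hypothetical table and column names, for illustration only.
statement = build_delete_statement(
    "db.events", "datepartition", "%Y-%m-%d-%H", 30, datetime(2024, 3, 31, 12, 0)
)
```

A plain string comparison like this only behaves as a time comparison when the pattern orders lexicographically (year first, zero-padded fields), so it would not be safe for every pattern.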
Testing Done
Mint Tests
Fixed unit tests to account for the updated query.
Ran tests with a local dataset to validate that the delete query deletes records as intended.