Open lxynov opened 4 years ago
In order to resolve this, I feel we might have to revert the changes in https://github.com/prestosql/presto/pull/1812, although reverting will make it inefficient to drop a large number of partitions. @rohangarg, @electrum, @sopel39, could you comment on this?
@lxynov let me rephrase to check whether I understand. We delete two partitions, `A=val_a/B=val_b1` and `A=val_a/B=val_b2`, and the metastore does the equivalent of `rmdir -p A=val_a/B=val_b1` and `rmdir -p A=val_a/B=val_b2` (in threads, because we call it in threads). The ordering of events can cause both threads to realize `A=val_a/` is now empty and should be removed. Am I correct?
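The interleaving described above can be replayed deterministically on a local filesystem. This is only a sketch of the suspected ordering, not HMS code: the two "threads" are simulated by running their steps in a fixed order, and the temporary directory is an assumption made for illustration.

```python
import os
import tempfile

# Hypothetical layout mirroring the two partitions in the example.
root = tempfile.mkdtemp()
parent = os.path.join(root, "A=val_a")
child1 = os.path.join(parent, "B=val_b1")
child2 = os.path.join(parent, "B=val_b2")
os.makedirs(child1)
os.makedirs(child2)

# Step 1: each "thread" removes its own partition directory.
os.rmdir(child1)
os.rmdir(child2)

# Step 2: both "threads" inspect the parent before either removes it,
# so both conclude that it is empty and should be removed.
thread1_sees_empty = not os.listdir(parent)
thread2_sees_empty = not os.listdir(parent)

# Step 3: the first removal succeeds, the second finds nothing to remove.
os.rmdir(parent)
try:
    os.rmdir(parent)
    second_delete_error = None
except FileNotFoundError as exc:
    second_delete_error = exc

os.rmdir(root)  # clean up the temporary directory
```

The second removal fails even though the end state (parent directory gone) is exactly what both callers wanted.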
Since partition deletions can be caused by different queries, it seems legitimate for them to coincide on the metastore side. So the failure to delete properly looks like a bug on the metastore side. Has it already been reported?
I checked, and HMS does try to delete empty parents for a partition directory (here). Also, the Hive directory-deletion code tries to ignore FNFE (`FileNotFoundException`), but in many cases the exception isn't captured or passed along properly in the delete or rename calls. I am not sure why the parent deletion is done in HMS, though.
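As a rough illustration of what "ignoring FNFE" means for parent cleanup, here is a minimal sketch of a tolerant delete. It is not the Hive implementation; the helper name `remove_empty_dir_quietly` is hypothetical, and it works on a local filesystem rather than HDFS.

```python
import errno
import os
import tempfile


def remove_empty_dir_quietly(path):
    """Remove an empty directory; return True only if this call removed it.

    A directory that is already gone (removed by a concurrent caller) or
    that is still non-empty is treated as "nothing to do", not as an error.
    """
    try:
        os.rmdir(path)
        return True
    except FileNotFoundError:
        # Someone else removed it first; the desired end state is reached.
        return False
    except OSError as exc:
        if exc.errno in (errno.ENOTEMPTY, errno.EEXIST):
            # A sibling partition still lives here; leave the parent alone.
            return False
        raise


# Calling the helper twice on the same directory no longer fails.
demo_dir = tempfile.mkdtemp()
first_call = remove_empty_dir_quietly(demo_dir)
second_call = remove_empty_dir_quietly(demo_dir)
```

Because the delete is idempotent, it does not matter which of two concurrent callers gets there first.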
Also, @lxynov I think your exception shows an error in rename because the `skipTrash` property is not set in the Hive metastore.
For a workaround currently, you can try setting the `hive.max-concurrent-metastore-drops` property to 1.
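To see why limiting concurrency to 1 sidesteps the problem, here is a small sketch, assuming each drop removes its partition directory and then the parent if it is empty. The functions `drop_partition` and `drop_all` are hypothetical stand-ins on a local filesystem, not Presto or HMS code.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def drop_partition(partition_dir):
    """Stand-in for one drop: remove the partition directory, then
    remove the parent directory if it has become empty."""
    os.rmdir(partition_dir)
    parent_dir = os.path.dirname(partition_dir)
    if not os.listdir(parent_dir):
        os.rmdir(parent_dir)


def drop_all(partition_dirs, max_concurrent_drops):
    """Run the drops on a pool of the given size and collect any errors."""
    errors = []
    with ThreadPoolExecutor(max_workers=max_concurrent_drops) as pool:
        futures = [pool.submit(drop_partition, d) for d in partition_dirs]
        for future in futures:
            exc = future.exception()
            if exc is not None:
                errors.append(exc)
    return errors


root = tempfile.mkdtemp()
parent = os.path.join(root, "A=a0")
partitions = [os.path.join(parent, f"B=b{i}") for i in range(4)]
for partition in partitions:
    os.makedirs(partition)

# With a single worker the drops run one after another, so only the
# last drop ever sees an empty parent.
errors = drop_all(partitions, max_concurrent_drops=1)
parent_still_exists = os.path.exists(parent)
os.rmdir(root)  # clean up the temporary directory
```

The cost is the one raised earlier in the thread: dropping a large number of partitions becomes sequential.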
Without the metastore change, for parallel drops I think we would require the knowledge of storage locations of partitions to avoid this issue.
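One way to read this: if the storage locations were known, drops that share a parent directory could be serialized while drops under different parents still run in parallel. The sketch below only shows the grouping step; the paths and the helper `group_by_parent` are hypothetical, and it assumes the race is confined to partitions that share an immediate parent.

```python
import posixpath
from collections import defaultdict


def group_by_parent(partition_locations):
    """Group partition locations by their immediate parent directory."""
    groups = defaultdict(list)
    for location in partition_locations:
        parent = posixpath.dirname(location.rstrip("/"))
        groups[parent].append(location)
    return dict(groups)


# Hypothetical storage locations for three partitions of one table.
locations = [
    "/warehouse/t/A=a0/B=b0",
    "/warehouse/t/A=a0/B=b1",
    "/warehouse/t/A=a1/B=b0",
]
groups = group_by_parent(locations)
```

Each group could then be dropped sequentially on its own worker, so no two concurrent drops compete for the same parent.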
Thanks @findepi for the comment. Yes your description is accurate. And I think your argument is valid. Ideally HMS should be able to handle such cases.
Thanks @rohangarg for looking into this. Good to know that Hive does try to ignore FNFE when deleting directories. We're running Hive 1.2, so the behavior might be different from the latest version. Thanks for your suggestions on `skipTrash` and `hive.max-concurrent-metastore-drops`. I'll try these options for now.
Some DELETE queries are failing with a `The transaction didn't commit cleanly` error.
Queries to reproduce the issue:
Stacktrace in Presto:
Error logs in `hcat.err`:
I think the issue arises because multiple `DROP PARTITION` requests are sent to HMS in parallel. Consider a table partitioned by columns A and B. Its file structure may be like:
When a `DELETE FROM table_name WHERE A = a0` is executed, multiple `DROP PARTITION` requests are sent to HMS in parallel. For each `DROP PARTITION` request, according to the observed behavior, HMS will first try to delete the sub-directory (`A=a0/B=b0`), and then delete directory `A=a0` if it's empty. However, there's a race condition when executing in parallel. Directory `A=a0` may be deleted by multiple `DROP PARTITION` requests in parallel, which leads to the `FileNotFoundException` shown in `hcat.err`.
Although the query failure doesn't affect query correctness (the directory is deleted cleanly), its occurrence prevents one of our scheduled workflows from continuing.