microsoft / hyperspace

An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.
https://aka.ms/hyperspace
Apache License 2.0

Use FileContext api instead of FileSystem for atomic renames (in IndexLogManager). #26

Open apoorvedave1 opened 4 years ago

apoorvedave1 commented 4 years ago

Describe the issue

The operation log in Hyperspace relies on atomic renames of log files to support concurrent operations. These operations use the org.apache.hadoop.fs.FileSystem.rename() API, which doesn't provide atomicity guarantees as strong as those of org.apache.hadoop.fs.FileContext.rename().

Expected behavior

Better atomicity guarantee

More Details

From org.apache.spark.sql.execution.streaming.CheckpointFileManager, which also relies on atomic renames of checkpoints (similar to the atomic renames of Hyperspace operation logs):

// Try to create a [checkpoint] manager based on FileContext [instead of FileSystem] because HDFS's FileContext.rename()
// gives atomic renames, which is what we rely on for the default implementation
// `CheckpointFileManager.createAtomic`
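To make the pattern under discussion concrete, here is a minimal sketch of "write to a temp file, then atomically publish under the final name" on a local POSIX filesystem. It is an analogy only, not Hyperspace or Hadoop code: the function name `write_log_entry`, the `log_dir` layout, and the integer entry ids are all hypothetical. It also swaps in a hard link (`os.link`) for the rename, because `os.rename` on POSIX silently replaces an existing destination, whereas the operation log needs the publish step to fail when another writer got there first. (For comparison, the Hadoop javadoc for FileContext.rename states that the rename fails if the destination exists and the OVERWRITE option is not passed.)

```python
import os
import tempfile
import uuid


def write_log_entry(log_dir: str, entry_id: int, payload: str) -> bool:
    """Publish log entry `entry_id`; return False if that id already exists.

    Hypothetical helper for illustration; not part of Hyperspace.
    """
    final_path = os.path.join(log_dir, str(entry_id))
    temp_path = os.path.join(log_dir, f".tmp-{uuid.uuid4().hex}")

    # Write the complete contents to a private temp file first, so readers
    # never see a partially written entry under the final name.
    with open(temp_path, "w", encoding="utf-8") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    try:
        # os.link raises FileExistsError if final_path already exists, so
        # only one concurrent writer can claim a given entry id.
        os.link(temp_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        # The temp name is no longer needed whether or not we won the race.
        os.remove(temp_path)
```

Calling `write_log_entry(d, 1, "CREATING")` twice on the same directory `d` returns `True` the first time and `False` the second, and the file keeps the first writer's contents. A check-then-rename sequence (test for existence, then rename) would not give this guarantee, since another writer can slip in between the two steps.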

Environment

NA

clee704 commented 3 years ago

It seems FileContext.rename has the same semantics: the atomicity is implementation-dependent. Source: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileContext.html#rename-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Options.Rename...-

I assume this is not a serious problem, as Hadoop requires atomic rename from any Hadoop-compatible file system: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html#Atomicity