globalmentor / hadoop-bare-naked-local-fs

GlobalMentor Hadoop local FileSystem implementation directly accessing the Java API without Winutils.

GlobalMentor Hadoop Bare Naked Local FileSystem

A Hadoop local FileSystem implementation directly accessing the Java API without Winutils, suitable for use with Spark.

The name of this project refers to the BareLocalFileSystem and NakedLocalFileSystem classes, and is a lighthearted reference to the Hadoop RawLocalFileSystem class, which NakedLocalFileSystem extends: a play on the Portuguese expression "a verdade, nua e crua" ("the raw, naked truth").

Usage

  1. If you have an application that needs Hadoop local FileSystem support without relying on Winutils, import the latest com.globalmentor:hadoop-bare-naked-local-fs library into your project, e.g. in Maven for v0.1.0:

    <dependency>
      <groupId>com.globalmentor</groupId>
      <artifactId>hadoop-bare-naked-local-fs</artifactId>
      <version>0.1.0</version>
    </dependency>
  2. Then specify that you want to use the Bare Local File System implementation com.globalmentor.apache.hadoop.fs.BareLocalFileSystem for the file scheme. (BareLocalFileSystem internally uses NakedLocalFileSystem.) The following example does this for Spark in Java:

    SparkSession spark = SparkSession.builder().appName("Foo Bar").master("local").getOrCreate();
    spark.sparkContext().hadoopConfiguration().setClass("fs.file.impl", BareLocalFileSystem.class, FileSystem.class);

Note that you may still get warnings that "HADOOP_HOME and hadoop.home.dir are unset" and "Did not find winutils.exe". This is because the Winutils kludge permeates the Hadoop code and is hard-coded at a low level, executed statically upon class loading, even for code completely unrelated to file access. See HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.

Limitations

Background

The Apache Hadoop FileSystem was designed tightly coupled to Unix file systems. It assumes POSIX file permissions. Its model definition is still sparse. Little thought was put into creating a general file access API that could be implemented across platforms. File access on non-*nix systems such as Windows was largely ignored, and few cared.

Unfortunately, the Hadoop FileSystem API has become something of a de facto common file storage layer for big data processing, essentially tying big data to *nix systems if local file access is desired. For example, Apache Spark pulls in Hadoop's FileSystem (and the entire Spark client access layer) to write output files to the local file system. Running Spark on Windows, even for prototyping, would be impossible without a Windows implementation of FileSystem.

RawLocalFileSystem, accessed indirectly via LocalFileSystem, is Hadoop's attempt at Java access of a local file system. It was written before Java added access to *nix-centric features such as POSIX file permissions. RawLocalFileSystem attempts to access the local file system using system libraries via JNI, and if that is not possible falls back to creating Shell processes that run *nix commands such as chmod or bash. This in itself represents a security concern, not to mention an inefficient kludge.
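For contrast with shelling out to chmod, here is a minimal sketch of setting POSIX permissions through the standard java.nio.file API. It is an illustration only, not code taken from Hadoop or from this project; the class name PermissionSketch and the method setPermissionsIfSupported are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical illustration; not part of Hadoop or of this library.
public class PermissionSketch {

  /**
   * Applies permissions given in symbolic form (e.g. "rw-r--r--") using only the Java API.
   * Returns true if they were applied, or false if the file system has no POSIX view.
   */
  public static boolean setPermissionsIfSupported(Path path, String symbolic) throws IOException {
    if (!path.getFileSystem().supportedFileAttributeViews().contains("posix")) {
      return false; // typically the case on Windows; no shell process is needed either way
    }
    Set<PosixFilePermission> permissions = PosixFilePermissions.fromString(symbolic);
    Files.setPosixFilePermissions(path, permissions);
    return true;
  }

  public static void main(String[] args) throws IOException {
    Path temp = Files.createTempFile("permission-sketch", ".txt");
    try {
      if (setPermissionsIfSupported(temp, "rw-r-----")) {
        String actual = PosixFilePermissions.toString(Files.getPosixFilePermissions(temp));
        System.out.println("permissions: " + actual);
      } else {
        System.out.println("POSIX permissions are not supported on this file system");
      }
    } finally {
      Files.deleteIfExists(temp);
    }
  }
}
```

The check against supportedFileAttributeViews() is what lets the same code run on Windows without failing: where there is no POSIX view, the call simply becomes a no-op instead of requiring a native helper.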

In order to allow RawLocalFileSystem to function on Windows (for example to run Spark), one Hadoop contributor created the winutils package. This represents a set of binary files that run on Windows and "intercept" the RawLocalFileSystem native calls. While the ability to run Spark on Windows was of course welcome, the design represents one more kludge on top of an existing kludge, requiring trust in yet more binary distributions and adding another potential vector for malicious code. (For these reasons there are Hadoop tickets such as HADOOP-13223: winutils.exe is a bug nexus and should be killed with an axe.)

This Hadoop Bare Naked Local File System project bypasses Winutils and forces Hadoop to access the file system via pure Java. The BareLocalFileSystem and NakedLocalFileSystem classes are versions of LocalFileSystem and RawLocalFileSystem, respectively, which bypass the outdated native and shell access to the local file system and use the Java API instead. This means that projects like Spark can access the file system on Windows as well as on other platforms, without the need to pull in some third-party kludge such as Winutils.
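As a rough illustration of what "use the Java API instead" can look like, the following sketch reads basic file metadata through java.nio.file, with no JNI and no child process. It is not the project's actual implementation; FileStatusSketch and describe are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical illustration; not part of Hadoop or of this library.
public class FileStatusSketch {

  /** Describes a path using only platform-neutral attributes from the Java API. */
  public static String describe(Path path) throws IOException {
    BasicFileAttributes attributes = Files.readAttributes(path, BasicFileAttributes.class);
    String type = attributes.isDirectory() ? "directory" : "file";
    return type + ", " + attributes.size() + " bytes";
  }

  public static void main(String[] args) throws IOException {
    Path temp = Files.createTempFile("status-sketch", ".txt");
    try {
      Files.write(temp, "hello".getBytes(StandardCharsets.UTF_8));
      System.out.println(describe(temp)); // prints "file, 5 bytes"
    } finally {
      Files.deleteIfExists(temp);
    }
  }
}
```

BasicFileAttributes is available on every platform the JVM supports, which is why it serves as a portable baseline here; owner and permission details would need the optional POSIX view where the file system provides one.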

Implementation Caveats (problems brought by LocalFileSystem and RawLocalFileSystem)

This Hadoop Bare Naked Local File System implementation extends LocalFileSystem and RawLocalFileSystem and "reverses" or "undoes" as much JNI and shell access as possible. Much of the original Hadoop kludge implementation is still present beneath the surface (meaning that "Bare Naked" is for the moment a bit of a misnomer).

Unfortunately, solving the problem of Hadoop's default local file system access isn't as simple as just changing native/shell calls to their modern Java equivalents. The current LocalFileSystem and RawLocalFileSystem implementations have evolved haphazardly, with halfway-implemented features scattered about, special-case code for ill-documented corner cases, and implementation-specific assumptions permeating the design itself. Here are a few examples.