spotify / snakebite

A pure python HDFS client
Apache License 2.0
858 stars 216 forks source link

[Feature request] Short Circuits Reads #48

Open ogrisel opened 10 years ago

ogrisel commented 10 years ago

This is a feature request to add the possibility to leverage HDFS Short Circuit Reads from snakebite:

Since HDFS-347 was fixed in Hadoop 2.1.0, it is possible for HDFS clients collocated with a DataNode to directly access pre-opened file descriptors for the blocks of HDFS files hosted by the data datanode, making it possible for the client to read raw data without any overhead nor memory copy.

More details in this blog post:

http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/

This feature is even more interesting when combined with the new CentralizedCacheManagement of Hadoop 2.3.0 that makes it possible to warm-up a and lock the blocks of specific HDFS files in the underlying operating system disk cache.

The two features combined make it possible to do full in-memory access of HDFS data directly from Python (optionally by the Python's mmap module or numpy.memmap).

So this feature request is about adding SCR support to snakebite only.

Exposing cached block info from the name node via RPC access along with caching directive control in some new snakebite functions would also be interesting to fully leverage collocation of the execution of the HDFS client program with cached blocks. However this can be probably be implemented independently from the SCR feature.

ogrisel commented 10 years ago

Note that file descriptors can only be passed via Unix domain sockets via specific system calls (sendmsg on the data node side, recvmsg on the client side). This is possible in pure Python as of 3.3: http://docs.python.org/3.4/library/socket.html#socket.socket.recvmsg

ogrisel commented 10 years ago

BTW: the motivating use case would be to schedule Python applications for data analytics that would run collocated on a HDFS cluster via the Apache Hadoop YARN or Apache Mesos cluster schedulers. Both schedulers provide protobuf / RPC interface that can make the Python code cluster aware and therefore leverage data locality (disk locality in Hadoop 2.1+ and even memory-locality in Hadoop 2.3+).

rdn32 commented 8 years ago

This [reading file descriptors via recvmsg] is possible in pure Python as of 3.3

It is possible to call recvmsg in earlier versions of Python as well, even though it isn't exposed as part of the socket module. I've had a crack at implementing stuff to do this with ctypes here: https://gist.github.com/rdn32/eb2b823451187b5db363