quantcast / qfs

Quantcast File System
https://quantcast.atlassian.net
Apache License 2.0

Support for Hadoop getScheme() #256

Closed michaelkamprath closed 2 years ago

michaelkamprath commented 2 years ago

Some packages in Apache Spark, notably Spark NLP, attempt to download models or data to the default filesystem of the Spark deployment. When doing this, the package calls the configured file system's getScheme() method (documentation) to construct a URI for the path where the downloaded content will be saved. The current Hadoop filesystem connector QuantcastFileSystem2 does not implement this simple method. This pull request fixes that and adds a test case for the method.
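
For illustration, the failing call pattern looks roughly like this; a hedged sketch, not Spark NLP's actual code, with a made-up host and paths matching the configuration below. In Hadoop 2.x the FileSystem base class throws UnsupportedOperationException from getScheme() unless the connector overrides it, which is what breaks these downloads.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Resolves fs.defaultFS, e.g. qfs://qfs-master:20000 -> QuantcastFileSystem2
    FileSystem fs = FileSystem.get(conf);

    // Without an override this throws UnsupportedOperationException from the
    // FileSystem base class; with this pull request it returns "qfs".
    String scheme = fs.getScheme();

    // The package then builds a destination URI from the scheme and writes the
    // downloaded content there (paths are purely illustrative).
    URI dest = URI.create(scheme + "://qfs-master:20000/tmp/downloaded_model");
    fs.copyFromLocalFile(new Path("/tmp/local_model"), new Path(dest));
  }
}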

This build has been tested in my deployment of Spark 3.2.1 on Debian 9 with the following (relevant) configurations:

spark.hadoop.fs.qfs.impl            com.quantcast.qfs.hadoop.QuantcastFileSystem2
spark.hadoop.fs.defaultFS           qfs://qfs-master:20000
spark.hadoop.fs.qfs.metaServerHost  qfs-master
spark.hadoop.fs.qfs.metaServerPort  20000
codecov-commenter commented 2 years ago

Codecov Report

Merging #256 (5a2528f) into master (5f1ddfd) will decrease coverage by 0.11%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #256      +/-   ##
==========================================
- Coverage   55.19%   55.08%   -0.12%     
==========================================
  Files         304      304              
  Lines       93812    93813       +1     
==========================================
- Hits        51784    51675     -109     
- Misses      42028    42138     +110     
Impacted Files | Coverage Δ
src/cc/meta/IdempotentRequestTracker.cc | 72.84% <0.00%> (-6.47%) ↓
src/cc/meta/Restorer.cc | 69.98% <0.00%> (-1.05%) ↓
src/cc/chunk/DiskIo.cc | 66.25% <0.00%> (-0.98%) ↓
src/cc/libclient/KfsNetClient.cc | 64.50% <0.00%> (-0.96%) ↓
src/cc/meta/NetDispatch.cc | 75.45% <0.00%> (-0.91%) ↓
src/cc/meta/MetaRequestHandler.cc | 79.59% <0.00%> (-0.67%) ↓
src/cc/libclient/Reader.cc | 71.71% <0.00%> (-0.63%) ↓
src/cc/meta/ClientSM.cc | 69.50% <0.00%> (-0.54%) ↓
src/cc/kfsio/IOBuffer.cc | 72.85% <0.00%> (-0.51%) ↓
src/cc/meta/ChunkServer.cc | 57.78% <0.00%> (-0.47%) ↓
... and 10 more

mikeov commented 2 years ago

Thank you, Michael.

The Hadoop shim change looks good to me. However, the code duplication in the unit tests is concerning.

As far as I can tell, the QFSEmulationImpl.java under hadoop-qfs2 is an exact copy of the same file under hadoop-qfs. I think that such duplication would create code maintenance problems.

TestQuantcastFileSystem2.java looks similar to TestQuantcastFileSystem.java, with the IO test methods and testHDFSCompatibility() removed, and with QuantcastFileSystem2 instantiated instead of QuantcastFileSystem. Perhaps there is a way to re-use TestQuantcastFileSystem by deriving TestQuantcastFileSystem2 from it?

At the moment I don’t recall how the Hadoop unit tests are handled for different versions, if at all. I’ll have to take a closer look to see if there is a way to address the code duplication in the unit tests.

I think for now, unless there is a simple way to address the unit test problem, it might be reasonable to go with only adding the required method to the Hadoop shim, and rework the unit tests later.

michaelkamprath commented 2 years ago

As far as I can tell, the QFSEmulationImpl.java under hadoop-qfs2 is an exact copy of the same file under hadoop-qfs. I think that such duplication would create code maintenance problems.

You are right. To test the new code, I needed to pull in QuantcastFileSystem2 when building for Hadoop 2 but not when building for Hadoop 0.23, given the way the javabuild.sh script works. The easiest way to do this was to create a new test case under hadoop-qfs2, and I simply copy-pasted the file. I can remove QFSEmulationImpl.java under hadoop-qfs2 and the test still works fine.

TestQuantcastFileSystem2.java looks similar to TestQuantcastFileSystem.java, with the IO test methods and testHDFSCompatibility() removed, and with QuantcastFileSystem2 instantiated instead of QuantcastFileSystem. Perhaps there is a way to re-use TestQuantcastFileSystem by deriving TestQuantcastFileSystem2 from it?

Yeah, I can do this. Even with inheriting from TestQuantcastFileSystem, we will still need a setUp() method in TestQuantcastFileSystem2 that creates a QuantcastFileSystem2 object rather than a QuantcastFileSystem object. The other side effect is that all of the TestQuantcastFileSystem tests will additionally be run against the QuantcastFileSystem2 instance, but I think that is actually a good thing.
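
A minimal sketch of that inheritance, assuming the base test exposes its Configuration, emulation layer, and filesystem through inherited fields; the field names and constructor arguments here are placeholders, not the actual test code:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;

public class TestQuantcastFileSystem2 extends TestQuantcastFileSystem {
  @Override
  protected void setUp() throws IOException {
    conf = new Configuration();
    qfsEmul = new QFSEmulationImpl(conf);
    // Wrap the same emulation layer in the newer shim so every inherited
    // test also runs against QuantcastFileSystem2.
    qfs = new QuantcastFileSystem2(qfsEmul, null);
    qfs.initialize(URI.create("qfs://test:12345"), conf);
  }

  public void testGetScheme() {
    assertEquals("qfs", qfs.getScheme());
  }
}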

I think for now, unless there is a simple way to address the unit test problem, it might be reasonable to go with only adding the required method to the Hadoop shim, and rework the unit tests later.

I think it is good practice to try to create unit tests, but if these changes still don't produce a good result, we can certainly remove them.

mikeov commented 2 years ago

I'm wondering whether getScheme() was perhaps missing from the beginning and really belongs in QuantcastFileSystem. To put it another way, is there any downside to adding this method to QuantcastFileSystem, like this:

diff --git a/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java b/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java
index 6428faba..0dc199c8 100644
--- a/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java
+++ b/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java
@@ -61,6 +61,10 @@ public class QuantcastFileSystem extends FileSystem {
     return uri;
   }

+  public String getScheme() {
+    return uri.getScheme();
+  }
+
   public void initialize(URI uri, Configuration conf) throws IOException {
     super.initialize(uri, conf);
     setConf(conf);
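
(One possible upside of returning uri.getScheme() here, rather than a hard-coded "qfs", is that the result always matches whatever URI the filesystem was initialized with; either way, callers that currently hit the base class's unimplemented getScheme() would get a usable scheme back.)
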
michaelkamprath commented 2 years ago

@mikeov The reason the getScheme() method must go in QuantcastFileSystem2 is that, if it were in QuantcastFileSystem, the v0.23 Hadoop shims would not build. I am assuming that method wasn't defined in the Hadoop 0.23 API, but admittedly I didn't dig deeply to figure out when getScheme() was introduced.

Thanks!

mikeov commented 2 years ago

Thank you, Michael. It looks to me like 0.23.11 builds OK with getScheme(), but 0.23.4 does not build with or without it under a modern enough compiler, so I suspect that is a different problem. I'll probably move the method into QuantcastFileSystem, and perhaps get rid of 0.23.4. Most likely 0.23.4 was effectively removed from all builds except CentOS 5, and the latter is no longer supported anyway.