Closed: michaelkamprath closed this pull request 2 years ago.
Merging #256 (5a2528f) into master (5f1ddfd) will decrease coverage by 0.11%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #256 +/- ##
==========================================
- Coverage 55.19% 55.08% -0.12%
==========================================
Files 304 304
Lines 93812 93813 +1
==========================================
- Hits 51784 51675 -109
- Misses 42028 42138 +110
Impacted Files | Coverage Δ | |
---|---|---|
src/cc/meta/IdempotentRequestTracker.cc | 72.84% <0.00%> (-6.47%) | :arrow_down: |
src/cc/meta/Restorer.cc | 69.98% <0.00%> (-1.05%) | :arrow_down: |
src/cc/chunk/DiskIo.cc | 66.25% <0.00%> (-0.98%) | :arrow_down: |
src/cc/libclient/KfsNetClient.cc | 64.50% <0.00%> (-0.96%) | :arrow_down: |
src/cc/meta/NetDispatch.cc | 75.45% <0.00%> (-0.91%) | :arrow_down: |
src/cc/meta/MetaRequestHandler.cc | 79.59% <0.00%> (-0.67%) | :arrow_down: |
src/cc/libclient/Reader.cc | 71.71% <0.00%> (-0.63%) | :arrow_down: |
src/cc/meta/ClientSM.cc | 69.50% <0.00%> (-0.54%) | :arrow_down: |
src/cc/kfsio/IOBuffer.cc | 72.85% <0.00%> (-0.51%) | :arrow_down: |
src/cc/meta/ChunkServer.cc | 57.78% <0.00%> (-0.47%) | :arrow_down: |
... and 10 more | | |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f1ddfd...5a2528f.
Thank you, Michael.
The Hadoop shim change looks good to me. However, the code duplication in the unit tests is concerning.
As far as I can tell, the QFSEmulationImpl.java under hadoop-qfs2 is an exact copy of the same file under hadoop-qfs. I think that such duplication would create code maintenance problems.
TestQuantcastFileSystem2.java looks similar to TestQuantcastFileSystem.java, with the IO test methods removed, testHDFSCompatibility() added, and QuantcastFileSystem2 instantiated instead of QuantcastFileSystem. Perhaps there is a way to re-use TestQuantcastFileSystem by deriving TestQuantcastFileSystem2 from it?
At the moment I don’t recall how hadoop unit tests are handled for different versions, if at all. I’ll have to take a closer look at this to see if there is a way to address problems with code duplication in unit tests.
I think for now, unless there is a simple way to address the unit test problem, it might be reasonable to go with only adding the required method to the Hadoop shim and rework the unit tests later.
As far as I can tell, the QFSEmulationImpl.java under hadoop-qfs2 is an exact copy of the same file under hadoop-qfs. I think that such duplication would create code maintenance problems.
You are right. To test the new code, I needed to pull in QuantcastFileSystem2 when building for Hadoop 2 but not when building for Hadoop 0.23, given the way the javabuild.sh script works. The easiest way to do this was to create a new test case under hadoop-qfs-2, and I simply copy-pasted it. I can remove QFSEmulationImpl.java under hadoop-qfs2 and the test still works fine.
TestQuantcastFileSystem2.java looks similar to TestQuantcastFileSystem.java, with the IO test methods removed, testHDFSCompatibility() added, and QuantcastFileSystem2 instantiated instead of QuantcastFileSystem. Perhaps there is a way to re-use TestQuantcastFileSystem by deriving TestQuantcastFileSystem2 from it?
Yeah, I can do this. Even with inheriting from TestQuantcastFileSystem, we will still need a setUp() method in TestQuantcastFileSystem2 to create the QuantcastFileSystem2 object rather than the QuantcastFileSystem object. The other side effect of this is that all of the TestQuantcastFileSystem tests will additionally be run against the QuantcastFileSystem2 instance, but I think that is actually a good thing.
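For illustration, a rough JUnit 3 style sketch of that inheritance (the protected fs field and the setup details are placeholders, not the actual test code):

```java
public class TestQuantcastFileSystem2 extends TestQuantcastFileSystem {
  @Override
  protected void setUp() throws Exception {
    // Create the Hadoop 2 connector instead of QuantcastFileSystem, so every
    // test inherited from TestQuantcastFileSystem also runs against it.
    fs = new QuantcastFileSystem2();
    // ... remaining initialization against the QFS emulation used by the tests ...
  }

  public void testHDFSCompatibility() throws Exception {
    // Hadoop 2 specific checks, including the new getScheme() method, go here.
  }
}
```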
I think for now, unless there is a simple way to address the unit test problem, it might be reasonable to go with only adding the required method to the Hadoop shim and rework the unit tests later.
I think it is good practice to try to create unit tests, but if these changes still don't produce a good result, we can certainly remove them.
I'm wondering whether getScheme() was missing from the beginning and really belongs in QuantcastFileSystem. To put it another way, is there any downside to adding this method to QuantcastFileSystem, like this:
diff --git a/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java b/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java
index 6428faba..0dc199c8 100644
--- a/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java
+++ b/src/java/hadoop-qfs/src/main/java/com/quantcast/qfs/hadoop/QuantcastFileSystem.java
@@ -61,6 +61,10 @@ public class QuantcastFileSystem extends FileSystem {
     return uri;
   }
 
+  public String getScheme() {
+    return uri.getScheme();
+  }
+
   public void initialize(URI uri, Configuration conf) throws IOException {
     super.initialize(uri, conf);
     setConf(conf);
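For context, a minimal sketch of how a client such as the Spark packages mentioned in the pull request description might consume getScheme() (the helper name and path handling below are illustrative assumptions, not code from the PR or from Spark NLP):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeUsageSketch {
  // Builds a fully qualified URI on the deployment's default filesystem.
  // Without an override, FileSystem.getScheme() in Hadoop 2.x throws
  // UnsupportedOperationException, which is what this change avoids for QFS.
  static URI destinationUri(Configuration conf, String relativePath) throws Exception {
    FileSystem fs = FileSystem.get(conf);           // e.g. QuantcastFileSystem2 for qfs:// deployments
    String scheme = fs.getScheme();                 // "qfs" once the connector implements it
    String authority = fs.getUri().getAuthority();  // metaserver host:port
    return new Path(scheme + "://" + authority + "/" + relativePath).toUri();
  }
}
```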
@mikeov The reason the getScheme() method must go in QuantcastFileSystem2 is that if it were in QuantcastFileSystem, the v0.23 Hadoop shims would not build. I am assuming that interface wasn't defined in the Hadoop 0.23 API, but admittedly I didn't dig deeply enough to figure out when getScheme() was introduced.
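For reference, a rough sketch of the arrangement described above, assuming QuantcastFileSystem2 extends QuantcastFileSystem in the Hadoop 2 only shim (the exact class layout and return value in the PR may differ):

```java
// Lives only under the hadoop-qfs2 shim, so the Hadoop 0.23 build, whose
// FileSystem API predates getScheme(), never has to compile this override.
public class QuantcastFileSystem2 extends QuantcastFileSystem {
  @Override
  public String getScheme() {
    return "qfs";  // the actual PR may instead derive this from the filesystem URI
  }
}
```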
Thanks!
Thank you, Michael. It looks to me like 0.23.11 builds OK with getScheme(), but 0.23.4 does not build with or without it with a modern enough compiler. I suspect that this is a different problem. I'll probably move it into QuantcastFileSystem, and perhaps get rid of 0.23.4. Most likely 0.23.4 was effectively removed from all builds except CentOS 5, and the latter is no longer supported anyway.
Some packages in Apache Spark, notably Spark NLP, attempt to download models or data to the default filesystem of the Spark deployment. When doing this, the package calls the configured filesystem's getScheme() method (documentation) to construct a URI for the path to save the downloaded content to. The current Hadoop filesystem connector, QuantcastFileSystem2, does not implement this simple method. This pull request fixes that and implements a test case for the method. This build has been tested in my deployment of Spark 3.2.1 on Debian 9 with the following (relevant) configurations: