Open dieu opened 7 years ago
The worry with falling back to defaults when we fail to calculate the size of a given Hfs is that it makes issues less likely to be caught (not many users will be paying attention to their logs and noticing that 1 of n taps failed size estimation), and, as we've discussed in other contexts, it might result in a different number of reducers across runs. The other risk of swallowing a failure for 1 of n taps is that the failed tap might have been the one with the largest amount of data.
Would it make sense to let users opt into the fallback to defaults explicitly if they really want it? If they don't ask for the fallback, we fail on any estimation error?
How about we fix the Hfs tap's .getSize method? Or create a subclass of Hfs, like GlobHfs, whose .getSize method works correctly?
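A minimal sketch of what a glob-aware .getSize could do. This is not cascading's real API: it uses java.nio.file stand-ins for Hadoop's FileSystem/FileStatus, and the class and method names are assumptions for illustration. The point is just that expanding the glob before stat-ing avoids the FileNotFoundException that Hfs.getSize hits on paths like /logs/*.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a glob-aware size calculation, analogous to what a
// GlobHfs-style subclass of Hfs could do in its getSize override.
public class GlobSizeEstimator {

    // Sum the sizes of all regular files in `dir` matching a glob like "*".
    // Stat-ing the literal path "/logs/*" fails with FileNotFoundException;
    // expanding the glob first and summing the matches avoids that.
    static long globSize(Path dir, String glob) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, glob)) {
            for (Path p : matches) {
                if (Files.isRegularFile(p)) {
                    total += Files.size(p);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path logs = Files.createTempDirectory("logs");
        Files.write(logs.resolve("a.log"), new byte[10]);
        Files.write(logs.resolve("b.log"), new byte[20]);
        // Both files match the glob, so their sizes are summed.
        System.out.println(globSize(logs, "*")); // prints 30
    }
}
```

In the real fix the same shape would apply, with Hadoop's FileSystem.globStatus doing the expansion instead of a directory stream.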
I don't think we should ignore exceptions. Best guesses usually lead to more headaches for us than failing fast.
@isnotinvain
@isnotinvain I tried to adopt your suggestion about a subclass of Hfs. Please take another look at #1652.
This issue is to discuss the right implementation of #1652.
Original problem: size estimation currently relies on `tap.getPath`, but some sources return the wrong path to the source files/dirs, because the internal implementation of these sources needs `JobConf` to compute the right path (for instance VKVS). Basically, this should be handled inside `tap.getSize(JobConf)`, but unfortunately cascading's implementation of `getSize` in the `Hfs` tap does its calculation without respecting paths with glob patterns (it raises `java.io.FileNotFoundException` for a path like `/logs/*`).

A safer approach to address this:

- Call `tap.getSize(JobConf)` for `Hfs` taps as the first step; if the tap understands how to return the right size of the source, we should use it.
- If `tap.getSize(JobConf)` returns 0 or raises any exception, we should try to calculate the size ourselves.

The question is:

- If we fail to estimate the size of `Hfs` taps, should we fail the job? Or should we disable InputSizeEstimator and fall back to defaults?
- Should we fix the glob-pattern handling in cascading's `Hfs` tap?

cc: @isnotinvain / @johnynek / @piyushnarang
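The two-step order proposed above (trust the tap's own size first, fall back to our own calculation only when it returns 0 or throws) can be sketched as below. The `SizedTap` interface and method names are stand-ins for cascading's `Tap`/`tap.getSize(JobConf)`, not the real scalding API.

```java
import java.util.function.LongSupplier;

// Stand-in for cascading's Tap; its getSize may throw, e.g. when the
// source needs a JobConf internally to resolve its path.
interface SizedTap {
    long getSize() throws Exception;
}

public class TapSizeResolver {
    // Returns the tap-reported size when it is usable; otherwise falls back
    // to the caller-supplied estimator (e.g. our own glob-aware calculation).
    static long resolveSize(SizedTap tap, LongSupplier ownEstimate) {
        try {
            long size = tap.getSize();
            if (size > 0) {
                return size; // the tap knows its own size; use it
            }
        } catch (Exception e) {
            // The tap could not compute its size; fall through to our own
            // estimate rather than failing here.
        }
        return ownEstimate.getAsLong();
    }

    public static void main(String[] args) {
        // A tap that reports its size correctly: its value wins.
        System.out.println(resolveSize(() -> 42L, () -> 7L)); // prints 42
        // A tap that returns 0: fall back to our estimate.
        System.out.println(resolveSize(() -> 0L, () -> 7L)); // prints 7
        // A tap that throws: also fall back.
        System.out.println(resolveSize(
            () -> { throw new Exception("needs JobConf"); }, () -> 7L)); // prints 7
    }
}
```

Whether the final fallback should instead fail fast (per the discussion above) is exactly the open question; this sketch only shows the ordering of the two size sources.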