twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

InputSizeEstimator should use right paths to estimate size. #1653

Open dieu opened 7 years ago

dieu commented 7 years ago

to discuss the right implementation of #1652.

Original problem:

  1. original implementation was based on tap.getPath, but some sources return the wrong path of source files/dirs because internal implementation of this sources needs JobConf to compute the right path (for instance VKVS).

Basically, it should be implemented inside tap.getSize(JobConf), but unfortunately, cascading implementation of getSize in Hfs tap do calculations without respect of paths with glob patterns (it will raise java.io.FileNotFoundException for path like /logs/*).

A Safer approach to address this:

  1. Calculate input size only for Hfs taps
  2. Ask tap.getSize(JobConf) as the first step, if Tap understands how to return the right size of the source, we should use it.
  3. If tap.getSize(JobConf) return 0 or raise any exception, we should try to calculate size by ourselves.

the question is:

  1. If we can not calculate the size of Hfs taps, should we fail? or we should disable InputSizeEstimator and fallback to defaults?
  2. If we can not calculate the size only for one of Hfs tap?

cc: @isnotinvain / @johnynek / @piyushnarang

piyushnarang commented 7 years ago

The worries with fallbacks in case of issues calculating the size of a given Hfs is that it's less likely for us to catch issues (not too many users will be paying attention to their logs and seeing 1 or n taps failed to be size estimated) and I think like we've discussed in different contexts, it might result in different number of reducers in different runs. I think the other risk with swallowing 1 of n tap's failing to estimate is that it might have been the tap with the largest amount of data.

Would it make sense to allow users to make the fallback to defaults explicit if they really want it? If they don't say they want a defaults fallback, we fail in case of issues?

isnotinvain commented 7 years ago

How about we fix Hfs tap's .getSize method? Or create a sub-class of Hfs like GlobHfs whose .getSizes method works correctly?

I don't think we should ignore exceptions. Best guesses usually lead to more headaches for us than failing fast.

dieu commented 7 years ago

@isnotinvain

  1. Hfs tap is cacading code, I'm not sure that we can fix it.
  2. Create GlobHfs is possible, but then we should change everywhere usage of Hfs to GlobHfs, that hard to test and more dangerous.
dieu commented 7 years ago

@isnotinvain Try to adopt your suggestion about sub-class of Hfs. please check again #1652