twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Estimate input size only for Hfs and GlobHfs taps #1652

Closed: dieu closed this 7 years ago

piyushnarang commented 7 years ago

@dieu should we add a test for the file-not-found scenario? Or is it already covered?

dieu commented 7 years ago

@piyushnarang I'm not sure we can run these tests, because Hadoop itself will raise an error.

piyushnarang commented 7 years ago

Ok, can we test it manually then to ensure it works as expected?

dieu commented 7 years ago

@piyushnarang I added tests for FileNotFound. Log output from the test run:

2017-03-09 12:39:41,848 WARN  [pool-47-thread-1] reducer_estimation.InputSizeReducerEstimator$ (InputSizeReducerEstimator.scala:estimateReducersWithoutRounding(34)) - InputSizeReducerEstimator unable to estimate reducers; cannot compute size of one of (usually it's memory taps or files not found):
 - Hfs["TextLine[['offset', 'line']->[ALL]]"]["file.txt"]
2017-03-09 12:39:41,860 INFO  [pool-47-thread-1] flow.FlowStep (BaseFlowStep.java:logInfo(834)) - [com.twitter.scalding.r...] starting step: (1/1) counts.tsv
2017-03-09 12:39:42,402 INFO  [flow com.twitter.scalding.reducer_estimation.SimpleFileNotFoundJob] flow.Flow (BaseFlow.java:logInfo(1378)) - [com.twitter.scalding.r...] stopping all jobs
2017-03-09 12:39:42,403 INFO  [flow com.twitter.scalding.reducer_estimation.SimpleFileNotFoundJob] flow.FlowStep (BaseFlowStep.java:logInfo(834)) - [com.twitter.scalding.r...] stopping: (1/1) counts.tsv
2017-03-09 12:39:42,408 INFO  [flow com.twitter.scalding.reducer_estimation.SimpleFileNotFoundJob] flow.Flow (BaseFlow.java:logInfo(1378)) - [com.twitter.scalding.r...] stopped all jobs
2017-03-09 12:39:42,956 ERROR [ResourceManager Event Processor] resourcemanager.ResourceManager (ResourceManager.java:run(594)) - Returning, interrupted : java.lang.InterruptedException

dieu commented 7 years ago

@isnotinvain rewrote it in a safer way.

isnotinvain commented 7 years ago

@dieu before I forget: you mentioned this would break all normal Hfs instances, so we need to handle that too.

dieu commented 7 years ago

@isnotinvain no, we handle existing Hfs instances. That is why I wrap getSize on the tap in a Try: Cascading's Hfs doesn't handle glob patterns.
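
A rough sketch of the idea, not the code in this PR: only file-backed taps are sized, each size lookup is wrapped in `Try`, and any failure (for example a missing file) makes the whole estimate unavailable instead of throwing. The types `TapLike`, `HfsLike`, `GlobHfsLike`, `MemoryLike` and the functions `sizeOf` and `totalSize` are hypothetical stand-ins for illustration; they are not Cascading or Scalding APIs.

```scala
import scala.util.{Failure, Success, Try}

object InputSizeSketch {
  // Hypothetical stand-ins for taps; NOT the real cascading.tap.Tap hierarchy.
  sealed trait TapLike { def name: String }
  final case class HfsLike(name: String, size: () => Long) extends TapLike
  final case class GlobHfsLike(name: String, size: () => Long) extends TapLike
  final case class MemoryLike(name: String) extends TapLike

  // Size of one tap. Only the two file-backed kinds are measured, and any
  // non-fatal exception thrown by the lookup is captured as a Failure.
  def sizeOf(tap: TapLike): Try[Long] = tap match {
    case HfsLike(_, size)     => Try(size())
    case GlobHfsLike(_, size) => Try(size())
    case other =>
      Failure(new UnsupportedOperationException(s"cannot size tap: ${other.name}"))
  }

  // Total input size, or the names of the taps whose size could not be computed.
  def totalSize(taps: Seq[TapLike]): Either[Seq[String], Long] = {
    val sized = taps.map(t => t.name -> sizeOf(t))
    val failed = sized.collect { case (n, Failure(_)) => n }
    if (failed.nonEmpty) Left(failed)
    else Right(sized.collect { case (_, Success(s)) => s }.sum)
  }
}
```

With this shape, a caller can log the `Left` names (as in the warning above) and skip estimation, while the job itself still fails later in Hadoop if the file is really missing.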

dieu commented 7 years ago

@piyushnarang / @isnotinvain / @johnynek please review.

johnynek commented 7 years ago

👍