salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 392 forks source link

changed to look specifically for part files rather than listing the files and searching through them #505

Closed leahmcguire closed 4 years ago

leahmcguire commented 4 years ago

Related issues Refer to issue(s) addressed in this pull request from Issues page.

Describe the proposed solution A clear and concise description of what the changes are.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context about the changes here.

tovbinm commented 4 years ago

Is it faster? What if another compression technique is being used, i.e. 7Zip, then we would have .7z extension?

codecov[bot] commented 4 years ago

Codecov Report

Merging #505 into master will decrease coverage by 0.70%. The diff coverage is 58.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #505      +/-   ##
==========================================
- Coverage   86.74%   86.03%   -0.71%     
==========================================
  Files         347      347              
  Lines       11859    11860       +1     
  Branches      388      607     +219     
==========================================
- Hits        10287    10204      -83     
- Misses       1572     1656      +84     
Impacted Files Coverage Δ
...cala/com/salesforce/op/OpWorkflowModelReader.scala 91.20% <58.33%> (-5.46%) :arrow_down:
...e/op/stages/impl/selector/RandomParamBuilder.scala 0.00% <0.00%> (-94.45%) :arrow_down:
...main/scala/com/salesforce/op/dsl/RichFeature.scala 50.00% <0.00%> (-50.00%) :arrow_down:
...mpl/regression/OpGeneralizedLinearRegression.scala 50.00% <0.00%> (-26.93%) :arrow_down:
...op/stages/impl/regression/OpXGBoostRegressor.scala 13.46% <0.00%> (-13.47%) :arrow_down:
...tages/impl/preparators/SanityCheckerMetadata.scala 84.45% <0.00%> (-5.41%) :arrow_down:
...s/impl/preparators/DerivedFeatureFilterUtils.scala 88.67% <0.00%> (-4.41%) :arrow_down:
...com/salesforce/op/utils/stages/FitStagesUtil.scala 89.47% <0.00%> (-3.95%) :arrow_down:
...rce/op/stages/impl/preparators/SanityChecker.scala 88.93% <0.00%> (-1.64%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4d46181...3d036d2. Read the comment docs.

leahmcguire commented 4 years ago

@gerashegalov suggested this for faster reading

gerashegalov commented 4 years ago

Is it faster? What if another compression technique is being used, i.e. 7Zip, then we would have .7z extension?

@tovbinm it should be faster on object stores since you have fast path lookups but slow entry list performance.

leahmcguire commented 4 years ago

@tovbinm this covers no compression and our default settings as well as passing in the exact full path. Do you think we need to cover more possibilities?

gerashegalov commented 4 years ago

this covers no compression and our default settings as well as passing in the exact full path. Do you think we need to cover more possibilities?

I don't think we need to make it that complex. At most I would make Compression Codec configurable instead of doing any guesses at all.