salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Fix writing of model in the local file system #528

Closed sanmitra closed 4 years ago

sanmitra commented 4 years ago

Related issues

Model loading in a cluster was failing with the error: op-model.json/part-00000 did not exist.

Additional context

In #516 we made changes to save the model locally and convert it to a zip before moving it to the remote path. The path passed to the saveImpl method below is therefore on the local file system, so when we try to save it as a Hadoop file, op-model.json/part-00000 is not written out.

override protected def saveImpl(path: String): Unit = {
    JobGroupUtil.withJobGroup(OpStep.ModelIO) {
      sc.parallelize(Seq(toJsonString(path)), 1)
        .saveAsTextFile(OpWorkflowModelReadWriteShared.jsonPath(path), classOf[GzipCodec])
    }(this.sparkSession)
  }
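Since saveAsTextFile executes on the executors, a driver-local file: path never receives its part file on the driver. One way to sidestep this for a small single-string payload is to write the JSON with plain JVM I/O on the driver. The following is a minimal sketch of that idea, not the actual patch in this PR; the helper name writeModelJsonLocally and the part-00000.gz layout are my assumptions:

```scala
import java.io.{FileOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import java.util.zip.GZIPOutputStream

// Hypothetical helper: write the model JSON as a single gzipped part file
// on the driver's local filesystem, mimicking the op-model.json/part-00000
// layout a reader might expect. Sketch only; not the code merged in this PR.
def writeModelJsonLocally(json: String, modelDir: String): Unit = {
  val dir = Paths.get(modelDir, "op-model.json")
  Files.createDirectories(dir)
  val out = new GZIPOutputStream(
    new FileOutputStream(dir.resolve("part-00000.gz").toFile))
  val writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)
  try writer.write(json) finally writer.close() // close() also finishes the gzip stream
}
```

Because this runs entirely in the driver JVM, the output lands on the machine that later zips and uploads the bundle, regardless of how many executors the job has.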
salesforce-cla[bot] commented 4 years ago

Thanks for the contribution! It looks like @ijeri is an internal user so signing the CLA is not required. However, we need to confirm this.

codecov[bot] commented 4 years ago

Codecov Report

Merging #528 (7501ab5) into master (bc507f2) will increase coverage by 4.26%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #528      +/-   ##
==========================================
+ Coverage   82.47%   86.74%   +4.26%     
==========================================
  Files         347      347              
  Lines       11952    11956       +4     
  Branches      377      384       +7     
==========================================
+ Hits         9858    10371     +513     
+ Misses       2094     1585     -509     
Impacted Files Coverage Δ
...cala/com/salesforce/op/OpWorkflowModelWriter.scala 100.00% <100.00%> (ø)
...sforce/op/stages/impl/feature/Transmogrifier.scala 98.05% <0.00%> (+0.83%) ↑
...esforce/op/features/types/FeatureTypeFactory.scala 99.13% <0.00%> (+0.86%) ↑
...la/com/salesforce/op/features/FeatureBuilder.scala 35.17% <0.00%> (+1.37%) ↑
...la/com/salesforce/op/stages/OpPipelineStages.scala 63.88% <0.00%> (+1.38%) ↑
...rce/op/stages/impl/preparators/SanityChecker.scala 90.57% <0.00%> (+1.63%) ↑
...scala/com/salesforce/op/testkit/RandomStream.scala 100.00% <0.00%> (+2.38%) ↑
...op/stages/DefaultOpPipelineStageReaderWriter.scala 69.23% <0.00%> (+2.56%) ↑
.../op/stages/impl/feature/PercentileCalibrator.scala 96.87% <0.00%> (+3.12%) ↑
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 82.19% <0.00%> (+4.10%) ↑
... and 47 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update bc507f2...f0db8d9.

gerashegalov commented 4 years ago

@ijeri can you give an example of the local path? Do you mean because the path is not absolute?

Saving RDD to a local path should work even if the local filesystem is not the default one:

scala> sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://127.0.0.1/")
scala> sc.parallelize(Seq(1,2,3)).saveAsTextFile("file:///tmp/test5")
gerashegalov commented 4 years ago

+1 to stop gzipping op-model json since we now zip the whole bundle
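The suggestion above is to store op-model.json uncompressed since the whole bundle is already zipped before upload. A minimal sketch of zipping a local bundle directory with plain java.util.zip follows; the zipBundle helper and its entry layout are assumptions for illustration, not this PR's actual code:

```scala
import java.io.FileOutputStream
import java.nio.file.{Files, Path}
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical sketch: zip an entire local model directory into one archive,
// so individual files inside it need no gzip of their own.
def zipBundle(modelDir: Path, zipFile: Path): Unit = {
  val zos = new ZipOutputStream(new FileOutputStream(zipFile.toFile))
  try {
    val it = Files.walk(modelDir).iterator()
    while (it.hasNext) {
      val f = it.next()
      if (Files.isRegularFile(f)) {
        // store entries relative to the bundle root, e.g. "op-model.json"
        zos.putNextEntry(new ZipEntry(modelDir.relativize(f).toString))
        zos.write(Files.readAllBytes(f))
        zos.closeEntry()
      }
    }
  } finally zos.close()
}
```

Zip already compresses each entry, so gzipping the JSON first would just mean double compression with no size benefit and an extra decode step on load.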

sanmitra commented 4 years ago

@gerashegalov the file path is a fully qualified one, for example file:/a.../b..../op-model.json

gerashegalov commented 4 years ago

Makes sense: after the RDD is distributed, its partitions are written out to the worker nodes' local dirs.

sanmitra commented 4 years ago

@gerashegalov I have addressed your comments. Please take a look again.

salesforce-cla[bot] commented 4 years ago

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Sanmitra Ijeri s***@s***.com. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.