salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Changing blocklist policy for Sequence stages #524

Closed michaelweilsalesforce closed 3 years ago

michaelweilsalesforce commented 3 years ago

Related issues Example: if a SequenceEstimator/Transformer with input features Seq(f1, f2, f3) has f1 on the blocklist, then, because Seq(f2, f3) and Seq(f1, f2, f3) differ in size, the SequenceEstimator/Transformer is removed when the DAG is updated. However, sequence stages should tolerate one or more of their original inputs being missing.

Describe the proposed solution When updating the DAG, a sequence stage is kept as long as its updated input list is non-empty.
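The difference between the two policies can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR or from OpWorkflow.scala: `Feature`, `SequenceStage`, `dropIfAnyBlocklisted`, and `keepIfNonEmpty` are hypothetical names chosen for the example, and a stage is modelled as nothing more than a name plus its input features.

```scala
// Hypothetical, simplified model of a sequence stage.
// These are NOT TransmogrifAI classes; they only illustrate the policy.
object SequenceBlocklistSketch {
  final case class Feature(name: String)
  final case class SequenceStage(name: String, inputs: Seq[Feature])

  // Behavior described in the example: if filtering the blocklist changes
  // the number of inputs, the whole stage is dropped from the DAG.
  def dropIfAnyBlocklisted(
    stage: SequenceStage,
    blocklist: Set[Feature]
  ): Option[SequenceStage] = {
    val remaining = stage.inputs.filterNot(blocklist.contains)
    if (remaining.size == stage.inputs.size) Some(stage) else None
  }

  // Proposed behavior: keep the stage with whatever inputs survive,
  // and drop it only when no inputs are left.
  def keepIfNonEmpty(
    stage: SequenceStage,
    blocklist: Set[Feature]
  ): Option[SequenceStage] = {
    val remaining = stage.inputs.filterNot(blocklist.contains)
    if (remaining.nonEmpty) Some(stage.copy(inputs = remaining)) else None
  }
}
```

With inputs Seq(f1, f2, f3) and f1 blocklisted, `dropIfAnyBlocklisted` returns `None`, while `keepIfNonEmpty` returns the stage with inputs Seq(f2, f3). Both return `None` when every input is blocklisted.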

Additional context This problem was noticed when a result feature ended up on the blocklist after the DAG was updated. We acknowledge that the policy could instead be changed in RawFeatureFilter.

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

codecov[bot] commented 3 years ago

Codecov Report

No coverage uploaded for pull request base (master@13ad9cd). The diff coverage is 100.00%.

@@            Coverage Diff            @@
##             master     #524   +/-   ##
=========================================
  Coverage          ?   86.73%           
=========================================
  Files             ?      347           
  Lines             ?    11961           
  Branches          ?      630           
=========================================
  Hits              ?    10374           
  Misses            ?     1587           
  Partials          ?        0           

| Impacted Files | Coverage Δ |
| --- | --- |
| .../src/main/scala/com/salesforce/op/OpWorkflow.scala | 88.59% <100.00%> (ø) |
| ...sforce/op/stages/base/unary/UnaryTransformer.scala | 100.00% <0.00%> (ø) |
| ...op/stages/impl/tuning/OpTrainValidationSplit.scala | 100.00% <0.00%> (ø) |
| ...ala/com/salesforce/op/test/TempDirectoryTest.scala | 82.00% <0.00%> (ø) |
| ...esforce/op/stages/impl/CheckIsResponseValues.scala | 75.00% <0.00%> (ø) |
| ...ala/com/salesforce/op/utils/io/csv/CSVToAvro.scala | 87.87% <0.00%> (ø) |
| ...ala/com/salesforce/op/testkit/RandomIntegral.scala | 100.00% <0.00%> (ø) |
| ...com/salesforce/op/test/TestOpWorkflowBuilder.scala | 100.00% <0.00%> (ø) |
| .../salesforce/op/stages/impl/tuning/DataCutter.scala | 97.22% <0.00%> (ø) |
| ...com/salesforce/op/testkit/ProbabilityOfEmpty.scala | 100.00% <0.00%> (ø) |

... and 338 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 13ad9cd...8c2ab5e.

michaelweilsalesforce commented 3 years ago

My code still needs some improvement. However, do you folks agree with this change?

michaelweilsalesforce commented 3 years ago

@nicodv @Jauntbox Please review. Thanks.

michaelweilsalesforce commented 3 years ago

Weird. I'm not able to reproduce the bug anymore. It might have been fixed. Closing the PR.