salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning.
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Remove duplicate features using sanity checker feature to feature correlations #476

Status: Closed (leahmcguire closed this pull request 4 years ago)

leahmcguire commented 4 years ago

Describe the proposed solution

When features are exact duplicates of each other, we would like to remove one of them before modeling.
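The idea can be illustrated with a minimal sketch in plain Scala. This is not the code from this pull request and it does not use the TransmogrifAI API: the names `DuplicateFeatureSketch`, `pearson`, `duplicatesToDrop`, and `maxCorr`, as well as the 0.99 threshold, are hypothetical and chosen only for illustration. It assumes each feature is a numeric column, and it treats a pair as duplicates when the absolute Pearson correlation reaches the threshold. Note that a correlation of 1.0 also occurs for columns that are linear rescalings of each other, not only for exact copies.

```scala
object DuplicateFeatureSketch {

  /** Pearson correlation of two equal-length columns; NaN if either column is constant. */
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.nonEmpty && x.length == y.length, "columns must be non-empty and of equal length")
    val n = x.length.toDouble
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val vx = x.map(a => (a - mx) * (a - mx)).sum
    val vy = y.map(b => (b - my) * (b - my)).sum
    cov / math.sqrt(vx * vy)
  }

  /**
   * Names of features to drop. Features are scanned in order; a feature is dropped
   * when its absolute correlation with an earlier, still-kept feature is >= maxCorr.
   * Constant columns yield NaN correlations, which never pass the comparison,
   * so they are never flagged here.
   */
  def duplicatesToDrop(
    features: Seq[(String, Seq[Double])],
    maxCorr: Double = 0.99 // hypothetical threshold, not taken from the pull request
  ): Set[String] =
    features.indices.foldLeft(Set.empty[String]) { (dropped, j) =>
      val (nameJ, colJ) = features(j)
      val isDuplicate = features.take(j).exists { case (nameI, colI) =>
        !dropped.contains(nameI) && math.abs(pearson(colI, colJ)) >= maxCorr
      }
      if (isDuplicate) dropped + nameJ else dropped
    }
}
```

Keeping the earlier feature of each pair and dropping the later one is one arbitrary tie-breaking choice; a real implementation might prefer the feature with the stronger relationship to the label instead.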

codecov[bot] commented 4 years ago

Codecov Report

Merging #476 into master will increase coverage by 0.03%. The diff coverage is 95.16%.

@@            Coverage Diff             @@
##           master     #476      +/-   ##
==========================================
+ Coverage   87.00%   87.03%   +0.03%     
==========================================
  Files         345      345              
  Lines       11643    11671      +28     
  Branches      376      614     +238     
==========================================
+ Hits        10130    10158      +28     
  Misses       1513     1513              
| Impacted Files | Coverage Δ |
|---|---|
| ...ala/com/salesforce/op/dsl/RichNumericFeature.scala | 100.00% <ø> (ø) |
| ...rce/op/stages/impl/preparators/SanityChecker.scala | 91.25% <90.90%> (-0.24%) ↓ |
| ...tages/impl/preparators/SanityCheckerMetadata.scala | 89.72% <95.12%> (+0.74%) ↑ |
| ...c/main/scala/com/salesforce/op/ModelInsights.scala | 93.04% <100.00%> (-0.07%) ↓ |
| ...s/impl/preparators/DerivedFeatureFilterUtils.scala | 93.08% <100.00%> (+0.31%) ↑ |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | 89.79% <0.00%> (+4.08%) ↑ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 24cdbc4...ac1c3a7.

nicodv commented 4 years ago

LGTM

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire l***@s***.com Leah McGuire l***@s***.com. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.