salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning.
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars, 393 forks

Remove names from SmartTextVectorizer; Add metadata features for sensitive fields #437

Closed: MWYang closed this 4 years ago

MWYang commented 5 years ago

Related issues: This PR completes the work that #440 started (itself a rework of #428). Merge #440 before this PR.

Describe the proposed solution: The optional flag DetectAndRemove in SmartTextVectorizer now works properly.

SensitiveFeatureInformation is a new optional attribute on FeatureInsight objects. Individual estimators, such as SmartTextVectorizer (implemented in this PR), can write sensitive feature information to the stage metadata, and ModelInsights then extracts it into the corresponding FeatureInsight objects.
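
To make the write-then-extract flow concrete, here is a minimal sketch in plain Scala. It does not use the actual TransmogrifAI classes: `SensitiveInfo`, `Insight`, the map-based `Metadata` type, and the `probName`/`removed` fields are all hypothetical stand-ins chosen for illustration, and the real PR may structure this differently.

```scala
import scala.util.Try

// Hypothetical stand-in for SensitiveFeatureInformation (field names are assumptions).
final case class SensitiveInfo(featureName: String, probName: Double, removed: Boolean)

// Hypothetical stand-in for FeatureInsight: the sensitive part is optional.
final case class Insight(featureName: String, sensitive: Option[SensitiveInfo] = None)

object SensitiveMetadataSketch {
  // Assumption: metadata is modelled as featureName -> (field -> serialized value).
  type Metadata = Map[String, Map[String, String]]

  // Estimator side: record what was detected for one feature in the metadata.
  def write(meta: Metadata, info: SensitiveInfo): Metadata =
    meta + (info.featureName -> Map(
      "probName" -> info.probName.toString,
      "removed"  -> info.removed.toString
    ))

  // Parse one metadata entry back into SensitiveInfo; None if a field is missing or malformed.
  private def parse(name: String, fields: Map[String, String]): Option[SensitiveInfo] =
    for {
      p <- fields.get("probName").flatMap(s => Try(s.toDouble).toOption)
      r <- fields.get("removed").flatMap(s => Try(s.toBoolean).toOption)
    } yield SensitiveInfo(name, p, r)

  // Insights side: build one Insight per feature, attaching sensitive info when present.
  def extract(meta: Metadata, featureNames: Seq[String]): Seq[Insight] =
    featureNames.map(name => Insight(name, meta.get(name).flatMap(parse(name, _))))
}
```

In this sketch, a feature that no estimator flagged simply yields an `Insight` whose `sensitive` field is `None`, which mirrors the "optional attribute" wording above.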

codecov[bot] commented 5 years ago

Codecov Report

Merging #437 into master will decrease coverage by 9.58%. The diff coverage is 32.69%.

@@            Coverage Diff             @@
##           master     #437      +/-   ##
==========================================
- Coverage   86.93%   77.34%   -9.59%     
==========================================
  Files         337      340       +3     
  Lines       11096    11346     +250     
  Branches      362      603     +241     
==========================================
- Hits         9646     8776     -870     
- Misses       1450     2570    +1120
Impacted Files: Coverage Δ
...scala/com/salesforce/op/utils/text/TextUtils.scala: 42.85% <0%> (-57.15%)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala: 65.27% <0%> (-17.83%)
.../scala/com/salesforce/op/features/types/Maps.scala: 77.77% <0%> (-15%)
...orce/op/utils/stages/NameIdentificationUtils.scala: 0% <0%> (ø)
...n/scala/com/salesforce/op/testkit/RandomText.scala: 98.41% <0%> (-1.59%)
...e/op/stages/impl/feature/HumanNameIdentifier.scala: 0% <0%> (ø)
...com/salesforce/op/features/FeatureSparkTypes.scala: 96.15% <100%> (-2.99%)
...sforce/op/features/types/FeatureTypeDefaults.scala: 49.03% <100%> (-47.08%)
...c/main/scala/com/salesforce/op/ModelInsights.scala: 93.06% <100%> (+0.31%)
...e/op/stages/impl/feature/SmartTextVectorizer.scala: 56.84% <25%> (-42.01%)
... and 91 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e45073d...173bd57.

MWYang commented 4 years ago

Closing this PR to rework it based on reviewer comments.