salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Detecting names in text fields #440

Closed MWYang closed 4 years ago

MWYang commented 4 years ago

SmartTextVectorizer now has an optional flag, detectSensitive, that guesses, using a combination of dictionary lookup and conditional logic, whether any of the input columns contain names (which we don't want in our models because they can introduce bias). For now, it only logs a warning to the console that the input fields may contain names. In the future, the removeSensitive flag will stop those columns from contributing to the output vector, and the gender information extracted from name columns (using government data) will be used to check model fairness.

A unary estimator HumanNameIdentifier is also included as a standalone drop-in for custom workflows.
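To make the "dictionary lookup and conditional logic" idea concrete, here is a minimal, self-contained sketch in plain Scala. It is not the code from this PR and does not use the TransmogrifAI API: the object `NameColumnSketch`, its tiny dictionary, the four-token limit, and the 0.5 threshold are all hypothetical choices made for illustration.

```scala
// Hypothetical sketch of dictionary-based name detection; not TransmogrifAI code.
object NameColumnSketch {
  // Tiny stand-in dictionary (assumption); a real one would be far larger.
  val nameDictionary: Set[String] =
    Set("alice", "bob", "maria", "wei", "smith", "garcia")

  // Split a cell into lowercase ASCII-letter tokens.
  // Note: this simple pattern drops non-ASCII letters, which a real detector should keep.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Conditional logic: a cell looks like a name if it is short (1 to 4 tokens)
  // and at least one token appears in the dictionary.
  def looksLikeName(text: String): Boolean = {
    val tokens = tokenize(text)
    tokens.nonEmpty && tokens.length <= 4 && tokens.exists(nameDictionary.contains)
  }

  // Flag the column when the share of non-empty cells that look like names
  // reaches the threshold (0.5 is an arbitrary illustrative default).
  def isLikelyNameColumn(column: Seq[Option[String]], threshold: Double = 0.5): Boolean = {
    val values = column.flatten.filter(_.trim.nonEmpty)
    values.nonEmpty && values.count(looksLikeName).toDouble / values.size >= threshold
  }
}
```

A caller could pass a column such as `Seq(Some("Alice Smith"), Some("Bob"), None)` to `NameColumnSketch.isLikelyNameColumn` and log a warning when it returns `true`, which mirrors the warn-only behavior described above. Free-text columns with long sentences fall outside the token limit and are not flagged by this sketch.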

Additional context: I completed this work as part of my ongoing Salesforce internship. This PR cleans up and replaces #428.

codecov[bot] commented 4 years ago

Codecov Report

Merging #440 into master will decrease coverage by 3.76%. The diff coverage is 18.62%.


@@            Coverage Diff             @@
##           master     #440      +/-   ##
==========================================
- Coverage   86.93%   83.17%   -3.77%     
==========================================
  Files         337      339       +2     
  Lines       11096    11296     +200     
  Branches      362      597     +235     
==========================================
- Hits         9646     9395     -251     
- Misses       1450     1901     +451
| Impacted Files | Coverage Δ |
|---|---|
| ...orce/op/utils/stages/NameIdentificationUtils.scala | 0% <0%> (ø) |
| ...scala/com/salesforce/op/utils/text/TextUtils.scala | 42.85% <0%> (-57.15%) |
| .../scala/com/salesforce/op/dsl/RichTextFeature.scala | 69.44% <0%> (-13.66%) |
| .../scala/com/salesforce/op/features/types/Maps.scala | 77.77% <0%> (-15%) |
| ...n/scala/com/salesforce/op/testkit/RandomText.scala | 98.41% <0%> (-1.59%) |
| ...e/op/stages/impl/feature/HumanNameIdentifier.scala | 0% <0%> (ø) |
| ...com/salesforce/op/features/FeatureSparkTypes.scala | 99.14% <100%> (ø) |
| ...sforce/op/features/types/FeatureTypeDefaults.scala | 96.15% <100%> (+0.03%) |
| ...e/op/stages/impl/feature/SmartTextVectorizer.scala | 58.82% <26.92%> (-40.03%) |
| ...esforce/op/features/types/FeatureTypeFactory.scala | 98.27% <50%> (-0.85%) |

... and 41 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 9778481...01b7205.

MWYang commented 4 years ago

Closing to rework this PR based on reviewer comments.