salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars, 392 forks

Detecting names in text fields (deprecated) #428

Closed: MWYang closed this 4 years ago

MWYang commented 4 years ago

SmartTextVectorizer now has an optional flag, detectSensitive, that uses a combination of dictionary lookup and conditional logic to guess whether any of the input columns contain names (which we don't want in our models because they can introduce bias). For now, it only logs a warning to the console that the input fields may contain names. In the future, the removeSensitive flag will stop those columns from contributing to the output vector, and the gender information extracted from name columns (using government data) will be used to check for model fairness.
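
To make the idea concrete, here is a minimal plain-Scala sketch of dictionary lookup combined with a simple conditional rule. It is only an illustration and not the code in this PR: the object name, the tiny dictionary, the tokenizer, and the 0.5 threshold are all assumptions, and it does not use Spark or any TransmogrifAI API.

```scala
// Hypothetical sketch: none of these names come from TransmogrifAI.
object NameDetectionSketch {
  // Tiny stand-in dictionary; a real one would be loaded from a much larger name list.
  val nameDictionary: Set[String] = Set("alice", "bob", "maria", "john", "wei")

  // Split a cell into lowercase alphabetic tokens.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Fraction of non-empty values whose tokens include at least one dictionary name.
  def nameFraction(values: Seq[String]): Double = {
    val nonEmpty = values.filter(_.trim.nonEmpty)
    if (nonEmpty.isEmpty) 0.0
    else nonEmpty.count(v => tokenize(v).exists(nameDictionary.contains)).toDouble / nonEmpty.size
  }

  // Conditional rule: flag the column when the fraction reaches an assumed threshold.
  def looksLikeNames(values: Seq[String], threshold: Double = 0.5): Boolean =
    nameFraction(values) >= threshold

  def main(args: Array[String]): Unit = {
    val columns = Map(
      "contact" -> Seq("Alice Smith", "Bob Jones", "Maria Garcia"),
      "comment" -> Seq("shipment delayed", "great service", "call back later")
    )
    // Mirror the "warn only" behaviour described above: log, but do not drop anything.
    columns.foreach { case (name, values) =>
      if (looksLikeNames(values)) println(s"WARN: column '$name' may contain names")
    }
  }
}
```

The sketch only warns, matching the current behaviour described above; removing flagged columns would be a separate step.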

A unary estimator HumanNameIdentifier is also included as a standalone drop-in for custom workflows.

Additional context: I completed this work as part of my ongoing Salesforce internship.

salesforce-cla[bot] commented 4 years ago

Thanks for the contribution! It looks like @MWYang is an internal user so signing the CLA is not required. However, we need to confirm this.

codecov[bot] commented 4 years ago

Codecov Report

Merging #428 into master will decrease coverage by 4.86%. The diff coverage is 78.85%.


@@            Coverage Diff             @@
##           master     #428      +/-   ##
==========================================
- Coverage   86.93%   82.07%   -4.87%     
==========================================
  Files         337      340       +3     
  Lines       11100    11375     +275     
  Branches      366      376      +10     
==========================================
- Hits         9650     9336     -314     
- Misses       1450     2039     +589

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...scala/com/salesforce/op/utils/text/TextUtils.scala | 42.85% <0%> (-57.15%) | ↓ |
| .../scala/com/salesforce/op/dsl/RichTextFeature.scala | 68.49% <0%> (-14.61%) | ↓ |
| .../scala/com/salesforce/op/features/types/Maps.scala | 77.77% <0%> (-15%) | ↓ |
| .../main/scala/org/apache/spark/util/SparkUtils.scala | 0% <0%> (ø) | ↑ |
| ...ala/com/salesforce/op/features/types/package.scala | 44.59% <0%> (-13.34%) | ↓ |
| ...n/scala/com/salesforce/op/testkit/RandomText.scala | 98.41% <0%> (-1.59%) | ↓ |
| .../scala/com/salesforce/op/features/types/Text.scala | 84% <0%> (-10.37%) | ↓ |
| ...sforce/op/stages/impl/feature/Transmogrifier.scala | 96.62% <100%> (-1.4%) | ↓ |
| ...com/salesforce/op/features/FeatureSparkTypes.scala | 99.14% <100%> (+0.01%) | ↑ |
| ...sforce/op/features/types/FeatureTypeDefaults.scala | 96.19% <100%> (+0.07%) | ↑ |
| ... and 64 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ccc1501...9ecad2f.

tovbinm commented 4 years ago

@MWYang thanks for the contribution. I would appreciate some context about the proposed changes in the PR description.

tovbinm commented 4 years ago

@MWYang why do we need a new FeatureType for Name? What is so specific about it that Text cannot handle?

MWYang commented 4 years ago

> @MWYang why do we need a new FeatureType for Name? What is so specific about it that Text cannot handle?

You're right, there's nothing special about Name. Early on, I thought it would make sense to have a different output type for the fields that get flagged as names, but that is no longer part of the feature. The preferred way to record which fields are names now seems to be the metadata, which is what I'm working towards.
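
As a rough illustration of the metadata approach, the sketch below keeps a column as ordinary text and records the "is a name" flag alongside it instead of introducing a separate type. Every identifier here (TextColumn, IsNameKey, markAsName, isFlaggedAsName) is hypothetical and is not a TransmogrifAI class or the design used in the PR.

```scala
// Hypothetical sketch: these types are illustrative and are not TransmogrifAI classes.
object NameMetadataSketch {
  // A text column keeps its ordinary type; extra facts travel in a metadata map.
  final case class TextColumn(
    name: String,
    values: Seq[String],
    metadata: Map[String, String] = Map.empty
  )

  // Assumed metadata key marking a column as likely containing names.
  val IsNameKey = "isName"

  // Return a copy of the column with the name flag set in its metadata.
  def markAsName(col: TextColumn): TextColumn =
    col.copy(metadata = col.metadata + (IsNameKey -> "true"))

  // Downstream stages can check the flag without needing a dedicated Name type.
  def isFlaggedAsName(col: TextColumn): Boolean =
    col.metadata.get(IsNameKey).contains("true")

  def main(args: Array[String]): Unit = {
    val contact = markAsName(TextColumn("contact", Seq("Alice Smith", "Bob Jones")))
    val comment = TextColumn("comment", Seq("great service"))
    Seq(contact, comment).foreach { c =>
      println(s"${c.name}: flagged as name = ${isFlaggedAsName(c)}")
    }
  }
}
```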

Should I revert the changes for creating the new Name feature type then?

MWYang commented 4 years ago

I'm closing this because I opened #440, which reduces the PR size by removing an unrelated feature that I didn't mean to commit to this branch and by removing the extraneous Name types, per @gerashegalov's suggestion. I would appreciate a review of the new PR so I know what to work on!

tovbinm commented 4 years ago

Thank you, @MWYang. I will have a look.