salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Make EmailVectorizer not clean the email domains by default. #426

Closed: sanmitra closed this 4 years ago

sanmitra commented 4 years ago

Related issues: Users want the email domain to appear as an indicator variable exactly as it is written in a real email address, punctuation included. For example, today the email field marc.benioff@salesforce.com produces a column with the indicator value salesforcecom, because the text of the domain name has been cleaned. Instead, we would like it to say "salesforce.com".

Describe the proposed solution: Added a case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean), which gives us more control over how text is cleaned in general. More parameters can be added in the future. By default this would be CleanTextParams(true, true) for all features except email/emailMap features, for which it would be CleanTextParams(true, false).
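The proposal above can be illustrated with a minimal sketch. Only the case class signature and the two default values come from the description; the `CleanTextSketch` object, the `clean` helper, and the reading of `ignoreCase = true` as "lowercase the text" are assumptions made for illustration. This is not the actual `TextUtils` implementation in TransmogrifAI.

```scala
// Case class as given in the PR description.
final case class CleanTextParams(ignoreCase: Boolean, cleanPunctuations: Boolean)

// Hypothetical container for the sketch; not part of the library.
object CleanTextSketch {

  // Defaults as described: all features clean case and punctuation...
  val DefaultParams: CleanTextParams =
    CleanTextParams(ignoreCase = true, cleanPunctuations = true)

  // ...except email/emailMap features, which keep punctuation.
  val EmailParams: CleanTextParams =
    CleanTextParams(ignoreCase = true, cleanPunctuations = false)

  // Hypothetical cleaning helper.
  // Assumption: ignoreCase = true means "lowercase the text".
  def clean(raw: String, params: CleanTextParams): String = {
    val cased = if (params.ignoreCase) raw.toLowerCase else raw
    val stripped =
      if (params.cleanPunctuations) cased.replaceAll("\\p{Punct}", "") else cased
    stripped.trim
  }
}
```

Under these assumptions, `CleanTextSketch.clean("salesforce.com", CleanTextSketch.DefaultParams)` yields the cleaned value the issue complains about, while passing `CleanTextSketch.EmailParams` leaves the dot in place.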

codecov[bot] commented 4 years ago

Codecov Report

Merging #426 into master will decrease coverage by 0.02%. The diff coverage is 80.76%.


@@            Coverage Diff             @@
##           master     #426      +/-   ##
==========================================
- Coverage   77.93%   77.91%   -0.03%     
==========================================
  Files         337      337              
  Lines       11082    11101      +19     
  Branches      355      370      +15     
==========================================
+ Hits         8637     8649      +12     
- Misses       2445     2452       +7

Impacted Files | Coverage Δ
.../scala/com/salesforce/op/dsl/RichTextFeature.scala | 61.97% <ø> (ø)
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala | 41.17% <ø> (ø)
...ce/op/stages/impl/feature/OpOneHotVectorizer.scala | 96.84% <100%> (+0.06%)
...p/stages/impl/feature/TextMapPivotVectorizer.scala | 100% <100%> (ø)
...scala/com/salesforce/op/utils/text/TextUtils.scala | 63.63% <60%> (-36.37%)
...sforce/op/stages/impl/feature/Transmogrifier.scala | 73.27% <92.3%> (+0.6%)
...es/src/main/scala/com/salesforce/op/OpParams.scala | 85.71% <0%> (-4.09%)
.../op/features/types/FeatureTypeSparkConverter.scala | 98.23% <0%> (-0.89%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update b8bae1c...9136753.

sanmitra commented 4 years ago

@gerashegalov I am closing this PR for now, since I am going to turn off the email cleaning directly in AutoML and leave TMOG as it is. In the future, we can change TMOG to provide more granular control over how text is cleaned if many use cases require it.