salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Enable Html stripping #478

Closed michaelweilsalesforce closed 4 years ago

michaelweilsalesforce commented 4 years ago

Related issues

When engineering features from Text (and Text-like) raw features, we should strip the text of any HTML tags, which add no signal to the existing tokens (and even pollute them).

Describe the proposed solution

Enable HTML stripping via TextTokenizer.AnalyzerHtmlStrip.
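To illustrate why tags pollute tokens, here is a minimal, dependency-free sketch. It is not the TextTokenizer.AnalyzerHtmlStrip implementation: the object `HtmlStripSketch`, its regex-based `stripHtml`, and the naive `tokenize` are hypothetical stand-ins written only for this example, and a real analyzer would handle entities, comments, and malformed markup more carefully.

```scala
// Illustrative stand-in only: a regex-based stripper, NOT TransmogrifAI's implementation.
object HtmlStripSketch {
  // Matches anything that looks like a tag, e.g. <p>, </div>, <a href="...">
  private val TagPattern = "<[^>]+>".r

  /** Replace each tag with a space so adjacent words do not fuse, then collapse whitespace. */
  def stripHtml(text: String): String =
    TagPattern.replaceAllIn(text, " ").replaceAll("\\s+", " ").trim

  /** Naive tokenizer (an assumption for this sketch): lowercase, split on non-alphanumerics. */
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toList

  def main(args: Array[String]): Unit = {
    val raw = "<p>Great <b>product</b>, see <a href=\"http://example.com\">details</a></p>"
    // Without stripping, tag and attribute names such as p, b, a, href leak into the tokens.
    println(tokenize(raw))
    // With stripping, only the visible words remain.
    println(tokenize(stripHtml(raw)))
  }
}
```

Without stripping, the tokens include markup fragments (tag names, attribute names, URL pieces) that carry no signal about the text itself; stripping first leaves only the visible words.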

michaelweilsalesforce commented 4 years ago

This PR doesn't introduce options yet

codecov[bot] commented 4 years ago

Codecov Report

Merging #478 into master will increase coverage by 0.00%. The diff coverage is 100.00%.


@@           Coverage Diff            @@
##           master     #478    +/-   ##
========================================
  Coverage   87.00%   87.01%            
========================================
  Files         345      345            
  Lines       11673    11680     +7     
  Branches      388      613   +225     
========================================
+ Hits        10156    10163     +7     
  Misses       1517     1517            
Impacted Files Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 67.64% <ø> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 82.19% <100.00%> (+0.24%) :arrow_up:
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (ø)
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.20% <100.00%> (+0.03%) :arrow_up:
...esforce/op/stages/impl/feature/TextTokenizer.scala 97.36% <100.00%> (+0.14%) :arrow_up:
...sforce/op/stages/impl/feature/Transmogrifier.scala 98.05% <100.00%> (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update eba38a0...97b9ce8.

TuanNguyen27 commented 4 years ago

@leahmcguire @Jauntbox could you take a look at this PR? I'm not sure how to test my changes :(

gerashegalov commented 4 years ago

Let us actually fill out the form for the PR description to set the context :)