salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars, 392 forks

Experiment with OpenNLP Language detection #512

Closed (gerashegalov closed 3 years ago)

tovbinm commented 4 years ago

Let us know whether it works better or worse than the Optimaize one. And please update the PR description to explain the motivation behind it.

tovbinm commented 4 years ago

It seems to support more languages. Is this correct?

codecov[bot] commented 4 years ago

Codecov Report

Merging #512 into master will decrease coverage by 61.05%. The diff coverage is 0.00%.


@@             Coverage Diff             @@
##           master     #512       +/-   ##
===========================================
- Coverage   86.74%   25.68%   -61.06%     
===========================================
  Files         347      349        +2     
  Lines       11859    11886       +27     
  Branches      388      612      +224     
===========================================
- Hits        10287     3053     -7234     
- Misses       1572     8833     +7261     
| Impacted Files | Coverage Δ |
|---|---|
| ...lesforce/op/stages/impl/feature/LangDetector.scala | 0.00% <0.00%> (-100.00%) :arrow_down: |
| .../op/stages/impl/feature/NameEntityRecognizer.scala | 0.00% <0.00%> (-100.00%) :arrow_down: |
| ...esforce/op/stages/impl/feature/TextTokenizer.scala | 0.00% <0.00%> (-97.37%) :arrow_down: |
| ...sforce/op/utils/text/OpenNLPLanguageDetector.scala | 0.00% <0.00%> (ø) |
| ...a/com/salesforce/op/utils/text/OpenNLPModels.scala | 0.00% <0.00%> (-97.62%) :arrow_down: |
| ...orce/op/utils/text/OptimaizeLanguageDetector.scala | 0.00% <0.00%> (-90.91%) :arrow_down: |
| ...om/salesforce/op/utils/text/LanguageDetector.scala | 0.00% <0.00%> (ø) |
| ...main/scala/com/salesforce/op/dsl/RichFeature.scala | 0.00% <0.00%> (-100.00%) :arrow_down: |
| ...main/scala/com/salesforce/op/filters/Summary.scala | 0.00% <0.00%> (-100.00%) :arrow_down: |
| .../scala/com/salesforce/op/cli/gen/ProblemKind.scala | 0.00% <0.00%> (-100.00%) :arrow_down: |

... and 210 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4d46181...6f42be6.

Jauntbox commented 4 years ago

Cool - curious to see how this compares. Another one we could try is FastText, which also has a language detection module: https://github.com/facebookresearch/fastText/#full-documentation
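
For anyone wanting to run the comparison discussed in this thread, the following is a minimal sketch of an accuracy harness, not code from this PR. It assumes each detector can be wrapped as a function from text to language-code scores. The `Detector` type, the `accuracy` helper, the two stub detectors, and the sample data are all hypothetical and exist only for illustration; real wrappers around Optimaize, OpenNLP, or FastText would replace the stubs.

```scala
// Hypothetical comparison harness: none of these names come from TransmogrifAI.
object LangDetectComparison {
  // Assumption: a detector maps text to (language code -> confidence score).
  type Detector = String => Map[String, Double]

  // Fraction of labeled samples whose top-scoring language matches the expected label.
  def accuracy(detector: Detector, labeled: Seq[(String, String)]): Double = {
    if (labeled.isEmpty) 0.0
    else {
      val correct = labeled.count { case (text, expected) =>
        val scores = detector(text)
        scores.nonEmpty && scores.maxBy(_._2)._1 == expected
      }
      correct.toDouble / labeled.size
    }
  }

  // Stand-in detectors for illustration only; swap in wrappers around real libraries.
  val stubA: Detector = text =>
    if (text.contains("the")) Map("en" -> 0.9, "fr" -> 0.1)
    else Map("fr" -> 0.8, "en" -> 0.2)

  // Always answers English, regardless of input.
  val stubB: Detector = _ => Map("en" -> 1.0)

  // Tiny made-up evaluation set of (text, expected language code) pairs.
  val samples: Seq[(String, String)] = Seq(
    "the quick brown fox" -> "en",
    "le renard brun rapide" -> "fr"
  )

  def main(args: Array[String]): Unit = {
    println(f"stubA accuracy: ${accuracy(stubA, samples)}%.2f")
    println(f"stubB accuracy: ${accuracy(stubB, samples)}%.2f")
  }
}
```

Keeping the detector behind a plain function type means the same labeled set can be scored against each library without changing the harness. A real comparison would also need a much larger labeled corpus covering the languages each library claims to support.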