salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 393 forks source link

Avoid creating SparseVectors for LOCO #377

Closed gerashegalov closed 5 years ago

gerashegalov commented 5 years ago

Related issues

376

Describe the proposed solution reuse the original SparseVector as a mutable template

Additional context In a scoring job:

before LOCO with record insights with record insights after PR
< 10s 168 sec 11sec
leahmcguire commented 5 years ago

Wow!

codecov[bot] commented 5 years ago

Codecov Report

Merging #377 into master will decrease coverage by <.01%. The diff coverage is 90.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #377      +/-   ##
==========================================
- Coverage   86.83%   86.83%   -0.01%     
==========================================
  Files         336      336              
  Lines       10955    10957       +2     
  Branches      347      577     +230     
==========================================
+ Hits         9513     9514       +1     
- Misses       1442     1443       +1
Impacted Files Coverage Δ
...ala/com/salesforce/op/utils/spark/RichVector.scala 84.61% <0%> (-15.39%) :arrow_down:
...e/op/stages/impl/insights/RecordInsightsLOCO.scala 96.77% <100%> (+0.1%) :arrow_up:
...m/salesforce/op/evaluators/EvaluationMetrics.scala 86.66% <0%> (-0.84%) :arrow_down:
...op/stages/impl/selector/ModelSelectorSummary.scala 92.55% <0%> (+0.71%) :arrow_up:
...es/src/main/scala/com/salesforce/op/OpParams.scala 89.79% <0%> (+4.08%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update dc64b4f...087b1d2. Read the comment docs.

michaelweilsalesforce commented 5 years ago

Wait, if we apply foreachActive on a dense vector, wouldn't this look at ALL the element of the vectors?

gerashegalov commented 5 years ago

it's still WIP but I have this guard if oldVal != 0.0 in the pattern match @michaelweilsalesforce

tovbinm commented 5 years ago

Neat. What about memory complexity?

gerashegalov commented 5 years ago

reworked the solution to avoid the memory overhead of the dense vector.

gerashegalov commented 5 years ago

Thanks Michael!

On Wed, Aug 7, 2019 at 5:11 PM Michael Weil notifications@github.com wrote:

@michaelweilsalesforce commented on this pull request.

In core/src/main/scala/com/salesforce/op/stages/impl/insights/RecordInsightsLOCO.scala https://github.com/salesforce/TransmogrifAI/pull/377#discussion_r311811639 :

  • agggregateDiffs(0, Left(featureSparse), indexToExamine, minMaxHeap, aggregationMap,
  • baseScore)

LEt me write a fix that will not go over the zeros

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/salesforce/TransmogrifAI/pull/377?email_source=notifications&email_token=AAYKJYSL3NJYROY5C2RIPFTQDNQEZA5CNFSM4IH6XKZKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCA5M2TY#discussion_r311811639, or mute the thread https://github.com/notifications/unsubscribe-auth/AAYKJYWIBVUMCL35MSVTNC3QDNQEZANCNFSM4IH6XKZA .

michaelweilsalesforce commented 5 years ago

@gerashegalov Here is a proposal that skips the diffs for zero values. Code can be nicer though

gerashegalov commented 5 years ago

Thank you, looks good, just a few polishes

tovbinm commented 5 years ago

Shall we merge this one?

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! It looks like @mweilsalesforce is an internal user so signing the CLA is not required. However, we need to confirm this.

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire l***@s***.com. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.