linkedin / photon-ml

A scalable machine learning library on Apache Spark

Dataframe #451

Open lguo opened 4 years ago

lguo commented 4 years ago

This PR covers the following changes from the design doc:

  1. Training datasets will be created directly before training a coordinate.
  2. Residuals will be computed by using a UDF on the training DataFrame. For random effects, the per-entity models will first need to be joined to the DataFrame by REID. A single UDF will do all scoring for fixed and random effects at once. This UDF will also sum the residuals and offsets. Directly before aggregation, the DataFrame will be converted to an RDD, and then aggregation will proceed unmodified.
  3. Model scoring will work like coordinate scoring.
  4. Random effect vector projection will be disabled.
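
To make item 2 concrete, here is a rough, Spark-free sketch of the join-then-score logic, modeled with plain Python dicts and lists rather than DataFrames. It is not the code in this PR. The names `score_row`, `compute_scores`, and the row fields are hypothetical, and it assumes that fixed and random effects share one feature vector and that a row whose REID has no per-entity model gets a random-effect score of zero (left-join semantics).

```python
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dense dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_row(features: Vector, offset: float, fixed_coefs: Vector,
              random_coefs: Optional[Vector]) -> float:
    """Stand-in for the single scoring UDF (hypothetical name).

    Scores the fixed effect and, when a per-entity model is present,
    the random effect, then adds the existing offset.
    """
    total = offset + dot(fixed_coefs, features)
    if random_coefs is not None:  # assumption: no model means zero score
        total += dot(random_coefs, features)
    return total


def compute_scores(rows: List[dict], fixed_coefs: Vector,
                   per_entity_models: Dict[str, Vector]) -> Dict[int, float]:
    """Look up each row's per-entity model by REID, then score the row once."""
    return {
        row["uid"]: score_row(
            row["features"],
            row["offset"],
            fixed_coefs,
            per_entity_models.get(row["reid"]),  # dict lookup plays the join
        )
        for row in rows
    }


# Toy data: entity "a" has a per-entity model, entity "b" does not.
rows = [
    {"uid": 1, "reid": "a", "features": [1.0, 2.0], "offset": 0.5},
    {"uid": 2, "reid": "b", "features": [0.0, 1.0], "offset": 0.0},
]
fixed_coefs = [0.5, 0.25]
per_entity_models = {"a": [1.0, 0.0]}

scores = compute_scores(rows, fixed_coefs, per_entity_models)
```

In this toy run, row 1 gets offset, fixed-effect, and random-effect contributions, while row 2 gets only the offset and fixed-effect score. In Spark, the dict lookup would correspond to the join by REID and `score_row` to the UDF applied per row.
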

ashelkovnykov commented 4 years ago

Forgot to comment - since all of these commits are related to one task and don't seem to have any logical separation, would you kindly squash them into one commit?

lguo commented 4 years ago

I skipped reviewing most of the scoring changes, as they looked like early WIP that is still subject to many changes.

FixedEffectCoordinate.updateOffset (and RandomEffectCoordinate.updateOffset) are used to compute scores instead of merging scores back into the original dataset.