h2oai / h2o-tutorials

Tutorials and training material for the H2O Machine Learning Platform
http://h2o.ai

How do I generate Arch features of new datasets from GLRM predict function #160

Open tsengj opened 2 years ago

tsengj commented 2 years ago

I raised the same question on Stack Overflow:

https://stackoverflow.com/questions/72753783/how-do-i-generate-the-archetypes-of-new-dataset-from-the-glrm-predict-function

I have used these sites as references, and although they have been helpful, I am unable to regenerate the reduced dimensions for new datasets via the GLRM predict function.

I work in the sparklyr environment with H2O. I'm keen to use the GLRM function to reduce dimensions before clustering. Although I can access the reduced dimensions (the Arch columns) from the model, I would like to generate the same Arch columns for new datasets with the GLRM predict function.

Appreciate your help.

Here is how I train the GLRM model on the training dataset:

glrm_model <-
  h2o.glrm(
    training_frame = train,
    cols = glrm_cols,
    loss = "Absolute",
    model_id = "rank2",
    seed = 1234,
    k = 5,
    transform = "STANDARDIZE",
    loss_by_col_idx = losses$index,
    loss_by_col = losses$loss
  )
# Decompose training frame into XY
X <- h2o.getFrame(glrm_model@model$representation_name) #as h2o frame

The archetype features (X) from the training dataset:

X
        Arch1      Arch2       Arch3      Arch4      Arch5
1  0.10141381 0.10958071  0.26773514 0.11584502 0.02865024
2  0.11471676 0.06489475  0.01407714 0.24536782 0.10223535
3  0.08848878 0.26742082  0.04915022 0.11693702 0.03530641
4 -0.03062604 0.29793032 -0.07003814 0.01927975 0.52451867
5  0.09497268 0.12915324  0.21392107 0.08574152 0.03750636
6  0.05857743 0.18863508  0.14570623 0.08695144 0.03448957

But when I use the trained GLRM model on a new dataset to regenerate these archetype features, I get the full set of original dimensions instead of the Arch columns shown in X above.

I'm using these Arch columns as features for clustering.

# Generate predictions on a validation set (if necessary):
glrm_pred <- h2o.predict(glrm_model, newdata = test)
glrm_pred
  reconstr_price reconstr_bedrooms reconstr_bathrooms reconstr_sqft_living reconstr_sqft_lot reconstr_floors reconstr_waterfront reconstr_view reconstr_condition reconstr_grade reconstr_sqft_above reconstr_sqft_basement reconstr_yr_built reconstr_yr_renovated
1     -0.8562455       -1.03334892         -1.9903167           -1.3950774        -0.2025564      -1.6537486                   0             4                  5             13         -1.20187061             -0.6584413       -1.25146116            -0.3042907
2     -0.7940549       -0.29723926         -0.7863867           -0.4364751        -0.1666500      -0.8527297                   0             4                  5             13         -0.13831432             -0.6545514        0.54821146            -0.3622902
3     -0.7499614       -0.18296317          0.1970824           -0.3989486        -0.1532677       0.4914559                   0             4                  5             13         -0.09100889             -0.6614534        1.38779632            -0.1844416
4     -1.0941432        0.08954988          0.7872987           -0.2087964        -0.1599888       0.8254916                   0             4                  5             13          0.11973488             -0.6623575        2.70176558            -0.2363486
5      0.3727360        0.82848389          0.4965246            1.1134378        -0.9013011      -1.3388791                   0             4                  5             13          0.08427185              2.1354440       -0.07213625            -1.2213866
6     -0.4042458       -0.59876839         -0.9685556           -0.7093578        -0.1745297      -0.5061798                   0             4                  5             13         -0.43503836             -0.6628391       -0.55165408            -0.2207544
  reconstr_lat reconstr_long reconstr_sqft_living15 reconstr_sqft_lot15
1  -0.07307503    -0.4258959             -1.0132778          -0.1964471
2  -0.52124543     0.7283153              0.1242903          -0.1295341
3  -0.56113519     0.6011221             -0.1616215          -0.1624136
4  -0.99759090     1.3032420              0.1556193          -0.1569607
5   0.70028433    -0.6436112              1.1400189          -0.9272790
6  -0.02222403    -0.2257382             -0.4859787          -0.1817499

[6416 rows x 18 columns] 

Thank you.

wendycwong commented 2 years ago

James: Thank you for bringing this issue to me. @us8945 has also raised a good question about how to score a new dataset using a trained GLRM model. Let me answer his question first:

Given a training dataset, the purpose of GLRM is to extract a set of basis vectors that span the subspace from which the training dataset is drawn. The GLRM model therefore generates a set of archetypes, which play the role of basis vectors. Each row vector in the training dataset can then be written as a linear combination of the archetypes: yi = x1*archetype1 + x2*archetype2 + x3*archetype3 + ... Here x1, x2, x3 are the coefficients found for that row when we call predict on the training dataset. During training, we derive the archetypes and the coefficients together by alternating between the two.
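In matrix notation, the linear-combination description above can be sketched as follows. The symbols are assumed here for illustration: A is the n x p data matrix with rows a_i (the row vector written yi above), Y is the k x p matrix whose rows y_j are the archetypes, X is the n x k matrix of coefficients x_ij, and L_j is the loss chosen for column j. Regularization terms are left out for brevity.

```latex
a_i \approx \sum_{j=1}^{k} x_{ij}\, y_j
\quad\Longleftrightarrow\quad
A \approx X Y,
\qquad
\min_{X,\,Y} \; \sum_{i=1}^{n} \sum_{j=1}^{p} L_j\!\left(A_{ij},\, (XY)_{ij}\right)
```

The reduced-dimension features used for clustering are the rows of X, not the reconstruction XY.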

Now, given a new dataset drawn from the same subspace as the training dataset, the job of the predict function is to find the set of coefficients for each data row using the archetypes derived earlier. We already know the archetypes and only need to find the coefficients. This is done by initializing the coefficients to random values and then using simple gradient descent to minimize the objective function and obtain the correct coefficients.
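The idea of fixing the archetypes and solving only for the coefficients can be illustrated with a small base-R sketch. It is not H2O's internal code: it assumes a quadratic loss with no regularization, and the function name `solve_x_new` and the toy matrices are hypothetical.

```r
# Illustrative only: gradient descent for X_new with the archetypes Y held fixed.
# Assumes quadratic loss 0.5 * ||A_new - X Y||_F^2 and no regularization.
solve_x_new <- function(A_new, Y, n_iter = 500, seed = 1234) {
  set.seed(seed)
  k <- nrow(Y)
  # Random starting coefficients, one row per new observation
  X <- matrix(rnorm(nrow(A_new) * k), nrow = nrow(A_new), ncol = k)
  # Step size 1 / (largest eigenvalue of Y Y^T) keeps the iteration stable
  step <- 1 / max(eigen(Y %*% t(Y), symmetric = TRUE, only.values = TRUE)$values)
  for (i in seq_len(n_iter)) {
    residual <- X %*% Y - A_new        # n x p reconstruction error
    X <- X - step * residual %*% t(Y)  # gradient step with respect to X only
  }
  X
}

# Toy data: 2 archetypes in 3 dimensions, 4 new rows built from known coefficients
Y <- matrix(c(1, 0, 1,
              0, 1, 1), nrow = 2, byrow = TRUE)
X_true <- matrix(c(0.5, 0.2,
                   0.1, 0.9,
                   1.0, 0.0,
                   0.3, 0.3), ncol = 2, byrow = TRUE)
A_new <- X_true %*% Y
X_new <- solve_x_new(A_new, Y)
max(abs(X_new - X_true))  # close to zero: the known coefficients are recovered
```

Because Y never changes inside the loop, the new rows are expressed in the same archetype space as the training rows.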

tsengj commented 2 years ago

Thanks Wendy, but unfortunately this is beyond my depth. Your response here made more sense, where you wrote:

"GLRM, you decompose a matrix A = XY and you perform clustering on X. For a new dataset, ANew, you need to get your new XNew. To do this, you perform XNew = ANew * inverse(Y)"

How do I implement this ANew * inverse(Y) in R so that I can cluster on the features from XNew?

Apologies, I'm quite a novice in this space.
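One possible reading of "ANew * inverse(Y)" is sketched below in base R. Y is k x p and generally not square, so it has no ordinary inverse; the sketch uses the right pseudo-inverse instead, which gives the least-squares solution. This is only exact under a quadratic loss with no regularization, so for a model trained with other losses (such as "Absolute" above) it would be an approximation at best. ANew would also need the same transformation as the training data (for example, standardization). The function name `project_onto_archetypes` and the toy matrices are hypothetical.

```r
# Illustrative only: least-squares solve of A_new ~= X_new %*% Y for X_new.
project_onto_archetypes <- function(A_new, Y) {
  # Right pseudo-inverse of Y: t(Y) %*% solve(Y %*% t(Y)).
  # Requires Y to have full row rank (linearly independent archetypes).
  A_new %*% t(Y) %*% solve(Y %*% t(Y))
}

# Toy values: 2 archetypes x 3 columns
Y <- matrix(c(1, 0, 1,
              0, 1, 1), nrow = 2, byrow = TRUE)
# Toy values: 2 new rows x 3 columns
A_new <- matrix(c(0.5, 0.2, 0.7,
                  0.1, 0.9, 1.0), nrow = 2, byrow = TRUE)

X_new <- project_onto_archetypes(A_new, Y)
X_new  # one row of archetype coefficients per new observation
```

The rows of X_new could then be passed to a clustering routine such as `stats::kmeans` in the same way as the training X.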

wendycwong commented 2 years ago

James:

I know what you are looking for: the X for a new dataset. Luckily, Uri (@us8945) has brought the issue up with me. I will write a new function so that you can get the new X.

The predict function returns ANew (the reconstructed data) to you, but you are looking for the new X.

Will get this done for you.

Thank you for bringing this to our attention.

W

wendycwong commented 2 years ago

Here is the JIRA: https://h2oai.atlassian.net/browse/PUBDEV-8750