exalate-issue-sync[bot] opened 1 year ago
Wendy commented: Karthik and I met on Oct 29, 2020 and had a good discussion. A summary of the work topics is as follows:
Wendy commented: Karthik: for all families, I am curious to see for what n an n-fold CV(lambda) will be close enough to the CV(lambda) of equation 3.10.
Wendy commented: November 6, 2020:
Meeting summary with Karthik:
We decided to do the following:
!image-20201106-231310.png|width=576,height=83!
and report the deviation between cv(lambda) from equations 3.10 and 3.19 (see the formulas after this list). Generate the two for many scale (lambda) values, then plot and compare them.
Come up with a scheme to decide which scale(lambda) values to try.
Reverse engineer R and see how it selects the scale for multiple GAM columns and other families.
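For reference, here are the two scores being compared, written in terms of the influence (hat) matrix A. The equation numbers refer to a document attached to this issue (likely GAM_doc.pdf), so the exact correspondence is an assumption; the two standard scores in play are the leave-one-out (ordinary) CV score and the GCV score:

$$\mathcal{V}_o = \frac{1}{n}\sum_{i=1}^{n}\frac{(y_i - \hat{f}_i)^2}{(1 - A_{ii})^2}, \qquad \mathcal{V}_g = \frac{n\sum_{i=1}^{n}(y_i - \hat{f}_i)^2}{\left[\,n - \operatorname{tr}(A)\,\right]^2}$$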
Wendy needs to:
A1. provide scripts to generate random GAM data, and
A2. point out where the calculation of A is in the GLM code.
Wendy commented: Completing A1: To generate a synthetic GAM dataset for the Gaussian family, use the following Python file. Store it under the h2o-3/h2o-py/tests/testdir_algos/gam directory:
{noformat}import sys
sys.path.insert(1, "../../../")
import h2o
from tests import pyunit_utils
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator as gam


def test_define_dataset():
    family = 'binomial'  # can be any valid GLM family
    nrow = 100000
    ncol = 1
    realFrac = 1
    intFrac = 0
    enumFrac = 0
    missing_fraction = 0
    factorRange = 50
    numericRange = 10
    targetFactor = 1
    numGamCols = 1
    assert numGamCols <= ncol*realFrac, "Number of gam columns {1} should not exceed the number of real columns " \
                                        "{0}".format(ncol*realFrac, numGamCols)  # gam columns can only be real columns
    gamDataSet = generate_dataset(family, nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
                                  numericRange, targetFactor, numGamCols)
    # h2o.download_csv(gamDataSet, "/Users/.../dataset.csv")  # save dataset
    assert gamDataSet.nrow == nrow, "Dataset number of rows: {0}, expected number of rows: {1}".format(gamDataSet.nrow,
                                                                                                       nrow)
    assert gamDataSet.ncol == (1+ncol), "Dataset number of columns: {0}, expected number of columns: " \
                                        "{1}".format(gamDataSet.ncol, (1+ncol))


def generate_dataset(family, nrow, ncol, realFrac, intFrac, enumFrac, missingFrac, factorRange, numericRange,
                     targetFactor, numGamCols):
    if family == "binomial":
        responseFactor = 2
    elif family == 'gaussian':
        responseFactor = 1
    else:
        responseFactor = targetFactor
    trainData = random_dataset(nrow, ncol, realFrac=realFrac, intFrac=intFrac, enumFrac=enumFrac, factorR=factorRange,
                               integerR=numericRange, responseFactor=responseFactor, misFrac=missingFrac)
    myX = trainData.names
    myY = 'response'
    myX.remove(myY)
    colNames = trainData.names
    colNames.remove("response")
    m = gam(family=family, max_iterations=10, gam_columns=colNames[0:numGamCols])
    m.train(training_frame=trainData, x=myX, y=myY)
    f2 = m.predict(trainData)
    # to see the coefficients, do m.coef()
    finalDataset = trainData[myX]
    finalDataset = finalDataset.cbind(f2[0])  # replace the response with the model prediction
    finalDataset.set_name(col=finalDataset.ncols-1, name='response')
    h2o.remove(trainData)
    return finalDataset


def random_dataset(nrow, ncol, realFrac=0.4, intFrac=0.3, enumFrac=0.3, factorR=10, integerR=100,
                   responseFactor=1, misFrac=0.01, randSeed=None):
    fractions = dict()
    if (ncol == 1) and (realFrac >= 1.0):  # all-real frame
        fractions["real_fraction"] = 1
        fractions["categorical_fraction"] = 0
        fractions["integer_fraction"] = 0
        fractions["time_fraction"] = 0
        fractions["string_fraction"] = 0  # right now we are dropping string columns, so no point in having them
        fractions["binary_fraction"] = 0
        return h2o.create_frame(rows=nrow, cols=ncol, missing_fraction=misFrac, has_response=True,
                                response_factors=responseFactor, integer_range=integerR,
                                seed=randSeed, **fractions)
    real_part = pyunit_utils.random_dataset_real_only(nrow, int(realFrac*ncol), misFrac=misFrac, randSeed=randSeed)
    enumFrac = enumFrac + (1-realFrac)/2
    intFrac = 1-enumFrac
    fractions["real_fraction"] = 0  # the real columns were generated separately above
    fractions["categorical_fraction"] = enumFrac
    fractions["integer_fraction"] = intFrac
    fractions["time_fraction"] = 0
    fractions["string_fraction"] = 0  # right now we are dropping string columns, so no point in having them
    fractions["binary_fraction"] = 0
    df = h2o.create_frame(rows=nrow, cols=(ncol-real_part.ncol), missing_fraction=misFrac, has_response=True,
                          response_factors=responseFactor, integer_range=integerR,
                          seed=randSeed, **fractions)
    return real_part.cbind(df)


if __name__ == "__main__":
    pyunit_utils.standalone_test(test_define_dataset)
else:
    test_define_dataset(){noformat}
Wendy commented: Nov 12, 2020
Meeting summary with Karthik M:
For the coming week, Karthik aims to finish the following:
Other discussion topics included:
Wendy commented: A good free interactive course for understanding GAMs:
https://noamross.github.io/gams-in-r-course/
Wendy commented: Nov 30, 2020
Met with Karthik. Here is the meeting summary:
!image-20201130-230501.png|width=485,height=161!
using the synthetic GAM dataset. Set lambda=0, alpha=0, and scale=0 to find an estimate of the MSE, then try scale parameter values from 0 up to that MSE, with the grid set as follows: let m be the maximum value you are going to try, set min_ratio to 1e-4, and let n be the number of scale parameters to try. Set decrement = Math.pow(min_ratio, 1.0/(n-1)), then use the following loop to determine all the scale parameters:
double[] scaleParam = new double[n];
double decrement = Math.pow(min_ratio, 1.0/(n-1));
scaleParam[0] = m;
for (int index = 1; index < n; index++)
    scaleParam[index] = (m *= decrement);
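A quick Python sketch of the same scheme (m, min_ratio, and n below are placeholder values, not from the meeting) shows the resulting grid is a geometric sequence running from m down to m*min_ratio:
{noformat}import math

m = 100.0          # hypothetical maximum scale value to try
min_ratio = 1e-4
n = 5              # hypothetical number of scale parameters to try

decrement = math.pow(min_ratio, 1.0 / (n - 1))
scale_param = [m]
for _ in range(1, n):
    m *= decrement
    scale_param.append(m)
print(scale_param)  # approximately [100.0, 10.0, 1.0, 0.1, 0.01]; the last value is m * min_ratio{noformat}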
Note that
!image-20201201-043904.png|width=386,height=64!
and in GLM, the following is calculated:
!image-20201201-043931.png|width=345,height=66!
You will need to use what GLM already computes but change it to calculate A instead. See our discussion during chat.
Wendy commented: December 3rd, 2020:
Met with Karthik and here is a summary of our discussion:
!image-20201203-222132.png|width=683,height=93!
to solve what he needs
!image-20201203-222148.png|width=291,height=33!
in order to calculate Vg.
Wendy commented: Let G = (T(X)X + lambda*S);
G*beta = T(X)y # this is the system chol.solve is trying to solve
chol.solve will return beta;
We decided to get A in two parts. First, get B = inverse(T(X)X + lambda*S) * T(X).
Then A = X*B.
B = inv(G)*T(X)
G*B = T(X) # solve for B one column at a time:
G*B1 = column 1 of T(X)
To generalize for the ith column:
G*(B * identity matrix column i) = T(X) * (identity matrix column i)
Part of the X matrix is stored in _dinfo._adaptedFrame; what _dinfo._adaptedFrame lacks is the column of ones (for the intercept).
X = [_dinfo._adaptedFrame col(ones)]
Here is an example of how to add the column of ones:
Frame X = new Frame(_dinfo._adaptedFrame);
X.add("colOnes", Vec.makeOne(_dinfo._adaptedFrame.numRows()));
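A minimal numpy sketch of this two-part scheme (a standalone illustration, not H2O code; X, S, and lam below are made-up stand-ins):
{noformat}import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=(20, 3)), np.ones(20)])  # append the column of ones
S = np.eye(4)    # stand-in penalty matrix
lam = 0.5        # stand-in lambda

G = X.T @ X + lam * S            # G = T(X)X + lambda*S
chol = cho_factor(G)             # Cholesky factorization of G
Xt = X.T
B = np.empty_like(Xt)
for i in range(Xt.shape[1]):     # solve G*B = T(X) one column at a time
    B[:, i] = cho_solve(chol, Xt[:, i])
A = X @ B                        # A = X * inv(G) * T(X), the influence matrix
print(np.trace(A))               # tr(A) is what the GCV formula above needs{noformat}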
Wendy commented: December 14, 2020
Met with Karthik and we discussed how to generate the cross-validation score.
{noformat}mseSum = 0
nrow = data.nrow
for ind in range(0, nrow):
    valid = data[ind, 0:2]
    temp1 = data[0:ind, 0:2]
    temp2 = data[ind+1:nrow, 0:2]
    if temp1.nrow == 0:
        train = temp2
    elif temp2.nrow == 0:
        train = temp1
    else:
        train = temp1.rbind(temp2)
    h2o_model = H2OGeneralizedAdditiveEstimator(family='gaussian', gam_columns=["C1"])
    h2o_model.train(y='response', training_frame=train, validation_frame=valid)
    mseSum = mseSum + h2o_model.mse(valid=True)
    h2o.remove(valid)   # need to remove them so as not to clog
    h2o.remove(temp1)   # up the memory of the machine
    h2o.remove(temp2)
    h2o.remove(train)
# the leave-one-out cross-validation score is mseSum / nrow{noformat}
Wendy commented: December 21, 2020
Meeting summary with Karthik today:
Wendy commented: December 28, 2020
Met with Karthik and we discussed the following:
Karthik Murthy commented: {noformat}import sys
sys.path.insert(1, "../../../")
import h2o
from tests import pyunit_utils
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator as gam
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
class cv_graph_generator:
loss_data = []
# This test will generate synthetic GAM dataset. If given to a GAM model, it should be able to perform well with
# this dataset since the assumptions associated with GAM are used to generate the dataset.
def test_define_dataset(self):
family = 'gaussian' # can be any valid GLM families
nrow = 100
ncol = 1
realFrac = 1
intFrac = 0
enumFrac = 0
missing_fraction = 0
factorRange= 50
numericRange = 10
targetFactor = 1
numGamCols = 1
min_ratio = 1e-1
num_trials = 10
nfolds = nrow
loss = self.generate_dataset(family, nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
numericRange, targetFactor, numGamCols, min_ratio, nfolds, num_trials)
df = pd.DataFrame(loss)
df.to_csv("loss-data.csv")
print(df)
print("Done")
# self.loss_data.append(self.generate_dataset("binomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, targetFactor, numGamCols, scale, nfolds, scale_div))
#
# self.loss_data.append(self.generate_dataset("binomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, targetFactor, numGamCols, scale, 5, scale_div))
#
# self.loss_data.append(self.generate_dataset("multinomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, 5, numGamCols, scale, nfolds, scale_div))
#
# self.loss_data.append(self.generate_dataset("multinomial", nrow, ncol, realFrac, intFrac, enumFrac, missing_fraction, factorRange,
# numericRange, 5, numGamCols, scale, 5, scale_div))
def generate_dataset(self, family, nrow, ncol, realFrac, intFrac, enumFrac, missingFrac, factorRange, numericRange,
targetFactor, numGamCols, min_ratio=1e-4, nfolds=0, num_trials=1):
if family=="binomial":
responseFactor = 2
elif family == 'gaussian':
responseFactor = 1
else :
responseFactor = targetFactor
trainData = self.random_dataset(nrow, ncol, realFrac=realFrac, intFrac=intFrac, enumFrac=enumFrac, factorR=factorRange,
integerR=numericRange, responseFactor=responseFactor, misFrac=missingFrac)
myX = trainData.names
myY = 'response'
myX.remove(myY)
colNames = trainData.names
colNames.remove("response")
avg_loss = []
scale = 2947.189508523056 * 1000
scaleParam = []
for i in range(2, num_trials + 2):
dec = min_ratio**(1.0/(i - 1))
scale *= dec
scaleParam.append(scale)
m = gam(family=family, gam_columns = colNames[0:numGamCols], lambda_=0, alpha=0, scale=[scale], nfolds=nfolds, fold_assignment="modulo", seed=1)
m.train(training_frame=trainData, x=myX, y=myY)
# loss = 0
# for j in range(nfolds):
# loss += (m.cross_validation_models()[j].mse() / nfolds)
avg_loss.append((-(2 * (i - 1)), m.mse(xval=True)))
f2 = m.predict(trainData)
# to see coefficient, do m.coef()
finalDataset = trainData[myX]
finalDataset = finalDataset.cbind(f2[0])
finalDataset.set_name(col=finalDataset.ncols-1, name='response')
h2o.download_csv(finalDataset, "dataset.csv")
return avg_loss
def random_dataset(self, nrow, ncol, realFrac = 0.4, intFrac = 0.3, enumFrac = 0.3, factorR = 10, integerR=100,
responseFactor = 1, misFrac=0.01, randSeed=7):
fractions = dict()
if (ncol==1) and (realFrac >= 1.0):
fractions["real_fraction"] = 1 # Right now we are dropping string columns, so no point in having them.
fractions["categorical_fraction"] = 0
fractions["integer_fraction"] = 0
fractions["time_fraction"] = 0
fractions["string_fraction"] = 0 # Right now we are dropping string columns, so no point in having them.
fractions["binary_fraction"] = 0
return h2o.create_frame(rows=nrow, cols=ncol, missing_fraction=misFrac, has_response=True,
response_factors = responseFactor, integer_range=integerR,
seed=randSeed, **fractions)
real_part = pyunit_utils.random_dataset_real_only(nrow, (int)(realFrac*ncol), misFrac=misFrac, randSeed=randSeed)
enumFrac = enumFrac + (1-realFrac)/2
intFrac = 1-enumFrac
fractions["real_fraction"] = 0 # Right now we are dropping string columns, so no point in having them.
fractions["categorical_fraction"] = enumFrac
fractions["integer_fraction"] = intFrac
fractions["time_fraction"] = 0
fractions["string_fraction"] = 0 # Right now we are dropping string columns, so no point in having them.
fractions["binary_fraction"] = 0
df = h2o.create_frame(rows=nrow, cols=(ncol-real_part.ncol), missing_fraction=misFrac, has_response=True,
response_factors=responseFactor, integer_range=integerR,
seed=randSeed, **fractions)
return real_part.cbind(df)
def generate_graphs():
    generator = cv_graph_generator()
    generator.test_define_dataset()
    for dataset in generator.loss_data:
        print(dataset)
    print("done")
if name == "main": h2o.init(ip='192.168.1.4', port=54321, strict_version_check=False) pyunit_utils.standalone_test(generate_graphs()) else: h2o.init(ip='192.168.1.4', port=54321, strict_version_check=False) generate_graphs() {noformat}
Wendy commented: Jan 4, 2021 Meeting summary with Karthik
require(gamair); data(engine); attach(engine)
plot(size,wear,xlab="Engine capacity",ylab="Wear index")
The rest of the code can be found on pages 165 to 170 of the document I sent you before. It is chapter 4 of Simon Wood's book.
All the formulas describing the X matrix assume that we have added a column of ones for the intercept coefficient. However, in model._output._dinfo._adaptedFrame there is no column of ones, and that is fine when you use chol.solve to get the coefficients. In the last stage, where you compute X * B, you will need to add the column of ones as the last column of X.
You are going to solve for B one column at a time. In our discussion, each column of B is an array, and the final B will be converted to a frame. There are multiple ways to generate a frame; one is to call new Frame on an array of vectors. Here is an example:
{noformat}Vec[] res = new Vec[ncoly];
// generate keys for all the vectors
Key<Vec>[] keys = ...;  // this line was truncated in the original comment
for (int y = 0; y < ncoly; y++) {
    res[y] = Vec.makeVec(res_array[y], keys[y]);
}{noformat}
You can get a frame by calling {{new Frame(col_names, res)}}, where col_names is a string array.
Wendy commented: The scale parameter controls overfitting if too many knots are specified. Our dataset was generated with a small number of knots, and therefore there is no wiggliness. I have generated a new dataset with more wiggliness; here it is:
[^gam_1Col_40perKnots_2000Rows.csv]
Wendy commented: Try this code and you will see the MSE curve actually is a bowl:
{noformat}import sys
sys.path.insert(1, "../../../")
import h2o
from tests import pyunit_utils
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator as gam
def generate_graphs():
input = h2o.import_file("/Users/wendycwong/temp/gam_1Col_40perKnots_2000Rows.csv")
frames = input.split_frame(ratios=[0.05]) # can change the ratio to higher, I was running out of memory
train = frames[0]
response = 'response'
scale_parameter = [0, 0.0001, 0.001, 0.01, 0.1, 1, 10]
num_knots = [int(0.1*train.nrow), int(train.nrow*0.2), int(0.3*train.nrow), int(0.4*train.nrow), int(train.nrow*0.5)]
xval_mse_total = []
val_mse_total = []
for numKnots in num_knots:
xval_mse = []
val_mse = []
for scale in scale_parameter:
gam_model = gam(family = "gaussian", alpha = 0, Lambda = 0, gam_columns = ["C1"], scale = [scale],
nfolds = train.nrow, fold_assignment="modulo", num_knots=[numKnots])
gam_model.train(x=[], y=response, training_frame = train, validation_frame=frames[1])
xval_mse.append(gam_model.mse(xval=True))
val_mse.append(gam_model.mse(valid=True))
xval_mse_total.append(xval_mse)
val_mse_total.append(val_mse)
print(xval_mse_total)
print(val_mse_total)
if name == "main": h2o.init(ip = "192.168.86.41", port = 54321, strict_version_check=False) pyunit_utils.standalone_test(generate_graphs()) else: h2o.init(ip = "192.168.86.41", port = 54321, strict_version_check=False) generate_graphs(){noformat}
Wendy commented: For cases with a fixed number of knots, if we do not see a bowl, it probably means that no overfitting is occurring, and we can just set the scale close to 0.
Wendy commented: Karthik:
Attached is a good document describing overfitting:
https://www.eecis.udel.edu/~arce/files/Courses/StatLearning/Overfitting%20and%20Regularization.pdf
We can discuss this on Monday.
Wendy
Wendy commented: Jan 11, 2021
Had a meeting with Karthik and we discussed the following:
Wendy commented: Jan 18, 2021
Summarized discussion with Karthik here:
For this week, Karthik will focus on these two tasks:
Karthik Murthy commented: Here is the test dataset I used when generating cross-validation scores:
[^good-dataset.csv]
Wendy commented: Jan 25, 2021:
Summarized discussion with Karthik:
Wendy commented: Feb 11, 2021:
Met with Karthik and here is the summary:
Thanks, Wendy
JIRA Issue Migration Info
Jira Issue: PUBDEV-7865
Assignee: Karthik Murthy
Reporter: Wendy
State: Open
Fix Version: N/A
Attachments: Available (Count: 9)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/5241
https://github.com/h2oai/h2o-3/pull/5332
Attachments From Jira
Attachment Name: gam_1Col_40perKnots_2000Rows.csv Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/gam_1Col_40perKnots_2000Rows.csv
Attachment Name: GAM_doc.pdf Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/GAM_doc.pdf
Attachment Name: good-dataset.csv Attached By: Karthik Murthy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/good-dataset.csv
Attachment Name: image-20201106-231310.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201106-231310.png
Attachment Name: image-20201130-230501.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201130-230501.png
Attachment Name: image-20201201-043904.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201201-043904.png
Attachment Name: image-20201201-043931.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201201-043931.png
Attachment Name: image-20201203-222132.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201203-222132.png
Attachment Name: image-20201203-222148.png Attached By: Wendy File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7865/image-20201203-222148.png
Wendy commented: The Simon Wood book can be found in this JIRA: https://h2oai.atlassian.net/browse/PUBDEV-6781?jql=text%20~%20%22GAM%22 Read section 4.2.3.
I emailed you the other book; read section 3.4.