mikeizbicki / cmc-csci145-math166

Data Mining

Question 5.1 SLT Programming #33

Closed sarahbashir closed 3 years ago

sarahbashir commented 4 years ago

I'm having some trouble running the code I wrote on my computer. It takes an exceptionally long time for low p values and doesn't generate anything at all for middle-to-high p values. Any idea what could be going wrong in my code? I'm looping through the p values from 1 to 7 and generating the plots.

def calculate_err_vs_p(d, sigma, f, m, num_trials=default_num_trials):
    '''
    Plots the sample/true/generalization error for the halfspace-with-polynomial-kernel model
    as a function of the polynomial degree p.
    '''

    max_m_exp = 16
    m_buffer = 3

    # S_test is our test set used for measuring the model's performance;
    # It has a very large size to ensure that the empirical risk on S_test is very close to the true risk
    print('generating dataset... ',end='')
    S_test = generate_dataset(m=2**(max_m_exp+m_buffer),d=d,f=f,sigma=sigma)
    print('done')
    plot_dataset(S_test)

    # these lists store the computed training and test errors
    test_errs = []
    train_errs = []

    # This is the list of all polynomial degrees we will train models with and record train/test errors for;
    # by adjusting this list, you can adjust the x-axis in the plots below.
    ps = [1,2,3,4,5,6,7]

    for p in ps:

        # In order to "smooth" the plots, we will repeat each experiment multiple times
        # as determined by the num_trials parameter.
        # These lists store the raw results from each trial.
        trials_test_accs = []
        trials_train_accs = []

        # loop over each trial
        seed_base = 10
        time_start = time.time()
        for seed in range(seed_base,seed_base+num_trials):

            # generate a training set of size m
            # from the same distribution as our test set;
            # notice that we must explicitly set a unique seed for each trial so that
            # each iteration is actually running on a different training set
            S_train = generate_dataset(m=m,d=d,f=f,sigma=sigma,seed=seed)

            try:   
                # train a linear model;
                # notice that the training currently uses the LogisticRegression model;
                # all of the results will be essentially the same using the other linear models as well
                X, Y = S_train
                X = np.apply_along_axis(lambda x: polynomial_kernel_embedding(x,p),1,X)
                h_S = sklearn.linear_model.LogisticRegression(solver='liblinear',C=1e10)
                #h_S = sklearn.linear_model.Perceptron()
                #h_S = sklearn.linear_model.SGDClassifier()
                #h_S = sklearn.linear_model.PassiveAggressiveClassifier()
                #h_S = sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
                #h_S = sklearn.svm.LinearSVC()
                h_S.fit(X, Y)

                # calculate the training accuracy
                train_acc = h_S.score(X,Y)

                # calculate the test accuracy
                X, Y = S_test
                X = X[:min(2048,m_buffer*m)]
                Y = Y[:min(2048,m_buffer*m)]
                X = np.apply_along_axis(lambda x: polynomial_kernel_embedding(x,p),1,X)
                test_acc = h_S.score(X, Y)        

            # ValueError raised when there's not enough data to perform classification;
            # in this case, we get perfect training accuracy, but perfectly wrong test accuracy
            except ValueError:
                train_acc = 1
                test_acc = 0

            trials_test_accs.append(test_acc)
            trials_train_accs.append(train_acc)
        time_end = time.time()

        # compute the average of our trials
        train_acc = np.mean(trials_train_accs)
        test_acc = np.mean(trials_test_accs)

        # print a debugging statement for each iteration
        print('p=%8d,  train_acc=%0.4f,  test_acc=%0.4f,  time_diff=%dsec'%(
            p,
            train_acc,
            test_acc,
            time_end-time_start
        ))

        # convert the accuracies into errors and store them
        train_errs.append(1-train_acc)
        test_errs.append(1-test_acc)

    # plot the errors
    fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(14,5))
    ax1.set_xscale('log',basex=2)
    #ax1.set_yscale('log')
    ax1.set_ylim([0.0,1.0])
    ax1.set(
        xlabel='polynomial degree = p', 
        ylabel='train error = L_S(h_S)',
    )
    ax1.plot(ps,train_errs)

    ax2.set_xscale('log',basex=2)
    #ax2.set_yscale('log')
    ax2.set_ylim([0.0,1.0])
    ax2.set(
        xlabel='polynomial degree = p', 
        ylabel='test error ≈ L_D(h_S)',
    )
    ax2.plot(ps,test_errs)

    ax3.set_xscale('log',basex=2)
    ax3.set_yscale('log')
    ax3.set(
        xlabel='polynomial degree = p', 
        ylabel='generalization error = |L_D(h_S) - L_S(h_S)|',
    )
    ax3.plot(ps,np.abs(np.array(test_errs)-np.array(train_errs)))

    plt.tight_layout()
    plt.show()
calculate_err_vs_p(
    d=8,
    sigma=0.0,
    f=f_polynomial(p=6),
    m=32,
)
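
One thing worth checking is how quickly the polynomial embedding grows with p: for d=8 inputs there are C(d+p, p) monomials of total degree at most p, so the embedded dimension (and with it the cost of the embedding and of fitting the model) blows up as p increases. The snippet below is only a diagnostic sketch, not part of the assignment code; it uses sklearn.preprocessing.PolynomialFeatures as a stand-in for polynomial_kernel_embedding (whose exact definition is elsewhere in the notebook) just to print those dimensions.

import math

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d = 8
for p in range(1, 8):
    # number of monomials in d variables with total degree <= p
    dim = math.comb(d + p, p)
    # stand-in embedding of a single point, to confirm the dimension
    emb = PolynomialFeatures(degree=p).fit_transform(np.ones((1, d)))
    print('p=%d  C(d+p,p)=%6d  embedded dim=%6d' % (p, dim, emb.shape[1]))

For d=8 and p=7 this is already 6435 features per point, which is one plausible reason the higher-degree runs are so slow or never finish.
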
karmishthaseth commented 4 years ago

I've been having the same problem! I keep stopping the kernel and restarting it, but it's not working.

hanaknight commented 4 years ago

I got the computational part to work all the way from p=1-10, but my problem is in the graph-generation segment. I keep getting this error: ValueError: x and y must have same first dimension, but have shapes (1,) and (10,)
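
That particular ValueError just means the x and y sequences handed to ax.plot ended up with different lengths (here 1 vs 10), so one of ps / train_errs / test_errs is being built or passed differently from the others. Below is a minimal sketch that reproduces the same message; the lists are hypothetical, not hanaknight's actual code.

import matplotlib.pyplot as plt

ps = list(range(1, 11))        # 10 polynomial degrees for the x-axis
train_errs = [0.1] * len(ps)   # one error value per degree

fig, ax = plt.subplots()
ax.plot(ps, train_errs)        # fine: both sequences have length 10

# A call like the one below raises the reported error, because the x sequence
# has length 1 while the y sequence has length 10:
#   ax.plot([ps[-1]], train_errs)
#   ValueError: x and y must have same first dimension, but have shapes (1,) and (10,)
plt.close(fig)

The usual fix is to make sure the error lists are appended exactly once per value of p inside the for p in ps: loop, and that the full ps list (not a single p) is passed to each plot call.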

mikeizbicki commented 4 years ago

I think questions like this will require Zoom meetings to go over. I'm happy to meet in office hours with anyone who is still struggling to get it to work.

benfig1127 commented 4 years ago

I am also still having trouble getting it to work, so that would be super helpful!