trevorlapay / UNMFall18_ML_Project2_NB_LR

Machine Learning Project 2

Logistic Regression #2

Open trevorlapay opened 6 years ago

trevorlapay commented 6 years ago

I think Naive Bayes was relatively easy, but I'm a little unsure about what we need to do for LR. I don't see anyone posting about this on Piazza, so either nobody is confused about it, or they just haven't gotten to it yet.

My vague conception of what we need to do (a rough numpy sketch of how I picture these pieces fitting together is below the list):

a) Create constants: k, n, learning rate, penalty. Easy; these are givens or things we tinker with.

b) Create the Delta matrix. Not 100% sure what this is. Equation 29 is the gradient update, something like a cost function, used to determine the weights. The "delta" suggests a gradient descent matrix, but the project description makes it sound like it may hold a trivial value (a one or a zero) depending on class. I'm confused. It isn't the weight matrix, which is calculated below in (e). What does this matrix do?

c) X = matrix of all training examples.

d) Y = vector of actual classifications.

e) W = weights (everyone else on Earth calls these values theta; I don't know why she and the author use W). It looks like these are calculated from the partial derivatives of the cost function (which lives in the delta matrix? maybe?) via the update step for logistic regression in the project description.

f) Probability values. As far as I understand it, these are just the normalized application of equations (27) and (28) using the weights from (e), although I don't fully understand why we need (28). It's like the (n-1)-parameters situation where you can derive the final parameter from the others. But OK.
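For my own sanity, here's a minimal numpy sketch of how I picture these pieces, with made-up sizes. This is just my reading of equations (27)-(29), not anything from the project materials:

```python
import numpy as np

# Made-up sizes: m examples, n features, k classes.
m, n, k = 6, 4, 3
rng = np.random.default_rng(0)

# c) X: all training examples, with a leading column of 1s for the bias weight.
X = np.hstack([np.ones((m, 1)), rng.random((m, n))])
# d) Y: the true class of each example.
Y = rng.integers(0, k, size=m)
# b) Delta: k x m indicator matrix, delta[j, i] = 1 iff example i has class j.
delta = np.zeros((k, m))
delta[Y, np.arange(m)] = 1
# e) W: one weight row per class (including the bias term).
W = np.zeros((k, n + 1))
# f) Probabilities: exponentiate W X^T and normalize each column to sum to 1.
scores = np.exp(W @ X.T)                        # k x m
P = scores / scores.sum(axis=0, keepdims=True)  # each column sums to 1
```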

So once we have all of these components, we can classify a new example by plugging its feature values into the weighted function built from the pieces above and pulling out the max probability. I'm uncertain exactly how you use the probability values to classify a new example, or what to do with the conditional data log likelihood equation. Is that the classifier?

I feel like I have a very dim intuition for how these pieces fit together because we didn't go through a concrete example in class.

trevorlapay commented 6 years ago

Also, at minute 56 in this lecture:

https://vod1.unm.edu/Mediasite/Play/70c197d6a76942fa8f3be44555748fba1d?catalog=a8e306d7-032a-460a-9ee2-8c4b98341700

She mentions that we can use scikit-learn to try to converge (which I assume is related to LR using gradient descent to find the maximum). She makes it sound like LR is really trivial and that all we need to do is use the update function and tune the parameters. Am I way overthinking this?
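For reference, a scikit-learn sanity check really is only a few lines. This uses a toy dataset as a stand-in for our project data:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data stand-in for our project data.
X, y = load_digits(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)  # the solver handles multiclass and convergence for us
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_valid, y_valid))
```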

trevorlapay commented 6 years ago

Sorry for the spam here, just thinking through a few things...

After reviewing the lectures, I think it's clear that the delta matrix is just a matrix of 0s and 1s depending on the class of each example. It's k by m: just a table identifying the class of every example. If that's true, though, I'm not sure what she means by "using the delta equation..."

This leaves me wondering how to calculate the probability matrix. It appears that this matrix depends on a set of weights, W. But looking at the update algorithm, which I believe is used to generate the weights, calculating the weights matrix depends on the probability matrix. Each seems to depend on the other, which can't be right.

In context, it looks like we need to generate the probability matrix before going into the update algorithm, but it's unclear to me how to do this.

Once we have everything, it's a little clearer that we use the conditional likelihood to find the max probability given the weights and feature values, but I'm still hazy on the ordering: when do we calculate the weights versus the probabilities?

hankyusa commented 5 years ago

I made a video (https://youtu.be/WkFZsf6gOi0) to sum up my talk with the instructor.

trevorlapay commented 5 years ago

Thanks for clearing that up. Did you ask about the weights matrix? I'm still confused about how we calculate the weights matrix versus the probability matrix, since the probability matrix relies on the weights matrix, and the update step (for the weights, I assume) relies on the probabilities. Do we just initialize the weights matrix to some arbitrary value and run the update on that?


trevorlapay commented 5 years ago

I should say that I get that the update step is gradient descent. The gradient descent she taught in class looks a bit different from the one in the project description, though, so it's possible I'm misunderstanding how the probability matrix is generated (is it actually using the weights from the weight matrix, or am I reading it wrong?)


hankyusa commented 5 years ago

The equation is definitely a little different; we're not even using the full sigmoid function, right? The equations seemed out of nowhere to me...

The probability matrix is going to use the weight matrix, though. We're going to optimize the weight matrix so the probability matrix is as accurate as possible (95%) on the training set, and then hope those tuned weights generalize to our validation set.

If we initialize the weight matrix to 0, or to something between 0 and 1, we'll be able to compute the first iteration of the probability matrix, and then use the update step of gradient descent to tune the weight matrix.

It should be really wrong on the first iteration, but the delta function is supposed to push the weights to the correct values eventually.
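Something like this rough numpy sketch (my own toy version, not the project code):

```python
import numpy as np

def train_weights(X, delta, eta=0.01, lam=0.01, iters=1000):
    """X: m x (n+1) examples with a bias column; delta: k x m one-hot labels."""
    k = delta.shape[0]
    W = np.zeros((k, X.shape[1]))        # initialize weights (0 works; anything in [0, 1) too)
    for _ in range(iters):
        scores = np.exp(W @ X.T)         # k x m, computed from the *current* weights
        P = scores / scores.sum(axis=0)  # probability matrix, columns sum to 1
        # Update step: push P toward delta, shrink weights via the penalty term.
        W += eta * ((delta - P) @ X - lam * W)
    return W
```

The key point is that P is recomputed from the current W inside the loop, so there's no circular dependency.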

trevorlapay commented 5 years ago

THANK YOU. This is what I was misunderstanding.


trevorlapay commented 5 years ago

So if I understand correctly, the high-level steps for running logistic regression, assuming we're not using a {1, 0} delta matrix and are instead running it class by class on grouped example matrices, look like this (a rough sketch of steps 4-6 follows the list):

1) Set constants (learning rate, penalty term, number of classes).

2) Initialize individual matrices of example data by class, either as 20 individual files or in memory (I expect doing this in memory every time would take forever).

3) Initialize Y, the vector of true classifications.

4) Initialize a single weights matrix to some value with 0 < w < 1.

5) Create 20 probability matrices using the exp function by matrix-multiplying W by X transpose (one for each class). Since we're grouping examples by class, I think this means we can use 20 weight vectors (the weight matrix is class by attribute, and we only care about the particular class we're working with?). Fill in a column of all 1s and normalize at the end so the values sum to one.

6) Now that we have everything, converge class by class. We run the update step, which no longer uses the delta matrix and instead uses 1 in its place, since we know all examples are part of the given class. The update step stops once it changes the weights by less than some threshold.

7) Once we have all of our weights, we can use the MLE formula at the top of model 3, which I think will tell us whether an example belongs to a given class if the log likelihood is > 0 (or do we take the argmax for the example across all classes?).
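Here's roughly what I mean by steps 4-6 in numpy, under my reading of the update rule (names, defaults, and the stopping tolerance are all made up):

```python
import numpy as np

def converge(X, delta, eta=0.001, lam=0.001, tol=1e-4, max_iters=10000):
    """X: m x (n+1) examples with a bias column; delta: k x m one-hot labels."""
    # Step 4: initialize weights to values with 0 < w < 1.
    rng = np.random.default_rng(0)
    W = rng.uniform(0, 1, size=(delta.shape[0], X.shape[1]))
    for _ in range(max_iters):
        # Step 5: probability matrix from exp(W X^T), columns normalized to sum to 1.
        scores = np.exp(W @ X.T)
        P = scores / scores.sum(axis=0)
        # Step 6: update step; stop once it barely changes the weights.
        step = eta * ((delta - P) @ X - lam * W)
        W += step
        if np.abs(step).max() < tol:
            break
    return W
```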

hankyusa commented 5 years ago

That looks right to me. I think what you're doing in place of delta makes sense. To make a classification prediction I think we would run the document data through all of our class matrices and take the argmax as the result.
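As a sketch, the prediction is a one-liner. Since exp and the per-column normalization are monotone, the raw scores give the same argmax as the probabilities:

```python
import numpy as np

def classify(W, X):
    """Pick, for each example, the class whose score is largest."""
    return np.argmax(W @ X.T, axis=0)  # no need to exponentiate or normalize for the argmax
```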

hankyusa commented 5 years ago

I finished logistic regression in ConstructionZone2.py. It got 89% on the test data, but the parameters still need tuning. I also wrote functions to split the labeled data (training.csv) into a training set and a validation set, plus generic validate and test functions. I didn't comment my code. Sorry.
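Not the actual ConstructionZone2.py code, but the split amounts to something like this (a shuffle-once-and-slice sketch; names are made up):

```python
import numpy as np

def split_labeled(X, y, frac=0.8, seed=0):
    """Shuffle once, then carve the labeled data into training and validation sets."""
    idx = np.random.default_rng(seed).permutation(len(y))
    cut = int(frac * len(y))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```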