mlandry22 / kaggle

Public Kaggle Code and Info

Modern logistic regression - fast_solution question #3

Open rae89 opened 8 years ago

rae89 commented 8 years ago

Hi, I was wondering if you could explain the reasoning behind the way the wTx inner product is implemented in the get_p method. From reading the code, it seems that the hashed x features are not being used in the computation with the weight vector. Shouldn't the inner product in the predict function be wTx += w[i] * x[i], which would follow the lines of a typical logistic regression computation? As it stands, it looks like the code is just summing the computed weights, with no effect from the actual feature vector.

mlandry22 commented 8 years ago

Hi Rae89

Versions 1 and 2 have a get_p method, so I'm assuming you're talking about one of those, copied below.

You are correct that in a standard logistic regression you would multiply by x[i], but notice that the loop only runs over the indices stored in x, which controls whether each weight is used. x is a list of all the positive features (every feature is either a 0 or a 1), so the only weights summed are those whose feature is a 1, and the multiplication is therefore unnecessary. You can look at that loop as an efficient way to fetch only the weights that are needed for each prediction.

from math import exp  # exp comes from the math module in the full script


# C. Get probability estimation on x
# INPUT:
#     x: features
#     w: weights
# OUTPUT:
#     probability of p(y = 1 | x; w)
def get_p(x, w):
    wTx = 0.
    for i in x:  # do wTx
        wTx += w[i] * 1.  # w[i] * x[i], but if i in x we got x[i] = 1.
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))  # bounded sigmoid
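
To make that concrete, here is a minimal stand-alone sketch (toy weights and a toy 0/1 feature vector I made up, not part of the original script) showing that summing the weights at the indices stored in x gives the same wTx as the full dot product against a dense 0/1 vector:

D = 8                                             # tiny hash space for the example
w = [.5, -.2, 0., 1.3, 0., .7, 0., -.4]           # made-up weights
x_dense = [0, 1, 0, 1, 0, 0, 0, 1]                # dense 0/1 feature vector
x = [i for i, v in enumerate(x_dense) if v == 1]  # sparse form: indices of the 1's

wTx_dense = sum(w[i] * x_dense[i] for i in range(D))
wTx_sparse = sum(w[i] for i in x)                 # what the loop in get_p does
assert wTx_dense == wTx_sparse                    # both equal -.2 + 1.3 + (-.4) = .7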
rae89 commented 8 years ago

Thanks for the reply. I still don't understand why x is a 0 or 1, since x is being generated by the data method:

    # build x
    x = []
    for key in row:
        value = row[key]

        # one-hot encode everything with hash trick
        index = abs(hash(key + '_' + value)) % D
        x.append(index)

    yield t, date, ID, x, y

Since x was built by the method above, wouldn't x just contain hashed index values? Or am I missing something?

mlandry22 commented 8 years ago

You're welcome. It is the index of the hashed value, rather than the hash of an index value. So "cat" gets hashed, and the result of that hash is all you need to encode the existence of "cat" for that particular training row. The index is the feature, and the presence of that index guarantees the feature is 1. So X winds up storing only the indexes that are 1; everything else is a 0.

This way you can check for features quickly, update the weights quickly, and guarantee that any new values are handled (perhaps not the way you'd want them to be, but it will not break). There is effectively no such thing as a feature value: the entire representation is 0 or 1. Numerics hash into the same space as categoricals, so you lose all numeric ordering with this encoding scheme, but it makes the model very quick to implement.

In the paper I read most thoroughly for this talk, Google reported that using a dictionary to store features performed better for them. It's faster not to worry about a dictionary, but there can be an accuracy tradeoff (hashing collisions).
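
As a rough illustration of that description, here is a tiny stand-alone sketch (a made-up row and a toy D, not the original data() generator) of how one row's key/value pairs become the indices stored in x:

D = 2 ** 6                                    # toy hash space; the scripts here use 2 ** 20 and up
row = {'animal': 'cat', 'color': 'black'}     # one made-up training row

x = []
for key in row:
    value = row[key]
    index = abs(hash(key + '_' + value)) % D  # e.g. 'animal_cat' -> some bucket in [0, D)
    x.append(index)                           # note: Python 3 salts str hashes per process
                                              # unless PYTHONHASHSEED is fixed

# x now holds the bucket indices whose implicit feature value is 1; every
# other bucket in [0, D) is an implicit 0 for this row. Any other row with
# animal=cat hashes to the same bucket, so all 'cat' rows share one weight
# (at the risk of collisions with unrelated features).
print(x)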

rae89 commented 8 years ago

So there are D bins that each possible value can be hashed into, and by applying the hash function to feature_name + feature_value we get some index, correct? But this index isn't a 1 or 0 quite yet, right? Does the one-hot encoding transform the hashed index into a 1 or a 0, or is the one-hot encoding performed first and then that value hashed? Is each row of the original dataset represented by the x vector, where each row of the x vector is a string of binary digits, or is it a single value, 1 or 0? In your cat example, if we hash('cat') % D it will output a value that's less than D, and this value is the key that references the 'cat' value in the dictionary. How is this key converted to a 1 or 0?

mlandry22 commented 8 years ago

There is no explicit one-hot encoding; the hash function is implicitly doing the one-hot for you. There is no stored 0 or 1, simply presence or absence in the index list. There are D bins, but for a particular row all of them are 0 except those stored in X, which represent a 1. So yes, the index we get from hashing feature_name + feature_value is a 1 or a 0: it is a 1, every time.

It's a sparse representation, so X stores the indices of the 1's, and the 0's are never stored. Every row starts with an empty X vector, and as the features are read, the X vector fills up with indices. Those indices aren't anything other than a fast way to ensure all "cat" instances share the same weight, used for the weight updates and the predictions. Having an index in X means the row contains that feature, and thus the row should use the existing weight to predict, and then update that same weight after the loss is computed from the prediction.
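
To show how the same indices drive both the prediction and the update, here is a simplified stand-alone sketch (plain SGD on log loss with invented bucket indices; the actual scripts use an adaptive learning rate, so treat this as an approximation of the idea rather than the exact code):

from math import exp

def get_p(x, w):
    wTx = sum(w[i] for i in x)                        # only the buckets present in the row
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))

def update_w(w, x, p, y, alpha=.1):
    for i in x:                                       # the same buckets get updated
        w[i] -= (p - y) * alpha                       # gradient of log loss w.r.t. w[i] when x[i] = 1

D = 2 ** 6
w = [0.] * D
x = [5, 17, 42]          # made-up bucket indices for one row
p = get_p(x, w)          # predict using only the weights this row touches
update_w(w, x, p, y=1.)  # then nudge exactly those same weights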

rae89 commented 8 years ago

Ok, now I think I have a better understanding of what's going on here, but I have one more question, hopefully the last one. Since there are no more features once they are all hashed and indexed, are there technically D weights in this setup, versus the original dataset, where there would be a weight for each feature? The reason I ask is because the weight vector w is being indexed by i, which are the indices of x, and therefore it eventually populates up to D entries. Thanks again for your feedback; it helps being able to talk about the problem.

mlandry22 commented 8 years ago

Glad to help. You might find it helpful to run a really small model and see what happens. Change the bits from 20 to something small like 4 or 6, and then you can print out the entire weight space on each iteration and follow what it is doing a little bit.
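
For example, a stand-alone toy along those lines (the function names and the adaptive-rate update loosely follow the fast_solution scripts, but the rows are invented, so treat it as a sketch rather than the original code) could look like:

from math import exp, sqrt

D = 2 ** 4                  # "bits" dropped from 20 to 4: only 16 buckets
alpha = .1
w = [0.] * D                # weights
n = [0.] * D                # per-bucket update counts for the adaptive rate

def get_p(x, w):
    wTx = sum(w[i] for i in x)
    return 1. / (1. + exp(-max(min(wTx, 20.), -20.)))

def update_w(w, n, x, p, y):
    for i in x:
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
        n[i] += 1.

toy_rows = [([1, 7, 12], 1.), ([3, 7], 0.), ([1, 12], 1.)]  # (x, y) pairs
for t, (x, y) in enumerate(toy_rows):
    p = get_p(x, w)
    update_w(w, n, x, p, y)
    print(t, ['%.3f' % wi for wi in w])  # the whole weight space fits on one line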

There will only be weights where D has at least one feature, though that entire space of D could include a weight. And if you look ahead to the third version, it includes an L1 penalty, so even when a feature exists, you may still wind up with a 0 weight unless there is enough support for it. For training, that's a matter of math more than memory; but if you lock in a model and move it to a production setting, it helps to store only the weights that you need, for a smaller model.

I've seen statements such as "often your hash space is only 30% utilized" or something like that.
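
A hypothetical post-training step illustrating both points (the surviving weights below are invented; w would be whatever list training produced):

D = 2 ** 6
w = [0.] * D
w[5], w[17], w[42] = .8, -1.1, .3        # pretend only these survived training / the L1 penalty

used = sum(1 for wi in w if wi != 0.)
print('utilization: %.1f%%' % (100. * used / D))

compact = {i: wi for i, wi in enumerate(w) if wi != 0.}  # small model to ship to production
# at prediction time: wTx = sum(compact.get(i, 0.) for i in x)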

rae89 commented 8 years ago

And by "D has at least one feature" do you mean D has at least one hashed value of key+value?

mlandry22 commented 8 years ago

Yes. Most values of D will never have a feature (i.e., never be the result of hashing key+value % D), since the general idea is to make D as high as you can fit in memory (I think I used 2^27 or 2^28 with 8GB RAM).
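
Rough back-of-the-envelope numbers behind that memory limit (CPython-specific assumptions on a 64-bit machine, not figures from the thread):

D = 2 ** 27
print(D)                  # 134,217,728 buckets
print(D * 8 / 2. ** 30)   # ~1 GiB just for the pointers in the list [0.] * D
# Each bucket whose weight becomes a distinct Python float adds roughly
# 24 more bytes, so a heavily used 2 ** 27 space can take several GiB,
# hence the advice to set D as high as memory allows and no higher.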