hunar4321 / RLS-neural-net

Recursive Least Squares (RLS) with Neural Network for fast learning
MIT License

MNIST dataset eval #3

Open snapo opened 1 year ago

snapo commented 1 year ago

Very interesting behaviour...

When splitting with X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.94, random_state=42)

i.e. using only 6% of the data for training (with the one-hot encoded labels) and 94% for testing, we still get 81% accuracy. This is an absolutely amazing result!

It seems Method 1 > Method 3 > Method 2

I ran multiple different tests and they all came up with the same result: Method 1 is the best of the three.

As far as I can remember, it is also a record to get 81% accuracy on MNIST within 15 seconds on a single CPU thread!!! I have never seen this before (in Python, not C)...

I will probably also try it with CIFAR-10 and see how well it does...

Result:

----------------------------
Size of training set: 4200
Size of testing set: 65800
----------------------------
De-correlating all the xs with each other
----------------------------
Method 1. regression on ys using multiple y_classes in the form of one_hot matrix
train accuracy: 0.9147619047619048
test accuracy: 0.8130851063829787
---------------------------------
Method 2. regression on ys with simple rounding & thresholding of the predicted y classes.....
train accuracy: 0.27404761904761904
test accuracy: 0.2289209726443769
---------------------------------
Method 3. regression on ys using multiple y_classes in the form of random vectors (embeddings)
train accuracy: 0.7904761904761904
test accuracy: 0.6704103343465045
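
For anyone who wants to reproduce the Method 1 setup, here is a minimal sketch of least-squares regression onto one-hot targets, assuming X is an (n_samples, 784) NumPy array of flattened pixels and y is an integer label vector; the repo's actual RLS solver and its de-correlation step differ in the details:

import numpy as np
from sklearn.model_selection import train_test_split

def one_hot(y, n_classes=10):
    out = np.zeros((y.size, n_classes))
    out[np.arange(y.size), y] = 1.0
    return out

# X: (n_samples, 784) MNIST pixels, y: integer labels 0-9 (as in the split above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.94, random_state=42)

# fit one linear map from pixels to the 10 one-hot outputs in the least-squares sense
W, *_ = np.linalg.lstsq(X_train, one_hot(y_train), rcond=None)

# predict by taking the arg-max over the 10 regression outputs
pred = (X_test @ W).argmax(axis=1)
print("test accuracy:", (pred == y_test).mean())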
hunar4321 commented 1 year ago

Thanks for the addition. You can increase the performance further by passing the data through a non-linear layer of weights, e.g. ReLU. I have added the following lines to the beginning of the code, which improves the performance of Method 1 to 0.91 on the testing set. The more nodes, the better the performance (with 1000 nodes you can reach 95%), but the computational cost becomes very high due to the quadratic nature of the "xs" de-correlation. The risk of over-fitting also increases with more nodes.


import numpy as np

def activate(w, x):
    # random projection followed by a ReLU non-linearity
    linear = w.T @ x
    # out = linear  # use this instead to skip the non-linearity
    out = np.maximum(linear, 0)  # relu
    return out

nodes = 500
w = np.random.randn(xs.shape[0], nodes)  # random weights: (n_features, nodes)
xs = activate(w, xs)                     # xs holds samples in columns

# X_train / X_test hold samples in rows, so transpose in and out
X_train = activate(w, X_train.T).T
X_test = activate(w, X_test.T).T
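
As a rough illustration of why the run time climbs so quickly with the node count: de-correlating the expanded xs involves a nodes x nodes feature-correlation (Gram) matrix, so the amount of work grows with the square of nodes. A tiny sketch, assuming NumPy, and not the repo's actual de-correlation code:

import numpy as np

n_samples = 4200
for nodes in (100, 500, 1000):
    h = np.random.randn(nodes, n_samples)   # stand-in for the ReLU feature matrix
    gram = h @ h.T                          # nodes x nodes correlation of the features
    print(nodes, gram.shape, gram.size)     # number of entries grows as nodes**2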

P.S. You can also increase the performance of Method 3 by increasing the embed_size.
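
For completeness, a minimal sketch of the Method 3 idea (regression onto random class embeddings), again assuming plain NumPy; embed_size, the embedding matrix E, and the nearest-embedding decoding are illustrative guesses, not the repo's exact code:

import numpy as np

embed_size = 64                              # increasing this is what helps Method 3
n_classes = 10
E = np.random.randn(n_classes, embed_size)   # one fixed random vector per class

# regress pixels onto the embedding of their class instead of a one-hot vector
W, *_ = np.linalg.lstsq(X_train, E[y_train], rcond=None)

# decode each prediction by finding the nearest class embedding
scores = X_test @ W                                                  # (n_test, embed_size)
dists = ((scores[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)     # (n_test, n_classes)
pred = dists.argmin(axis=1)
print("test accuracy:", (pred == y_test).mean())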