Test the correctness of the Gradient Descent implementation used in the "online" LR...
The results may vary based on:
Using Batch Gradient Descent (one matrix/vector update per iteration, computed over the whole dataset) vs Stochastic Gradient Descent (one update per sample); see the sketch after this list
Doing multiple passes (epochs) over the whole dataset vs just one (multiple passes make more sense for Batch GD)
Dividing the loss by the number of samples n, or by 2n for convenience, since the factor of 2 cancels when differentiating the squared error; see the formulas after this list
Taking the norm instead of the squared norm (this can actually be harmful: the loss becomes less sensitive near the optimum and non-smooth at zero, compare |x| with x^2)
Setting the learning rate too high (the iterates diverge) or too low (they never converge close enough within the iteration budget); see the sketch after this list
Returning the parameter vector Alpha with the best MSE seen during training instead of the last Alpha
Shuffling the dataset first (especially important for Stochastic GD when the samples arrive in a systematic order)
Transforming the input data before training (e.g. taking logs when a plot shows an exponential dependency; features with large magnitude but small variance, such as years, can also cause numerical problems and benefit from centering and scaling); see the preprocessing sketch after this list
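
A minimal NumPy sketch of the batch vs stochastic variants under a plain least-squares objective; the function names, the weight vector w, and the default hyperparameters are illustrative assumptions, not the repo's actual API:

```python
import numpy as np

def batch_gd(X, y, lr=0.01, epochs=100):
    """Batch GD: one matrix/vector update per pass over the whole dataset."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n  # gradient of (1/2n) * ||Xw - y||^2
        w -= lr * grad
    return w

def stochastic_gd(X, y, lr=0.01, epochs=100, seed=0):
    """Stochastic GD: one update per sample, reshuffled each epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]  # single-sample gradient
            w -= lr * grad
    return w
```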
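
For reference, the loss conventions from the list above: dividing by 2n is convenient because the 2 cancels on differentiation, and the |x|-vs-x^2 comparison shows why dropping the square hurts smoothness:

```latex
J(w) = \frac{1}{2n} \sum_{i=1}^{n} (x_i^\top w - y_i)^2,
\qquad
\nabla J(w) = \frac{1}{n} \sum_{i=1}^{n} (x_i^\top w - y_i)\, x_i

% Norm vs squared norm in one dimension:
% \frac{d}{dx}\, x^2 = 2x (smooth, shrinks near the optimum), while
% \frac{d}{dx}\, |x| = \operatorname{sign}(x) (undefined at 0, constant magnitude).
```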
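
A sketch combining the learning-rate and best-Alpha points: it keeps the parameters with the lowest MSE seen so far instead of the final iterate, and stops early if the loss blows up because the learning rate is too high. The function name and the early-stopping check are assumptions for illustration:

```python
import numpy as np

def gd_keep_best(X, y, lr, epochs=200):
    """Batch GD that returns the best-MSE parameters, not the last ones."""
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_mse = w.copy(), np.inf
    for _ in range(epochs):
        residual = X @ w - y
        mse = residual @ residual / n
        if not np.isfinite(mse):  # loss blew up: learning rate too high
            break
        if mse < best_mse:  # remember the best iterate seen so far
            best_w, best_mse = w.copy(), mse
        w = w - lr * (X.T @ residual) / n
    return best_w, best_mse
```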
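
And a preprocessing sketch for the shuffling and input-transformation points; the standardization and log transform are generic techniques assumed here for illustration, not taken from the repo:

```python
import numpy as np

def preprocess(X, y, seed=0):
    """Shuffle, then center and scale each feature column.
    Large-magnitude, low-variance columns (e.g. calendar years)
    become well-conditioned after standardization."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    X, y = X[idx], y[idx]
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    X = (X - X.mean(axis=0)) / std
    return X, y

# If a scatter plot suggests y grows exponentially in x, fitting a line
# to log(y) linearizes the problem: y ~ exp(a*x + b)  <=>  log(y) ~ a*x + b
```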