marin-hyatt / redshift_data_project

Machine learning project for Alex Malz

Improving Linear Regression #5

Open marin-hyatt opened 6 years ago

marin-hyatt commented 6 years ago

Find out how to make a better linear regression. This might include using filtered data or using a different algorithm.

aimalz commented 6 years ago

Some ideas:

marin-hyatt commented 6 years ago

I have a problem that I don't know how to fix, and I don't know why it's happening. The second cell is failing in both of my notebooks, which shouldn't be happening because it has been working the whole time. I tried a few more algorithms, like k-nearest neighbors, Gaussian regression, and support vector regression, but none of them worked better than the linear regression, so I gave up on those and started to remove the outliers. However, I couldn't run the cell because the program failed to read the file! Do you have any idea what could be happening?
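
For reference, here's roughly the kind of outlier cut I was planning once the file reads again. This is only a sketch: the dataframe, column names, and the 3-sigma threshold are all placeholders for the real data.

import numpy as np
import pandas as pd

# Placeholder data standing in for the real magnitudes and redshifts.
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.normal(size=500),
                   'redshift': rng.normal(size=500)})

# Keep only rows that sit within 3 sigma of the mean in every column.
mask = pd.Series(True, index=df.index)
for col in df.columns:
    mu, sigma = df[col].mean(), df[col].std()
    mask &= (df[col] - mu).abs() < 3 * sigma

filtered = df[mask]
print(f'kept {len(filtered)} of {len(df)} rows')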

aimalz commented 6 years ago

Hmmm, it looks like there are a few random characters thrown in on line 10 of the .csv file. Maybe some text ended up there instead of in another window at some point? I think manually removing the letters should fix the problem, and we'll investigate further if it happens again. I'm really looking forward to seeing your new results!
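
Something like this might help find and strip the stray characters without editing by hand. It's only a sketch: the filename is a guess, and it assumes the first row is a header and every other row should be purely numeric.

import re

# Scan the CSV for non-numeric junk and write a cleaned copy.
with open('data.csv') as f:   # filename is a guess -- use yours
    lines = f.readlines()

cleaned = [lines[0]]  # keep the header row as-is
for i, line in enumerate(lines[1:], start=2):
    # keep only digits, signs, decimal points, exponents, commas, and newlines
    stripped = re.sub(r'[^0-9eE+\-.,\n]', '', line)
    if stripped != line:
        print(f'line {i} had stray characters: {line.strip()!r}')
    cleaned.append(stripped)

with open('data_cleaned.csv', 'w') as f:
    f.writelines(cleaned)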

marin-hyatt commented 6 years ago

Hi, I'm a little stumped because I tried ridge regression and it didn't go so well. I've tried a ton of algorithms and they don't seem to fit very well even if it seems like they should. Is there a factor I'm not considering?

aimalz commented 6 years ago

Heh, almost certainly! Each technique makes assumptions, and not every dataset is compatible with any given approach -- that's why there are so many methods out there; it's not a one-size-fits-all situation. Let's go through the ways each algorithm fails to build intuition for their strengths and weaknesses, and see what about this dataset might be problematic or how we can adapt to make the methods work better.
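
If it helps, here's a rough way to line the methods up side by side with cross-validation. The synthetic X and y below are just stand-ins for however you've loaded the magnitudes and redshifts, and the model settings are defaults, not recommendations.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 4 features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + 0.1 * rng.normal(size=300)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'knn': KNeighborsRegressor(n_neighbors=10),
    'svr': SVR(kernel='rbf'),
}

# Compare 5-fold cross-validated mean squared error for each model.
for label, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f'{label}: MSE = {-scores.mean():.4f}')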

marin-hyatt commented 6 years ago

I'm a little confused, because the linear regression function isn't creating an individual line for each graph. It's just fitting a line to the first graph and plotting the same line for the rest.

nm_x_names = ['a', 'b', 'c', 'd']
chi_square_error_list_2 = []
for i in range(len(nm_x_names)):
    name = nm_x_names[i]

    plt.scatter(nm_df[name], y, color='black', s=1, alpha=0.5)

    plt.plot(np_x_for_graph_2.T[i], y_for_graph_2, color='blue', linewidth=2)
    plt.xlabel(name)
    plt.ylabel("Redshift")
    plt.show()

    print(chi_square_error(y, np.dot(coefficients_2, A_2)))

I don't know how to attach the output, but it's all the same line and the same chi-square error, which makes no sense.

EDIT: I fixed this problem, but now the lines are inaccurate. I think this is the same problem I had with scikit-learn.

aimalz commented 6 years ago

Eureka, I figured out why the projected lines don't look right! To prepare for Tuesday, could you make plots with all pairs of x components showing the points for the data as well as the lines from the gridded x vectors we used for plotting? And how do you feel about mathematical proofs?

marin-hyatt commented 6 years ago

That's awesome! What exactly do you mean by pairs of x components? Also, are you saying that I should keep the lines from one of my previous graphs using the x vectors? Sorry, I'm a little confused. Anyway, I'm comfortable with proofs. I'm not sure I could understand a complex proof on the first go or write my own, but I've worked with them before.

aimalz commented 6 years ago

Ah, by pairs of x components, I mean a vs. b, a vs. c, a vs. d, b vs. c, b vs. d, and c vs. d, in your naming convention, as well as the corresponding components of the linear gridded points made for plotting (x_for_plot, or something like that).
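
Something along these lines is what I have in mind. It's just a sketch: nm_df and x_for_plot here are synthetic stand-ins for your dataframe and the gridded vectors, so swap in whatever you've actually called them.

from itertools import combinations
import numpy as np
import matplotlib.pyplot as plt

nm_x_names = ['a', 'b', 'c', 'd']

# Stand-ins for the real dataframe and the gridded points used for plotting.
rng = np.random.default_rng(0)
nm_df = {name: rng.normal(size=200) for name in nm_x_names}
x_for_plot = np.linspace(-3, 3, 50)[:, None] * np.ones(len(nm_x_names))

# Scatter every pair of x components, with the corresponding components
# of the gridded x vectors overplotted as a line.
for (i, name_i), (j, name_j) in combinations(enumerate(nm_x_names), 2):
    plt.scatter(nm_df[name_i], nm_df[name_j], color='black', s=1, alpha=0.5)
    plt.plot(x_for_plot.T[i], x_for_plot.T[j], color='blue', linewidth=2)
    plt.xlabel(name_i)
    plt.ylabel(name_j)
    plt.show()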

marin-hyatt commented 6 years ago

OK, I committed something. Do you mind checking it over to make sure I plotted the right data? But anyway, the graphs certainly look interesting!

marin-hyatt commented 6 years ago

I've hit a small obstacle. I don't know how to input information to the SDSS website to get the redshift in my data. I forgot to write down exactly how I did it last time, and the website doesn't have a clear button for it or anything. Do I have to change the SQL query?

aimalz commented 6 years ago

Yes, you'll have to change the query. There are instructions for constructing a query here, with more links in the menu on the left. Feel free to experiment with it to figure out what to do, since the documentation is quite dry.
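
For example, a query along these lines pulls the spectroscopic redshift alongside the photometry; you can paste the SQL straight into the SkyServer search form, or run it from Python with astroquery if that's easier. Do double-check the table and column names against the schema browser, since this is only a sketch.

from astroquery.sdss import SDSS

# Example SkyServer SQL: magnitudes plus spectroscopic redshift for galaxies.
# The same SQL string can be pasted into the SkyServer web form instead.
query = """
SELECT TOP 1000
    p.u, p.g, p.r, p.i, p.z,
    s.z AS redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestObjID = p.objID
WHERE s.class = 'GALAXY'
"""

table = SDSS.query_sql(query)   # returns an astropy Table
print(table[:5])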

marin-hyatt commented 6 years ago

I was researching the k-nearest neighbors algorithm to try to familiarize myself with it. I came across a helpful article, which I put in the document, but now I have a question that may be pretty easy to answer; I just want to make sure: should I be using the regressor or the classifier? The data has a ton of noise, which discourages me from the regressor, but I don't really know how the classifier would be useful, since the data isn't really organized into categories.
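
For reference, here's roughly what I'd be running with the regressor. The data here is made up and stands in for the real magnitudes and redshifts, and the number of neighbors is just a placeholder.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy data standing in for the real features and redshifts.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + 0.2 * rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsRegressor(n_neighbors=15)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # R^2 on held-out data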