Some ideas:

- Find out how to make a better linear regression. This might include using filtered data or using a different algorithm.
I have a problem that I don't know how to fix, and I don't know why it's happening. The second cell is failing in both of my notebooks, which shouldn't be happening because it has been working the whole time. I tried a few more algorithms, like k-nearest neighbors, Gaussian regression, and support vector regression, but none of them worked better than the linear regression, so I gave up on that and started to remove the outliers. However, I couldn't run the cell because the program failed to read the file! Do you have any idea what could be happening?
Hmmm, it looks like there are a few random characters thrown in on line 10 of the .csv file. Maybe some text meant for another window ended up there at some point? I think manually removing the letters should fix the problem, and we'll investigate further if it happens again. I'm really looking forward to seeing your new results!
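In case it's useful next time, a quick check like this sketch makes stray characters visible before pandas chokes on them (the filename here is just a placeholder for your input file):

```python
# Print the raw contents of the suspect line; repr() exposes any
# stray or invisible characters that snuck in.
with open("data.csv") as f:  # placeholder name for the notebook's .csv
    for lineno, line in enumerate(f, start=1):
        if lineno == 10:
            print(repr(line))
            break
```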
Hi, I'm a little stumped because I tried ridge regression and it didn't go so well. I've tried a ton of algorithms, and they don't seem to fit very well even when it seems like they should. Is there a factor I'm not considering?
Heh, almost certainly! Each technique makes assumptions, and not every dataset is compatible with any given approach -- that's why there are so many methods out there; it's not a one-size-fits-all situation. Let's go through the ways in which each algorithm fails so we can build intuition for their strengths and weaknesses, and see what about this dataset might be problematic or how we can adapt to make them work better.
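For example, a cross-validation loop along these lines makes it easy to compare a handful of scikit-learn models side by side -- just a sketch, with random placeholder data standing in for your features and redshifts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Placeholder data standing in for the real features and redshifts.
X = np.random.rand(200, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * np.random.randn(200)

# Score each candidate model with 5-fold cross-validation.
for model in [LinearRegression(), Ridge(alpha=1.0),
              KNeighborsRegressor(n_neighbors=5), SVR()]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{type(model).__name__}: mean R^2 = {scores.mean():.3f}")
```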
I'm a little confused, because the linear regression function isn't creating an individual line for each graph. It's just fitting a line to the first graph and plotting the same line for the rest.
```python
import numpy as np
import matplotlib.pyplot as plt

nm_x_names = ['a', 'b', 'c', 'd']
chi_square_error_list_2 = []
for i in range(len(nm_x_names)):
    name = nm_x_names[i]
    # Scatter the data for this component against redshift.
    plt.scatter(nm_df[name], y, color='black', s=1, alpha=0.5)
    # Overlay the fitted line from the gridded x vectors.
    plt.plot(np_x_for_graph_2.T[i], y_for_graph_2, color='blue', linewidth=2)
    plt.xlabel(name)
    plt.ylabel("Redshift")
    plt.show()
    # Note: nothing in this expression depends on i, so it prints
    # the same chi-square error on every pass through the loop.
    print(chi_square_error(y, np.dot(coefficients_2, A_2)))
```
I don't know how to attach the output, but it's all the same line and the same chi-square error, which makes no sense.
EDIT: I fixed this problem, but now the lines are inaccurate. I think this is the same problem I had with scikit-learn.
Eureka, I figured out why the projected lines don't look right! To prepare for Tuesday, could you make plots with all pairs of x components showing the points for the data as well as the lines from the gridded x vectors we used for plotting? And how do you feel about mathematical proofs?
That's awesome! What exactly do you mean by pairs of x components? Also, are you saying that I should keep the lines from one of my previous graphs using the x vectors? Sorry, I'm a little confused. Anyway, I'm comfortable with proofs. I'm not sure I could understand a complex proof on the first go or write my own, but I've worked with them before.
Ah, by pairs of x components, I mean a vs. b, a vs. c, a vs. d, b vs. c, b vs. d, and c vs. d, in your naming convention, as well as the corresponding components of the linear gridded points made for plotting (x_for_plot, or something like that).
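Something along these lines is what I have in mind -- a sketch only, with fake data standing in for your nm_df and np_x_for_graph_2:

```python
from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

nm_x_names = ['a', 'b', 'c', 'd']
# Fake stand-ins for the notebook's data and gridded plotting vectors.
nm_df = pd.DataFrame(np.random.rand(500, 4), columns=nm_x_names)
np_x_for_graph_2 = np.tile(np.linspace(0, 1, 100)[:, None], (1, 4))

# One plot per unordered pair of components: data points plus the
# corresponding components of the gridded x vectors.
for i, j in combinations(range(len(nm_x_names)), 2):
    plt.scatter(nm_df[nm_x_names[i]], nm_df[nm_x_names[j]],
                color='black', s=1, alpha=0.5)
    plt.plot(np_x_for_graph_2.T[i], np_x_for_graph_2.T[j],
             color='blue', linewidth=2)
    plt.xlabel(nm_x_names[i])
    plt.ylabel(nm_x_names[j])
    plt.show()
```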
OK, I committed something. Do you mind checking it over to make sure I plotted the right data? But anyway, the graphs certainly look interesting!
I've hit a small obstacle. I don't know how to input information to the SDSS website to get the redshift into my data. I forgot to write down exactly how I did it last time, and the website doesn't have a clear button for it or anything. Do I have to change the SQL query?
Yes, you'll have to change the query. There are instructions for constructing a query here, with more links in the menu on the left. Feel free to experiment with it to figure out what to do, since the documentation is quite dry.
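As a rough guess at the kind of query involved (the table and column names follow the standard SkyServer schema, where SpecObj.z is the spectroscopic redshift; astroquery is just one way to submit it, and the website's SQL search form works too):

```python
from astroquery.sdss import SDSS

# bestObjID links each spectrum to its photometric object, so this
# pulls magnitudes alongside the redshift.
query = """
SELECT TOP 100 p.u, p.g, p.r, p.i, s.z AS redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestObjID = p.objID
"""
results = SDSS.query_sql(query)  # returns an astropy Table
print(results)
```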
I was researching the k-nearest neighbors algorithm to try to familiarize myself with it. I came across a helpful article, which I put in the document, but now I have a question that may be pretty easy to answer; I just want to make sure. Should I be using the regressor or the classifier? The data has a ton of noise, which discourages me from the regressor, but I don't really know how the classifier would be useful, since the data isn't really organized into categories.
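For reference, here's the distinction as I understand it in scikit-learn, with made-up data -- the regressor predicts continuous values, while the classifier expects discrete labels:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.random.rand(100, 4)          # made-up features
z = np.random.rand(100)             # continuous target, like redshift
labels = (z > 0.5).astype(int)      # discrete labels the classifier expects

reg = KNeighborsRegressor(n_neighbors=5).fit(X, z)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
print(reg.predict(X[:3]), clf.predict(X[:3]))
```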