learn-co-curriculum / dsc-2-13-15-linalg-regression-lab

Possible Incorrect Solution? #1

Open tmgrgg opened 5 years ago

tmgrgg commented 5 years ago
test_idx = np.random.randint(data.shape[0], size=round(546*.2))
training_idx = np.random.randint(data.shape[0], size=round(546*.8))

From what I can tell, in Step 2, when we split the data into training and test sets, the two sets of random indices are generated independently of each other, so nothing guarantees they are disjoint. This means that we can (and are likely to) get overlapping training and test sets, which is incorrect! Furthermore, np.random.randint samples from the range [0, 546) with replacement, meaning that the same row can also appear more than once within the test set or within the training set (so theoretically, in the worst possible case, the test and training sets could end up containing exactly the same rows).
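To make the problem concrete, here is a small sketch that reproduces the sampling pattern without the lab's data. It assumes 546 rows (the number hard-coded in the snippet above) and uses a seeded generator, `np.random.default_rng(0)`, purely so the run is repeatable; the lab itself calls `np.random.randint`, and the variable names below are only illustrative.

```python
import numpy as np

n_rows = 546  # assumed row count, taken from the hard-coded 546 above
rng = np.random.default_rng(0)  # seeded only to make this sketch repeatable

# Same pattern as the lab: two independent draws with replacement.
test_idx = rng.integers(0, n_rows, size=round(n_rows * 0.2))
training_idx = rng.integers(0, n_rows, size=round(n_rows * 0.8))

# Rows that ended up in both the training and the test set.
overlap = np.intersect1d(test_idx, training_idx)

# Rows drawn more than once within the training set.
n_repeated = len(training_idx) - len(np.unique(training_idx))

# Rows that were never drawn for either set.
n_unused = n_rows - len(np.union1d(test_idx, training_idx))

print("rows in both sets:", overlap.size)
print("repeated draws in training set:", n_repeated)
print("rows never used:", n_unused)
```

All three counts should come out well above zero, which is exactly the leakage and wasted data described above.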

Quickest fix would be to replace the assignments to test_idx and training_idx with:

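# All row positions, 0 .. n_rows - 1
indices = np.arange(data.shape[0])
num_choices = round(data.shape[0]*0.2)
# Draw the test rows without replacement so none is repeated
test_idx = np.random.choice(indices, size=num_choices, replace=False)
# Every remaining row goes to training, so the two sets cannot overlap
training_idx = np.setdiff1d(indices, test_idx)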

or something like that so as to remove the possibility of intersection and ensure we're using all the data!
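Another way to get the same guarantee, sketched here as one possible variant rather than the lab's official solution, is to shuffle the row indices once and slice the result. The helper name `split_indices` and its `test_fraction` and `seed` parameters are hypothetical and only exist for this illustration.

```python
import numpy as np


def split_indices(n_rows, test_fraction=0.2, seed=None):
    """Return (training_idx, test_idx) as disjoint arrays covering every row.

    Hypothetical helper for illustration; not part of the lab code.
    """
    rng = np.random.default_rng(seed)
    # A single permutation contains each row index exactly once.
    shuffled = rng.permutation(n_rows)
    n_test = round(n_rows * test_fraction)
    # The first n_test shuffled indices form the test set, the rest training.
    return shuffled[n_test:], shuffled[:n_test]


# Usage with the row count from this lab (546 is assumed from the snippet above).
training_idx, test_idx = split_indices(546, test_fraction=0.2, seed=0)
print(len(training_idx), len(test_idx))
```

Because both index arrays are slices of one permutation, they cannot share a row, and together they cover every row exactly once.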

Redaisy commented 5 years ago

Yeah, I noticed this too! Glad I'm not crazy.