Possible Incorrect Solution?

test_idx = np.random.randint(data.shape[0], size=round(546*.2))
training_idx = np.random.randint(data.shape[0], size=round(546*.8))

From what I can tell, in Step 2, where we split the data into training and test sets, the two index arrays are drawn independently of each other, so nothing keeps them disjoint. This means that we can (and are very likely to) get overlapping training and test sets, which is incorrect! Furthermore, np.random.randint samples from the half-open range [0, data.shape[0]) with replacement (that is [0, 546) if the data has 546 rows, as the hard-coded sizes suggest), so the same index can also repeat within the test set or within the training set, and some rows may never be used at all. In the theoretical worst case we could even end up with identical test and training sets.
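To make the problem concrete, here is a small sketch that reproduces the same sampling scheme and counts the damage. It assumes 546 rows (taken from the hard-coded 546 above) and uses a seeded `np.random.default_rng` generator instead of the lab's `np.random.randint` purely so the run is repeatable; the variable names `overlap`, `unused` and `dup_train` are mine, not the lab's.

```python
import numpy as np

n_rows = 546  # assumed row count, taken from the hard-coded 546 in the lab code
rng = np.random.default_rng(0)  # seeded only so the check is repeatable

# Same scheme as the lab: two independent draws, each with replacement
test_idx = rng.integers(n_rows, size=round(n_rows * 0.2))
training_idx = rng.integers(n_rows, size=round(n_rows * 0.8))

# Indices that appear in both the test and the training set
overlap = np.intersect1d(test_idx, training_idx)
# Rows that were never drawn into either set
unused = np.setdiff1d(np.arange(n_rows), np.union1d(test_idx, training_idx))
# Number of repeated draws inside the training set
dup_train = training_idx.size - np.unique(training_idx).size

print("shared indices:", overlap.size)
print("rows never used:", unused.size)
print("repeated training draws:", dup_train)
```

With this many draws from so few rows, all three counts should come out well above zero.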

Quickest fix would be to replace the assignments to test_idx and training_idx with:

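indices = np.arange(data.shape[0])
num_choices = round(data.shape[0]*0.2)
# Draw the test indices without replacement so no row is picked twice
test_idx = np.random.choice(indices, size=num_choices, replace=False)
# Everything that is not a test index becomes a training index
training_idx = np.setdiff1d(indices, test_idx)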

or something like that so as to remove the possibility of intersection and ensure we're using all the data!
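Another way to get the same guarantee, sketched here under my own assumptions rather than taken from the lab, is to shuffle all the row indices once and slice the result. The `data` array below is a hypothetical stand-in with 546 rows and 3 columns, and the seeded generator is only there for repeatability.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the example is repeatable
data = rng.normal(size=(546, 3))  # hypothetical stand-in for the lab's data

# Shuffle every row index exactly once, then cut the shuffled array in two
shuffled = rng.permutation(data.shape[0])
n_test = round(data.shape[0] * 0.2)
test_idx, training_idx = shuffled[:n_test], shuffled[n_test:]

# Sanity checks: the sets share no index and together cover every row
assert np.intersect1d(test_idx, training_idx).size == 0
assert np.array_equal(
    np.sort(np.concatenate([test_idx, training_idx])),
    np.arange(data.shape[0]),
)

test, training = data[test_idx], data[training_idx]
print(test.shape, training.shape)
```

Because each index appears exactly once in the permutation, the two slices cannot overlap and nothing is left out, so there is no need for a separate membership test.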

learn-co-curriculum / dsc-2-13-15-linalg-regression-lab
