A Python machine learning library based on object-oriented design principles. The goal is to let users quickly explore data and find the top candidate machine learning algorithms for a given dataset.
[x] validate that there are no NAs (I doubt sklearn allows NAs; the R package does not)
[x] allow for dynamic scaling (e.g. center-scale or normalization); it should be a parameter of the class
[x] for heatmap, allow colors to be based on either the median or the mean (in r-tools, red means a value is above the mean and blue means it is below)
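The three checklist items above could be sketched roughly as follows. This is only an illustration, not the library's actual API: the names `ClusterPrep` and `heatmap_direction` are hypothetical, and "normalization" is assumed here to mean min-max scaling to the range [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class ClusterPrep:
    """Hypothetical helper: validates input and applies a configurable scaler."""

    # Assumption: "normalize" is interpreted as min-max scaling.
    _SCALERS = {"center-scale": StandardScaler, "normalize": MinMaxScaler}

    def __init__(self, scaling="center-scale"):
        if scaling is not None and scaling not in self._SCALERS:
            raise ValueError(f"unknown scaling option: {scaling!r}")
        self.scaling = scaling  # scaling is a parameter of the class

    def transform(self, data: pd.DataFrame) -> pd.DataFrame:
        # Fail early with a clear message instead of relying on the estimator.
        if data.isna().any().any():
            raise ValueError("data contains NAs; impute or drop them first")
        if self.scaling is None:
            return data.copy()
        scaled = self._SCALERS[self.scaling]().fit_transform(data)
        return pd.DataFrame(scaled, columns=data.columns, index=data.index)


def heatmap_direction(values: pd.DataFrame, center="mean") -> pd.DataFrame:
    """Return +1 where a cell is above its column's center, -1 below, 0 if equal."""
    if center not in ("mean", "median"):
        raise ValueError("center must be 'mean' or 'median'")
    baseline = values.mean() if center == "mean" else values.median()
    return np.sign(values - baseline).astype(int)
```

A heatmap could then map +1 to red and -1 to blue, following the r-tools convention mentioned above.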
# A place for your work - create a scree plot - you will need to:
#   - Fit a kmeans model for each k from 1 to 10
#   - Obtain the score for each model (take the absolute value)
#   - Plot the score against k
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def get_kmeans_score(data, center):
    '''
    Returns the kmeans score (the SSE of points to their assigned centers).
    INPUT:
        data - the dataset you want to fit kmeans to
        center - the number of centers you want (the k value)
    OUTPUT:
        score - the SSE score for the kmeans model fit to the data
    '''
    # Instantiate kmeans
    kmeans = KMeans(n_clusters=center)
    # Then fit the model to your data using the fit method
    model = kmeans.fit(data)
    # Obtain a score related to the model fit
    # (score() returns the negative SSE, hence the absolute value)
    score = np.abs(model.score(data))
    return score

# `data` is assumed to be loaded already and to contain no NAs
scores = []
centers = list(range(1, 11))
for center in centers:
    scores.append(get_kmeans_score(data, center))

plt.plot(centers, scores, linestyle='--', marker='o', color='b');
plt.xlabel('K');
plt.ylabel('SSE');
plt.title('SSE vs. K');
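As a quick sanity check of the SSE scores, the sketch below runs the same loop on synthetic data and compares the absolute score with the fitted model's `inertia_` attribute, which holds the same sum of squared distances on the training data. The `make_blobs` dataset, `n_init=10`, and the fixed `random_state` values are assumptions chosen for reproducibility, not part of the original exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for `data`: 300 points around 4 centers (illustration only).
data, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

centers = list(range(1, 11))
scores = []
inertias = []
for k in centers:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    scores.append(abs(model.score(data)))  # score() is the negative SSE
    inertias.append(model.inertia_)        # SSE stored by the fitted model

# The two lists should agree up to floating-point error.
print(np.allclose(scores, inertias, rtol=1e-6))
```

On data like this, the SSE typically drops sharply up to the true number of clusters and flattens afterwards, which is the "elbow" the scree plot is meant to reveal.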
From: Udacity -> Data Scientist Nanodegree -> Part 3 - Unsupervised Learning -> Clustering