joseph-g25 / SP22-Project3-WineRedux

Second version of a machine-learning based wine quality classifier.

Assignment description #3

Open joseph-g25 opened 2 years ago

joseph-g25 commented 2 years ago

From Dr. Yarnall's Canvas assignment:

Do the following:

    Think about the wine data, and select a few (3-5, maybe; experiment) features.
    Perform KMeans to cluster the data using only these features as input to the clustering algorithm. Use the silhouette score to find the best number of clusters. 
    Create a new dataset; drop the columns you selected and add a collection of new columns -- each instance should get new features recording its distance to each of the centroids in your clustering (so, you'll get one new feature per cluster).
    Now train a classifier. Does the classifier work better, or worse?
    Again, the goal here is to do this work properly; I'm concerned with the technique, not the outcome.
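
The investigation steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses scikit-learn's bundled `load_wine` data as a stand-in for the course's wine-quality file, the three selected columns are an arbitrary, hypothetical choice, and the range of cluster counts tried (2 to 8) is a guess rather than anything the assignment specifies. For brevity it clusters the full dataset; in the real investigation you would likely fit on the training split only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): NOT the wine-quality file used in the course.
X = load_wine().data

# Hypothetical feature choice: alcohol, malic_acid, color_intensity.
selected_idx = [0, 1, 9]

# KMeans is distance-based, so scale the selected features first.
X_sel = StandardScaler().fit_transform(X[:, selected_idx])

# Score each candidate number of clusters with the silhouette score.
scores = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X_sel)
    scores[k] = silhouette_score(X_sel, labels)

# The best k is the one with the highest silhouette score.
best_k = max(scores, key=scores.get)

# KMeans.transform returns each instance's distance to every centroid,
# i.e. one new feature per cluster.
best_km = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_sel)
distances = best_km.transform(X_sel)

# New dataset: drop the selected columns, append the distance columns.
X_new = np.hstack([np.delete(X, selected_idx, axis=1), distances])

print("silhouette scores:", {k: round(float(s), 3) for k, s in scores.items()})
print("best k:", best_k, "| new shape:", X_new.shape)
```

Printing the full `scores` dictionary, rather than only `best_k`, makes it easier to see whether the winner is clearly better or only marginally so.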

In an initial script, you should pick your features to replace and then run the code needed to figure out how many clusters to use.

Your final script should:

    Load the data and split off test and train sets.
    Create a pipeline that:
        does any preliminary data preparation (don't forget to scale);
        clusters (using the number of clusters you determined during your investigation stage). Note: you'll want to use a ColumnTransformer so that the output of KMeans (the distances) replaces only the selected features;
        does your classification.
    Fit your transformer to the training data.
    Test your model on the test data.
    Remember to compare to a classifier built without the clustering.

If you want, you can try another experiment -- first, apply PCA and select a handful of principal components. Then, do the clustering in THAT space. This is a common technique when the number of dimensions is quite high.
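
One possible shape for the final script is sketched below. It is not the assignment's reference solution. The assumptions: `load_wine` again stands in for the real wine-quality data, the selected column indices and `n_clusters = 3` are placeholders for whatever the investigation script produced, and `LogisticRegression` is an arbitrary choice of classifier. The KMeans branch relies on the fact that `KMeans.transform` returns distances to the centroids, so inside a `ColumnTransformer` it replaces the selected columns with those distances.

```python
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): NOT the wine-quality file used in the course.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

selected_idx = [0, 1, 9]  # hypothetical features to replace
n_clusters = 3            # placeholder for the value found during investigation

# Scale the selected columns, then turn them into centroid distances.
cluster_branch = Pipeline([
    ("scale", StandardScaler()),
    ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=0)),
])

# Selected columns go through the cluster branch; all others are only scaled.
features = ColumnTransformer(
    [("cluster", cluster_branch, selected_idx)],
    remainder=StandardScaler(),
)

clustered_model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Baseline for comparison: same classifier, no clustering step.
baseline_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fit on the training data only, then evaluate on the held-out test data.
clustered_acc = clustered_model.fit(X_train, y_train).score(X_test, y_test)
baseline_acc = baseline_model.fit(X_train, y_train).score(X_test, y_test)

print(f"with clustering:    {clustered_acc:.3f}")
print(f"without clustering: {baseline_acc:.3f}")
```

Because the scaler and KMeans live inside the pipeline, they are fit only on `X_train`, which avoids leaking test-set information into the centroids. For the optional PCA experiment, a `PCA` step could presumably be placed before `"kmeans"` in `cluster_branch`, applied to whichever columns you want to reduce.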

joseph-g25 commented 2 years ago

Key objectives:

Experiment with KMeans clustering and with dropping various features.

Create a final pipeline that:

    Includes data preparation steps
    Includes a clustering algorithm (use ColumnTransformer to replace selected features)
    Fits the transformer to the training data
    Tests the model on the test data

Compare to a model without the clustering algorithm.