Data exploration - Githubissues

I want to summarize in one thread what I've looked at so far and what I've found. This analysis was performed with data/feature_exploration.py, with some slight modifications to generate some of the plots.

I chose to inspect the entire dataset (1371 entries), since any pattern we want to exploit should also show up in our sampled dataset. (I guess this is not technically true, as our sample of 100 data points is small and we might have sampled outliers, but what can you do.)

First, we can scatter plot the data to get a sense of how well the features separate it.

From here, "variance" seems to be our "principal component", in the loose sense that it is the single feature along which the data separates best. We can also generate these scatter plots with 3 features at once.
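One rough way to put a number on that visual impression is to score each feature by how far apart the two class means sit relative to the spread inside each class. The sketch below is not taken from data/feature_exploration.py. The rows, the `feature_b` name, the 0/1 labels, and the `separation_score` helper are all made up for illustration.

```python
from statistics import mean, pstdev


def separation_score(values, labels):
    """Gap between the two class means divided by the summed within-class spread."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(class0) + pstdev(class1)
    return abs(mean(class0) - mean(class1)) / spread


# Hypothetical toy rows, NOT values from the real dataset.
rows = [
    {"variance": 3.1, "feature_b": 0.2, "label": 0},
    {"variance": 2.4, "feature_b": -1.0, "label": 0},
    {"variance": 4.0, "feature_b": 1.5, "label": 0},
    {"variance": -2.9, "feature_b": 0.4, "label": 1},
    {"variance": -3.5, "feature_b": -0.7, "label": 1},
    {"variance": -1.8, "feature_b": 1.1, "label": 1},
]

labels = [row["label"] for row in rows]
scores = {
    name: separation_score([row[name] for row in rows], labels)
    for name in ("variance", "feature_b")
}

# The feature with the largest score separates the two classes best on its own.
best_feature = max(scores, key=scores.get)
print(scores, best_feature)
```

A score like this only looks at one feature at a time, so it would miss structure that shows up only when features are combined, which is what the 3D plots are for.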

This doesn't seem all that helpful on its own, but @mmirkamali pointed out that, based on the 2D scatter plots, the points that will be most difficult to classify have variance in the range [-2.5, 2]. We can then recreate the 3D scatter plot using only the data points with variance in [-2.5, 2].
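The filtering step can be sketched as below. This is only an illustration of the idea, not the code in data/feature_exploration.py. The row layout, the `feature_b` name, and the `in_hard_region` helper are assumptions, and both ends of the range are treated as inclusive.

```python
# Range quoted in the thread; both endpoints are treated as inclusive here.
VARIANCE_RANGE = (-2.5, 2.0)


def in_hard_region(row, bounds=VARIANCE_RANGE):
    """Return True when the row's variance falls inside the hard-to-classify band."""
    low, high = bounds
    return low <= row["variance"] <= high


# Hypothetical toy rows, NOT values from the real dataset.
rows = [
    {"variance": 3.1, "feature_b": 0.2},
    {"variance": -2.5, "feature_b": 1.4},
    {"variance": 0.0, "feature_b": -0.3},
    {"variance": 2.0, "feature_b": 0.9},
    {"variance": -4.2, "feature_b": 2.2},
    {"variance": 1.7, "feature_b": -1.1},
]

# Keep only the points that are hard to separate on variance alone.
hard_rows = [row for row in rows if in_hard_region(row)]
print(len(hard_rows), "of", len(rows), "rows kept")
```

The rows kept by the filter are the ones that would then go into the 3D scatter plot.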

This shows a fairly clear plane separating our data (most apparent in the first and last plots). Pretty nice! I understand this will help us pick better parameters for our classification problem, but I'm not sure what that process looks like from here. Perhaps someone can expand.
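As one possible starting point, a separating plane in three features can be turned into a decision rule with a plain perceptron, which searches for weights `w` and a bias `b` such that the sign of `w . p + b` matches the label. This is a classical stand-in chosen for illustration, not necessarily the method this project will use. The data below is synthetic (points on either side of a made-up plane), and `train_perceptron` and `predict` are hypothetical helpers.

```python
import random


def train_perceptron(points, labels, max_epochs=1000):
    """Look for w, b with sign(w . p + b) == label; labels must be +1 or -1."""
    w = [0.0, 0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for p, y in zip(points, labels):
            activation = sum(wi * pi for wi, pi in zip(w, p)) + b
            if y * activation <= 0:
                # Misclassified (or on the plane): nudge the plane toward this point's side.
                w = [wi + y * pi for wi, pi in zip(w, p)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is on the correct side
    return w, b


def predict(w, b, p):
    """Classify a point by which side of the plane it falls on."""
    return 1 if sum(wi * pi for wi, pi in zip(w, p)) + b > 0 else -1


# Synthetic stand-in data, NOT the real dataset: points on either side of the
# made-up plane x + y - z = 0, with a small gap so the classes are separable.
rng = random.Random(0)
points, labels = [], []
while len(points) < 60:
    p = [rng.uniform(-3, 3) for _ in range(3)]
    side = p[0] + p[1] - p[2]
    if abs(side) < 0.5:
        continue  # skip points too close to the plane
    points.append(p)
    labels.append(1 if side > 0 else -1)

w, b = train_perceptron(points, labels)
training_errors = sum(predict(w, b, p) != y for p, y in zip(points, labels))
print("plane normal:", w, "bias:", b, "training errors:", training_errors)
```

On data that really is separable by a plane, the loop stops once no point is misclassified. On the real data the separation is probably only approximate, so some errors would be expected and the epoch cap matters.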

natestemen / qml

Data exploration #2