natestemen / qml

Quantum Machine Learning group project ⚛️ 🤖

Data exploration #2

Closed natestemen closed 2 years ago

natestemen commented 2 years ago

I want to summarize, in one thread, what I've looked at so far and what I've found. This analysis was performed with data/feature_exploration.py, plus some slight modifications to generate some of the plots.

I chose to inspect the entire dataset (1371 entries), since any pattern we want to exploit should also be found in our sampled dataset. (I guess this is not technically true, as our sample of 100 data points is small and we might have sampled outliers, but what can you do.)

First, we can scatter plot the data to get a sense of how the classes separate across the features.

From these plots, "variance" seems to be our "principal component". We can also generate these scatter plots with 3 features at once.

This doesn't seem all that helpful on its own, but @mmirkamali pointed out that, based on the 2D scatter plots, the points that will be most difficult to classify have variance in the range [-2.5, 2]. We can then recreate the same 3D scatter plots using only the data points with variance in [-2.5, 2].
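The variance filter can be sketched roughly as follows. This is a minimal sketch, not the code in data/feature_exploration.py: the row layout (variance first, then two other features, then a class label), the function name `filter_by_variance`, and all of the values are assumptions made up for illustration.

```python
# Hypothetical rows: (variance, other_feature_1, other_feature_2, label).
# The values are made up for illustration, not taken from the project dataset.
rows = [
    (-4.1, 7.9, -2.0, 1),
    (-2.0, 3.1, 0.5, 1),
    (0.3, -1.2, 2.4, 0),
    (1.9, 5.0, -0.7, 0),
    (3.6, 8.7, -2.8, 0),
]


def filter_by_variance(rows, low=-2.5, high=2.0):
    """Keep rows whose first column (assumed to be variance) lies in [low, high]."""
    return [row for row in rows if low <= row[0] <= high]


# Only the rows in the hard-to-classify band remain for the 3D scatter plot.
ambiguous = filter_by_variance(rows)
print(len(ambiguous), "of", len(rows), "rows fall inside the variance band")
```

The bounds are keyword arguments so that the band can be widened or narrowed without touching the filtering logic.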

This shows there is a pretty clear 2D plane (most apparent in the first and last plots) that separates our data. Pretty nice! I understand this will help us pick better parameters for our classification problem, but I'm not sure what that process looks like from here. Perhaps someone can expand.
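To make the idea of a separating plane concrete, here is a small sketch using a classical perceptron. This is a stand-in technique, not something the thread settled on and not the project's quantum classifier. The points are made up, and the names `train_perceptron` and `side` are hypothetical.

```python
def train_perceptron(points, labels, epochs=100):
    """Fit weights w and bias b so that sign(w . x + b) matches labels in {-1, +1}."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return w, b


def side(w, b, x):
    """Return +1 or -1 depending on which side of the plane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Made-up, linearly separable 3D points standing in for three features.
points = [
    (1.0, 2.0, 0.5), (2.0, 0.5, 1.0), (0.5, 1.5, 2.0),
    (-1.0, -2.0, -0.5), (-2.0, -0.5, -1.0), (-0.5, -1.5, -2.0),
]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(points, labels)
print("plane normal:", w, "offset:", b)
```

The plane is the set of points where `w . x + b = 0`. If the real data is only approximately separable, the loop simply stops after `epochs` passes without reaching zero mistakes.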

mmirkamali commented 2 years ago

Nice job Nate! The 3D plots with clear separation are very promising: they suggest the data can be classified given a large enough training set.

I noticed that with all data points, the variance of the two classes overlaps in the range (-3, 2.5). The previous range was from the sample of size 100, I think. It might not affect the result very much, and it is also easy to fix in your code ... We are going to use 100 data points anyway, so it should be fine.
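One way to compute this overlap range directly, rather than reading it off the plots, is sketched below. The function name `variance_overlap` is hypothetical, and the values are made up, chosen only so that the result mirrors the (-3, 2.5) range mentioned above.

```python
def variance_overlap(variances, labels):
    """Return (low, high) of the interval where both classes occur, or None."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    first = [v for v, y in zip(variances, labels) if y == classes[0]]
    second = [v for v, y in zip(variances, labels) if y == classes[1]]
    low = max(min(first), min(second))   # larger of the two minima
    high = min(max(first), max(second))  # smaller of the two maxima
    return (low, high) if low <= high else None


# Made-up variance values and class labels, not taken from the real dataset.
variances = [-5.0, -3.0, 1.0, 0.5, 4.0, 2.5, -1.0, 6.0]
labels = [1, 0, 0, 1, 0, 1, 1, 0]

overlap = variance_overlap(variances, labels)
print(overlap)
```

Running the same function on the full dataset and on the 100-point sample would show how much the range shifts between the two.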