mblohr opened this issue 9 years ago
Supervised learning. Let's say we have 100 connectomes. 50 are male, 50 are female. (The data is labeled). We want our machine learning algorithm to be able to recognize male connectomes from female connectomes.
Let's split the 100 connectomes into two groups: training data and test data. We'll use the training data to fit a model, which we then want to accurately predict the gender of the connectomes in the held-out test group.
Maybe there's some graph metric that gives us insight into the gender of a connectome. In supervised learning, we could let the algorithm use this graph metric on the training data and see whether it can find a line that separates the male brains from the female brains. (Look at all the pretty pictures in the Wikipedia article on support vector machines for an illustration: http://en.wikipedia.org/wiki/Support_vector_machine.)
Once we've trained the algorithm on those 50, we give it the other 50 and see whether it classifies them correctly, to whatever tolerance we want.
In supervised learning, we have a machine learning algorithm (see http://en.wikipedia.org/wiki/Supervised_learning for a list of algorithms) that tries to find a way to classify new data based on the labels it saw during training. Different algorithms suit different tasks and will decompose the training data in different ways, but I'm not knowledgeable enough in machine learning to give any insight into their pros and cons.
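To make the workflow above concrete, here's a toy sketch in Python. Everything here is made up for illustration: the single scalar "graph metric" per connectome, the synthetic numbers, and the nearest-class-mean classifier (a deliberately simple stand-in for a real learner like an SVM). Half the labeled data is used for training, the other half is held out for testing:

```python
import random

random.seed(0)

# Hypothetical data: one scalar "graph metric" per connectome.
# We assume, purely for illustration, that the metric tends to be
# lower for male connectomes than for female ones.
males = [("M", random.gauss(0.40, 0.05)) for _ in range(50)]
females = [("F", random.gauss(0.60, 0.05)) for _ in range(50)]

data = males + females
random.shuffle(data)
train, test = data[:50], data[50:]  # 50 for training, 50 held out

# Train: learn the mean metric value per class from the labeled data.
def class_mean(examples, label):
    vals = [x for (lab, x) in examples if lab == label]
    return sum(vals) / len(vals)

mean_m = class_mean(train, "M")
mean_f = class_mean(train, "F")

# Predict: assign each held-out connectome to the nearer class mean.
def predict(x):
    return "M" if abs(x - mean_m) < abs(x - mean_f) else "F"

correct = sum(1 for (lab, x) in test if predict(x) == lab)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The key point is the split: the model only ever sees the labels of the first 50, and we judge it by how well it labels the 50 it never saw.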
Unsupervised learning. This is where stuff like k-means comes in. We don't have labels, so we can't "tell" the algorithm what to look for. Instead, it finds some sort of structure or pattern in the data on its own. For example, you could have it treat the most connected nodes as cluster centers.
Let's say we want to cluster nodes in a single connectome. What we really want is to have nodes that have lots of connections with each other to all belong to the same cluster. Maybe in this particular case we can consider the clusters to be brain regions or something. When our unsupervised learning algorithm places nodes into a cluster, it's basically saying that they're similar.
While in the supervised case we could check whether the algorithm assigned the data correctly, in the unsupervised case we have no ground truth that tells us whether a clustering is "correct."
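Here's a minimal k-means sketch in pure Python showing the two alternating steps (assign each point to its nearest center, then move each center to the mean of its points). The 2-D points are invented stand-ins for per-node features such as a couple of graph metrics:

```python
import random

random.seed(1)

# Hypothetical 2-D feature vectors, one per node; two loose groups.
points = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)]
          + [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(30)])

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters

centers, clusters = kmeans(points, k=2)
print([len(c) for c in clusters])
```

Note that we had to choose k = 2 ourselves; the algorithm never learns how many clusters "should" exist, which is part of why there's no built-in notion of a correct answer here.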
Semi-supervised learning apparently is a thing. http://en.wikipedia.org/wiki/Semi-supervised_learning
From what I can tell, a small amount of the training data is labeled and the rest is unlabeled. Apparently even that small amount of labeled data makes the algorithm much better at classifying than if all the data were unlabeled, since the labels give it at least a rough idea of where things should go.
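One common semi-supervised approach (one of several; the Wikipedia page lists others) is self-training: fit a model on the few labeled examples, use it to pseudo-label the unlabeled ones, and refit on everything. A toy sketch, with an invented 1-D feature and only 10 labeled examples out of 100:

```python
import random

random.seed(2)

# Hypothetical 1-D feature; two classes with different typical values.
def make(label, mu):
    return [(label, random.gauss(mu, 0.5)) for _ in range(50)]

a, b = make("A", 0.0), make("B", 3.0)
labeled = a[:5] + b[:5]                    # the only labels we keep
unlabeled = [x for (_, x) in a[5:] + b[5:]]  # labels hidden from the learner

def fit_means(examples):
    """Class mean of the feature, per label."""
    return {lab: sum(x for (l, x) in examples if l == lab)
                 / sum(1 for (l, _) in examples if l == lab)
            for lab in ("A", "B")}

# Self-training loop: pseudo-label the unlabeled points with the current
# model, then refit the class means on the enlarged training set.
train = list(labeled)
for _ in range(5):
    means = fit_means(train)
    pseudo = [(min(means, key=lambda lab: abs(x - means[lab])), x)
              for x in unlabeled]
    train = labeled + pseudo

means = fit_means(train)
print(means)
```

The 10 real labels anchor the two classes; the 90 unlabeled points then sharpen the estimate of each class's mean, which is the intuition behind why a little labeled data goes a long way.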
To complicate things even further, there's also such a thing as soft supervision, in which, instead of hard labels for the training examples, the learner is given a likelihood of each example belonging to each class/label. http://gradworks.umi.com/33/77/3377083.html
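A tiny sketch of what training on soft labels can look like (the probabilities and metric values are invented): instead of counting each example toward exactly one class, weight its contribution to each class's mean by its label probability:

```python
# Hypothetical soft-labeled data: each training example carries a
# probability of belonging to class "M" rather than a hard label.
data = [
    (0.9, 0.42),  # (P(class == "M"), graph-metric value)
    (0.8, 0.38),
    (0.2, 0.61),
    (0.1, 0.64),
    (0.5, 0.50),  # the labeler was completely unsure about this one
]

# Probability-weighted class means: every example contributes to BOTH
# classes, in proportion to its label probabilities.
w_m = sum(p for p, _ in data)
mean_m = sum(p * x for p, x in data) / w_m
w_f = sum(1 - p for p, _ in data)
mean_f = sum((1 - p) * x for p, x in data) / w_f
print(mean_m, mean_f)
```

A hard label is just the special case where every probability is 0 or 1, so soft supervision strictly generalizes the supervised setup above.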
Can someone provide an explanation, with examples, of supervised versus unsupervised learning/clustering/classification? Is there such a thing as semi-supervised learning?