Unsupervised learning - Githubissues

shuijian-xu commented 5 years ago

Unsupervised ML, by contrast, doesn’t use labeled data.

shuijian-xu commented 5 years ago

Image or face recognition is a good example of unsupervised ML in action. A feature vector describing the object or the face is produced, and the model needs to identify which known face or known object is the best match. If no good match is found, the new object or new face can be added to the model in real time, and future objects can be matched against this one. This is an example of unsupervised ML because the objects being recognized are not labeled, the ML model decides by itself that an object is new and should be added, and it determines by itself how to recognize this object.

shuijian-xu commented 5 years ago

Continuous: Real-Time Clustering

Clustering is the task of grouping a set of data points into groups (called clusters) such that each cluster consists of data points that are similar to one another. A distance function determines the similarity between data points.

You can use the centroid-based clustering (k-means) technique to learn the center position of each cluster, called a centroid. To classify a data point, you compare it to each centroid and assign it to the nearest (most similar) one. When a data point is matched to a centroid, the centroid moves slightly toward the new data point. Applied continuously, this technique updates the clusters in real time, so the model always reflects the newest data points.
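The assign-then-nudge step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed `learning_rate`, the function names, and the in-memory dictionary of centroids are hypothetical and do not come from the original text, which only says that the centroid "moves slightly" toward the new point.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_and_update(centroids, point, learning_rate=0.1):
    """Assign `point` to its nearest centroid, then move that centroid
    a fraction of the way toward the point.

    centroids: dict mapping cluster name -> list of floats (updated in place).
    learning_rate: assumed fixed step size; the source does not specify one.
    Returns the name of the matched cluster.
    """
    nearest = min(
        centroids,
        key=lambda name: euclidean_distance(centroids[name], point),
    )
    old = centroids[nearest]
    centroids[nearest] = [c + learning_rate * (p - c) for c, p in zip(old, point)]
    return nearest


# Two hypothetical clusters and one incoming data point.
centroids = {"cluster_1": [0.0, 0.0], "cluster_2": [10.0, 10.0]}
matched = assign_and_update(centroids, [1.0, 2.0])
print(matched, centroids[matched])
```

A common alternative to a fixed step is to weight each update by one over the number of points the cluster has absorbed so far, which makes the centroid the running mean of its points; which choice fits better depends on how quickly old data should be forgotten.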

An example application of real-time clustering is customized content recommendation, such as news articles or videos. Users and their viewing histories are grouped into clusters. Recommendations can be given to a user by inspecting which videos others in the user's cluster have watched that the user has not yet watched. Updating the clusters in real time allows these applications to react to changes in trends and provide up-to-date recommendations. The following query finds the cluster whose centroid is closest to a given vector:

SELECT cluster_name,
       euclidean_distance(vector, json_array_pack('[1, 3, 5, ...]')) AS distance
FROM clusters
ORDER BY distance ASC
LIMIT 1;

+--------------+--------------------+
| cluster_name | distance           |
+--------------+--------------------+
| cluster_1    | 2.8284271247461903 |
+--------------+--------------------+
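Once a user's cluster is known, the recommendation step described earlier (videos the cluster watched that the user has not) might look roughly like the Python sketch below. The data layout, the function name `recommend`, and the ranking by how many cluster members watched each video are all assumptions made for illustration; the original text does not say how candidates are ordered.

```python
from collections import Counter


def recommend(user, cluster_members, history, limit=3):
    """Suggest videos watched by other members of the user's cluster
    that the user has not watched yet.

    history: dict mapping user name -> set of watched video ids.
    Ranking by popularity within the cluster is an assumed heuristic.
    """
    seen = history.get(user, set())
    counts = Counter()
    for member in cluster_members:
        if member == user:
            continue
        for video in history.get(member, set()):
            if video not in seen:
                counts[video] += 1
    # Most-watched first; ties broken by video id for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [video for video, _ in ranked[:limit]]


# Hypothetical viewing histories for three users in the same cluster.
history = {
    "alice": {"v1", "v2"},
    "bob": {"v2", "v3"},
    "carol": {"v3", "v4"},
}
suggestions = recommend("alice", ["alice", "bob", "carol"], history)
print(suggestions)
```

Because the histories are read at call time, rerunning `recommend` after the clusters or histories change yields updated suggestions without any separate retraining step.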

shuijian-xu commented 5 years ago

Categorical: Real-Time Unsupervised Classification with Neural Networks

Neural networks are a powerful tool for solving classification problems. They are composed of neurons, divided into layers. Typically, a neural network consists of an input layer, one or more hidden layers, and an output layer. Each layer uses the previous layer as its input. The output layer contains one neuron per possible category, and the value of each output neuron indicates whether an input belongs to that category.

Image recognition is a good example of an application for neural networks. In those networks, the pixels of the image are fed into the neural network, and there is one output neuron for each object to be recognized. In the context of real-time ML, neural networks can be used to classify data points in real time. Neurons can also be added to the output layer in real time, in the context of unsupervised learning, to give the network the capacity to recognize an object it has not seen before, as illustrated in Figure 7-2.
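To make the layer structure concrete, here is a deliberately tiny Python sketch of a forward pass in which each output neuron is stored as one weight vector and a new output neuron is added afterward. Everything in it is an assumption for illustration: the weights are made up, biases are omitted, ReLU is chosen arbitrarily as the hidden activation, and the labels are hypothetical.

```python
def relu(x):
    """Rectified linear activation (an assumed choice for hidden layers)."""
    return max(0.0, x)


def dense(inputs, weights, activation):
    """Evaluate one layer; each row of `weights` is one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)))
        for row in weights
    ]


def forward(pixels, hidden_layers, output_neurons):
    """Feed pixels through the hidden layers, then score every category.

    output_neurons: dict mapping category label -> weight vector,
    i.e. one vector per output neuron.
    """
    activations = pixels
    for weights in hidden_layers:
        activations = dense(activations, weights, relu)
    return {
        label: sum(w * a for w, a in zip(vector, activations))
        for label, vector in output_neurons.items()
    }


# Toy network: 2 input pixels, one hidden layer of 2 neurons, 2 categories.
hidden_layers = [[[1.0, 0.0], [0.0, 2.0]]]
output_neurons = {"cat": [1.0, 0.0], "dog": [0.0, 0.5]}
scores = forward([1.0, 0.5], hidden_layers, output_neurons)

# Adding an output neuron gives the network a new category to recognize
# without touching the hidden layers.
output_neurons["bird"] = [0.5, 0.5]
scores_after = forward([1.0, 0.5], hidden_layers, output_neurons)
print(scores, scores_after)
```

Keeping the output neurons in a mapping rather than a fixed-size matrix is what makes the late addition cheap here: the existing categories and their scores are unaffected by the new entry.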

The neurons of the output layer can be stored as vectors (one row/vector per output neuron) in a table. The following example illustrates how to evaluate the output layer of a neural network in a real-time database for facial recognition. We need to send a query to the database to determine which neurons in the output layer have a high output. The neurons with a high output determine which faces would match the input.

In the code example that follows, the output of the last hidden layer of the facial recognition neural network is passed in as a vector using json_array_pack:

SELECT dot_product(neuron_vector, json_array_pack('[0, 0.4, 0.6, 0, ...]')) AS dot, id
FROM neurons
HAVING dot > 0.75;

You can use publicly available neural networks for facial recognition (one such example can be found at http://www.robots.ox.ac.uk/~vgg/software/vgg_face/).

Storing the neurons in a database has a strong advantage: the model can be updated incrementally and transactionally while it is being evaluated. If no good match is found in the preceding example, the following code illustrates how to insert a neuron into the output layer so that the pattern is recognized if it appears again:

INSERT INTO neurons (id, neuron_vector)
SELECT 'descriptor for this face', json_array_pack('[0, 0.4, 0.6, 0, ...]');
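The query-then-insert flow can be mirrored outside the database with a short Python sketch. It is not the database implementation: a plain dictionary stands in for the `neurons` table, the function `match_or_add` and the ids are hypothetical, and the feature vectors are assumed to be normalized to unit length so that a stored vector scores about 1.0 against itself. The 0.75 threshold is the one used in the query above.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_or_add(neurons, features, new_id, threshold=0.75):
    """Return ids of output neurons whose dot product with `features`
    exceeds the threshold; if none do, store `features` as a new neuron.

    neurons: dict mapping id -> weight vector (stand-in for the table).
    """
    matches = [
        neuron_id
        for neuron_id, vector in neurons.items()
        if dot_product(vector, features) > threshold
    ]
    if not matches:
        neurons[new_id] = list(features)
    return matches


# One known face, then an unseen face presented twice.
neurons = {"face_a": [1.0, 0.0, 0.0]}
unseen = [0.0, 0.6, 0.8]  # unit-length by construction (assumption)

first = match_or_add(neurons, unseen, new_id="face_b")
second = match_or_add(neurons, unseen, new_id="face_c")
print(first, second, sorted(neurons))
```

The first call finds no match and stores the vector; the second call matches the neuron that the first call added, so no further neuron is created.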

If the algorithm were to see the face again, it would know that it has seen it before. This example demonstrates the role that infrastructure plays in allowing ML applications to operate in real time. There are countless ways in which you might implement a neural network for facial recognition, or any of the ML techniques discussed in this section. The examples have focused on pushing computation to a database in order to take advantage of distributed computation, data locality, and the power of SQL. The next chapter discusses considerations for building your application and data analytics stack on top of a distributed database.

shuijian-xu / hive

Unsupervised learning #3