[Chris Dyer, Yannis Assael, Brendan Shillingford]
TED stands for “Technology, Entertainment, and Design”. Each talk in the corpus is labeled with a series of open labels by annotators, including the labels “technology”, “entertainment”, and “design”. Some talks are about more than one of these, and about half aren’t labeled as being about any of them! In this assignment, you will build a text classification model that predicts whether a talk is about technology, entertainment, or design, or none of these.
This is an instance of what is called “multi-label classification” (MLC), where each instance may have many different labels. However, we will start off by converting it into an instance of multi-class classification, where each document receives a single label from a finite discrete set of possible labels.
To answer the following questions you are allowed to use any machine learning framework of your choice. The practical demonstrators can provide help for:
Other suggested frameworks: CNTK, Torch, Caffe.
You should reserve the first 1585 documents of the TED talks dataset for training, the subsequent 250 for validation, and the final 250 for testing. Each document will be represented as a (text, label) pair.
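The split described above can be sketched as follows. This is a minimal illustration, assuming the corpus has already been loaded into a list of (text, label) pairs in its original order; the function name `split_dataset` and its parameters are hypothetical and not part of the assignment.

```python
def split_dataset(documents, n_train=1585, n_valid=250, n_test=250):
    """Split a list of (text, label) pairs into train/validation/test by position."""
    total = n_train + n_valid + n_test
    if len(documents) < total:
        raise ValueError("not enough documents for the requested split")
    train = documents[:n_train]                   # first 1585 documents
    valid = documents[n_train:n_train + n_valid]  # subsequent 250
    test = documents[n_train + n_valid:total]     # following 250
    return train, valid, test
```

Because the split is positional, the documents should not be shuffled before calling it; otherwise results will not be comparable with those of other students.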
Using the training data, you should determine what vocabulary you want for your model. A good rule of thumb is to tokenise and lowercase the text (you did this in the intro practical).
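One possible way to build the vocabulary is sketched below. The regular expression, the `min_count` threshold, and the function names are assumptions made for illustration; the tokeniser from the intro practical would work equally well.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(train_texts, min_count=1):
    """Map each word seen at least min_count times in training to an integer id."""
    counts = Counter(tok for text in train_texts for tok in tokenise(text))
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {word: index for index, word in enumerate(words)}
```

Raising `min_count` shrinks the vocabulary by dropping rare words, which reduces the number of embedding parameters to learn from a fairly small training set.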
At test time, you will encounter words that were not present in the training set (and they will therefore not have an embedding). To deal with this, map these words to a special
Each document should be labeled with a label from the set: {Too, oEo, ooD, TEo, ToD, oED, TED, ooo}. You should generate these labels from the \<keywords> tag by checking for the presence of each of the following keywords: {Technology, Entertainment, Design}.
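Label generation can be sketched as below. This assumes the keywords field is a single comma-separated string and that matching should be case-insensitive; both are assumptions about the data format, and `make_label` is a hypothetical helper name.

```python
def make_label(keywords):
    """Map a comma-separated keyword string to one of the 8 labels, e.g. 'ToD'."""
    tags = {k.strip().lower() for k in keywords.split(",")}
    pairs = (("T", "technology"), ("E", "entertainment"), ("D", "design"))
    # Keep the letter if the keyword is present, otherwise write 'o'.
    return "".join(letter if name in tags else "o" for letter, name in pairs)
```

For example, `make_label("talks, Technology, design")` gives `"ToD"`, and a talk with none of the three keywords gives `"ooo"`.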
A simple multilayer perceptron classifier operates as follows:
x = embedding(text)
h = tanh(Wx + b)
u = Vh + c
p = softmax(u)
if testing:
    prediction = argmax_{y'} p_{y'}
else:  # training, with y as the given gold label
    loss = -log(p_y)  # cross-entropy criterion
The embedding function that represents the text as a vector (x) is discussed below. W and V are appropriately sized matrices of learned parameters, and b and c are learned bias vectors. The other vectors are intermediate values.
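To make the shapes concrete, the forward pass above can be written out in plain Python. This is only a sketch for checking your understanding, assuming x is already a document vector and that matrices are stored as lists of rows; in practice your chosen framework provides these operations and computes gradients for you. All function names here are illustrative.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(u):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(u)
    exps = [math.exp(value - m) for value in u]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, W, b, V, c):
    """Return class probabilities p = softmax(V tanh(Wx + b) + c)."""
    h = [math.tanh(a + bias) for a, bias in zip(matvec(W, x), b)]
    u = [a + bias for a, bias in zip(matvec(V, h), c)]
    return softmax(u)


def predict(p):
    """Test time: index of the most probable class."""
    return max(range(len(p)), key=lambda k: p[k])


def cross_entropy(p, y):
    """Training time: negative log-probability of the gold label index y."""
    return -math.log(p[y])
```

With an embedding of size d, a hidden layer of size k, and 8 classes, W has k rows of length d and V has 8 rows of length k.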
The text embedding function converts a sequence of words into a fixed-size vector representation. Finding effective models for representing documents as vectors is an open area of research, but in general, trying a few different architectures is important since the optimal architecture depends both on the availability of data and the nature of the problem being solved.
An astoundingly simple but effective embedding model is the “bag-of-means” representation. Let each word w_i in the document (where i ranges over the N tokens) be represented by an embedding vector x_i. The bag-of-means representation is
x = (1/N) \sum_{i=1}^{N} x_i.
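The averaging step can be sketched in a few lines. This assumes `embeddings` is a dictionary from tokens to equal-length lists of floats and that every token in the document has an entry (that is, unknown words have already been mapped to something the table contains); `bag_of_means` is a hypothetical name.

```python
def bag_of_means(tokens, embeddings):
    """Average the embedding vectors of all tokens in a document."""
    if not tokens:
        raise ValueError("cannot embed an empty document")
    vectors = [embeddings[tok] for tok in tokens]  # one vector per token
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / n for d in range(dim)]
```

Note that the average runs over tokens, not over distinct words, so a word that occurs twice contributes twice.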
Word embeddings can be learned as parameters in the model (either starting from random values or starting from a pretrained word embedding model, such as word2vec or GloVe), or you can use fixed values (again, word2vec or GloVe).
A more sophisticated model uses a bidirectional RNN (/LSTM/GRU) to “read” the document (from left to right and from right to left), and then represents the document by pooling the hidden states across time (e.g., by simply taking their arithmetic average or componentwise maximum) and using that as the document vector. You can explore this next week for your practical assignment on RNNs.
You are asked to build a single-layer feed-forward neural network in your favourite framework. The network should treat the labels as 8 independent classes. We suggest Adam as the optimiser, and training should take place in batches for increased stability (e.g. 50 documents per batch).
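Batching is independent of the framework you pick, so it can be sketched on its own. The generator below is a hypothetical helper, assuming the training set fits in memory as a list; the fixed `seed` is only there to make runs repeatable.

```python
import random


def minibatches(examples, batch_size=50, seed=0):
    """Yield shuffled batches of at most batch_size examples (one epoch)."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)  # shuffle indices, leave the data untouched
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]
```

With 1585 training documents and a batch size of 50, one epoch consists of 32 batches, the last of which holds the remaining 35 documents. Each batch would be passed to one optimiser step (for example Adam, as provided by your framework).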
Try the same prediction task using a true multi-label classification (MLC) setup.
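One way to approach this, offered as an assumption rather than the prescribed solution, is to treat each of the three tags as its own binary decision, so that the target for a document is a vector of three 0/1 values instead of one of 8 classes. The helpers below (hypothetical names) convert between the two encodings.

```python
def label_to_targets(label):
    """Convert a label such as 'ToD' into binary targets [T, E, D], here [1, 0, 1]."""
    return [0 if ch == "o" else 1 for ch in label]


def targets_to_label(targets):
    """Convert binary decisions for [T, E, D] back into a label string."""
    return "".join(letter if t else "o" for letter, t in zip("TED", targets))
```

Converting predictions back with `targets_to_label` lets you compare the multi-label model against the 8-class model using the same evaluation code.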
On paper, show a practical demonstrator your response to these to get signed off.