Simplifying the use of the model to perform different tasks

This patch contains the code discussed in #24.

Its goal is to make it easier for users to apply this model to the many different tasks presented in the research paper. For example, to fine-tune the network to classify texts, you just have to create a DoubleHeadModel with a classification head and use the ClassificationLossCompute class.
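To make the idea concrete, here is a minimal, self-contained sketch of the pattern: a shared encoder with a language-modeling head and a classification head, plus a loss object that combines both objectives. It does not use the classes from this patch. `ToyDoubleHeadModel`, `ToyClassificationLoss`, the `lm_coef` weight, the embedding layer standing in for the transformer, and the choice of the last position as the classification input are all assumptions made for illustration, so the real constructor arguments of DoubleHeadModel and ClassificationLossCompute may well differ.

```python
import torch
import torch.nn as nn


class ToyDoubleHeadModel(nn.Module):
    """Hypothetical stand-in: a shared encoder feeding an LM head and a classification head."""

    def __init__(self, vocab_size, n_embd, n_class):
        super().__init__()
        # Assumption: a plain embedding layer stands in for the transformer body.
        self.encoder = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.clf_head = nn.Linear(n_embd, n_class)

    def forward(self, x):
        h = self.encoder(x)                    # (batch, seq, n_embd)
        lm_logits = self.lm_head(h)            # (batch, seq, vocab_size)
        # Assumption: classify from the hidden state at the last position.
        clf_logits = self.clf_head(h[:, -1])   # (batch, n_class)
        return lm_logits, clf_logits


class ToyClassificationLoss:
    """Hypothetical stand-in: classification loss plus a weighted auxiliary LM loss."""

    def __init__(self, lm_coef=0.5):
        self.lm_coef = lm_coef
        self.criterion = nn.CrossEntropyLoss()

    def __call__(self, x, y, lm_logits, clf_logits):
        vocab_size = lm_logits.size(-1)
        # Next-token prediction: position t predicts token t + 1.
        lm_loss = self.criterion(
            lm_logits[:, :-1].reshape(-1, vocab_size),
            x[:, 1:].reshape(-1),
        )
        clf_loss = self.criterion(clf_logits, y)
        return clf_loss + self.lm_coef * lm_loss


torch.manual_seed(0)
model = ToyDoubleHeadModel(vocab_size=50, n_embd=16, n_class=3)
compute_loss = ToyClassificationLoss(lm_coef=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Random token ids and labels, used only to exercise one training step.
x = torch.randint(0, 50, (4, 10))
y = torch.randint(0, 3, (4,))

lm_logits, clf_logits = model(x)
loss = compute_loss(x, y, lm_logits, clf_logits)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Supporting another task in this layout means swapping the task head and the matching loss object while leaving the encoder and the training loop untouched.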

The SimilarityHead has not been tested yet, and SimilarityLossCompute is missing because I don't know how this kind of task works.

huggingface / pytorch-openai-transformer-lm

Simplifying the use of the model to perform different tasks #25