huggingface / pytorch-openai-transformer-lm

🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI
MIT License

How should one modify the code to successfully run text classification? #43

Closed davidefiocco closed 5 years ago

davidefiocco commented 6 years ago

Hi,

I am new to PyTorch (but still more at ease with it than TF), so I thought I would experiment with @thomwolf's implementation in this repo (thanks for sharing it!!)

I would like to try out the code to perform binary classification of text snippets, similar to classification tasks such as the Corpus of Linguistic Acceptability (CoLA) and the Stanford Sentiment Treebank (SST-2) in the original reference.

These are the steps that I think are needed to get the code working (but I am not sure that these are correct and/or exhaustive):

  1. Create two sets snippets_val.csv and snippets_test.csv containing two columns, text (string) and class (an int equal to 0 or 1).
  2. In datasets.py, create two new functions (a minimal sketch is given after this list):
    • _snippets, returning two lists st, y, and
    • snippets, defined with different values of n_train and n_valid and whose return statement looks like return (trX, trY), (vaX, vaY), (teX, )
  3. In train.py, rewrite transform_roc into a transform_snippet that doesn't use [delimiter] and takes only one argument as input <- this part is somewhat tricky for me; can anyone provide some guidance?
  4. In train.py, in the encoding bit and afterwards:
  5. In train.py:
  6. In analysis.py:
    • create a new function snippets that invokes _snippets (from datasets.py) to read in snippets_test.csv, adjusting the call to _snippets to account for the fact that it outputs two lists (not 4)
  7. Modify the imports in train.py consistently with all of the above.
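
For step 2, here is a minimal sketch of what the two datasets.py additions could look like, modeled on the existing _rocstories/rocstories pair (the file names, column order, split sizes and the use of train_test_split are assumptions, not tested code):

import csv
import os

from sklearn.model_selection import train_test_split

SEED = 3535999445  # a fixed seed for a reproducible train/validation split


def _snippets(path):
    # Read a two-column csv (text, class) and return parallel lists of
    # snippet strings and integer labels.
    st, y = [], []
    with open(path, encoding='utf_8') as f:
        reader = csv.reader(f)
        for i, line in enumerate(reader):
            if i > 0:  # skip the header row
                st.append(line[0])
                y.append(int(line[1]))
    return st, y


def snippets(data_dir, n_train=1400, n_valid=200):
    # Split the "val" file into train/validation and keep the test texts aside;
    # n_train is kept only for signature parity with rocstories().
    texts, ys = _snippets(os.path.join(data_dir, 'snippets_val.csv'))
    teX, _ = _snippets(os.path.join(data_dir, 'snippets_test.csv'))
    trX, vaX, trY, vaY = train_test_split(
        texts, ys, test_size=n_valid, random_state=SEED)
    return (trX, trY), (vaX, vaY), (teX,)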

Does all of the above make sense as a plan, or can somebody fill in the missing bits or provide an alternative list of "sub-steps"? Also, can someone provide some guidance on how to rewrite transform_roc? (Comments on the original code would be fantastic; I would be glad to annotate the original function and contribute to the repo as a result of this!)

Thanks to anyone patiently reading this!

lordzuko commented 6 years ago

@davidefiocco I have modified the transform_roc method as below for the entailment task, which is also a classification problem:

def transform_roc(X1, X2, X3):
    # Relies on the module-level globals defined in train.py:
    # encoder, n_ctx, max_len, clf_token, n_vocab and n_special.
    n_batch = len(X1)
    # xmb: (batch, 2 candidate sequences, context positions, [token id, position id])
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    # mmb: mask marking which positions actually contain tokens
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3), in enumerate(zip(X1, X2, X3)):
        # Two input sequences per example: [start] x1 [delimiter] x2/x3 [clf_token]
        x12 = [start] + x1[:max_len] + [delimiter] + x2[:max_len] + [clf_token]
        x13 = [start] + x1[:max_len] + [delimiter] + x3[:max_len] + [clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb

You can refer to my question in #40. Hope this helps.

davidefiocco commented 6 years ago

@lordzuko thanks!

I had seen https://github.com/huggingface/pytorch-openai-transformer-lm/issues/40 and that's excellent guidance for me on how that transform function should change depending on the task. However, given that the README shows an "architectural" difference between classification and entailment (see https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/assets/ftlm.png), I thought the edits should be somewhat more profound with respect to the original transform_roc, and that I should use a function of the form

def transform_snippet(X1):
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, (x1), in enumerate(X1):
        x12 = [start] + x1[:max_len] + [clf_token]
        l12 = len(x12)
        xmb[i, 0, :l12, 0] = x12
        mmb[i, 0, :l12] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb

but as of now this is not working as intended.

lordzuko commented 6 years ago

@davidefiocco Are you getting an error, or is the model just not training as intended? If you are getting an error, can you please share the error trace?

davidefiocco commented 6 years ago

@lordzuko I have tried implementing the steps that I described in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issue-362661674 (hoping they make sense...) and got:

Traceback (most recent call last):
  File "train.py", line 218, in <module>
    teX, teM = transform_snippet(teX1)
  File "train.py", line 29, in transform_snippet
    xmb[i, 0, :l12, 0] = x12
ValueError: setting an array element with a sequence.
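
(For context, numpy raises this particular ValueError when it is asked to put a nested or ragged sequence into scalar slots; a minimal illustration of the failure mode, with made-up token ids, would be:)

import numpy as np

xmb = np.zeros((1, 2, 10, 2), dtype=np.int32)
# If x1 is accidentally a list of lists rather than a flat list of token ids,
# x12 ends up as a ragged mix of ints and lists:
x12 = [40478] + [[249, 853, 1855]] + [40480]
l12 = len(x12)            # 3, but the middle element is itself a list
xmb[0, 0, :l12, 0] = x12  # ValueError: setting an array element with a sequence.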

What I can do to make this reproducible is provide more code / publish a fork, as I modified the current code in several places (see my comment above for the full list) when trying to implement the head change (https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issuecomment-424622723 is one of the changes). I am not very proficient in PyTorch yet (so these may be clumsy changes), hence the questions in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issue-362661674. Most likely, the transform_snippet function posted in https://github.com/huggingface/pytorch-openai-transformer-lm/issues/43#issuecomment-424622723 is not OK.

thomwolf commented 6 years ago

Hi @davidefiocco, your transform_snippet function should be the way to go. I think it's just a Python typo; it looks like your l12 is equal to one. It probably comes from this line: for i, (x1), in enumerate(X1). Try using for i, x1 in enumerate(X1) instead.
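
For reference, a minimal sketch of transform_snippet with that change applied (it still relies on the train.py module-level globals encoder, n_ctx, max_len, clf_token, n_vocab and n_special; whether the second dimension should stay at 2 or be reduced to 1 for a single-input classification head is a separate choice):

def transform_snippet(X1):
    n_batch = len(X1)
    # Second dimension kept at 2 as in the snippet above; a single-input
    # classification head could arguably use 1 here instead.
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    for i, x1 in enumerate(X1):  # plain unpacking, as suggested above
        x12 = [start] + x1[:max_len] + [clf_token]
        l12 = len(x12)
        xmb[i, 0, :l12, 0] = x12
        mmb[i, 0, :l12] = 1
    # Position information that is added to the input embeddings in the TransformerModel
    xmb[:, :, :, 1] = np.arange(n_vocab + n_special, n_vocab + n_special + n_ctx)
    return xmb, mmb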

davidefiocco commented 6 years ago

Hi @thomwolf, thanks for your reply and tip!

As promised, I forked the code; you can find the result at https://github.com/huggingface/pytorch-openai-transformer-lm/compare/master...davidefiocco:master and that specific edit is at https://github.com/davidefiocco/pytorch-openai-transformer-lm/blob/e9945725603544cdebaec91937d4a16f14db0ad8/train.py#L26

In the fork's naming, news stands for "newsgroup", as I tried to classify snippets of text coming from a (2-newsgroup) subset of the 20 newsgroups dataset (http://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). I haven't been successful with the algorithm yet (the code now runs without errors, but the iterations don't seem to converge).
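
(For anyone wanting to build a similar binary subset, a minimal scikit-learn sketch; the category pair below is an arbitrary choice, not necessarily the one used in the fork:)

from sklearn.datasets import fetch_20newsgroups

# An arbitrary two-category subset of the 20 newsgroups data
cats = ['alt.atheism', 'soc.religion.christian']
train = fetch_20newsgroups(subset='train', categories=cats,
                           remove=('headers', 'footers', 'quotes'))
texts, labels = train.data, list(train.target)  # labels are 0/1 ints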

I will update this issue if I manage to get it sorted, and if someone is keen on giving feedback on what needs to be changed in the code I'll be very happy to work on it.

davidefiocco commented 6 years ago

I had another bug, which I think I fixed with https://github.com/davidefiocco/pytorch-openai-transformer-lm/commit/d546da7c7076fac73d8fc850b2d0066edc36680c

And training now seems to converge and to reproduce the 91+% evaluation accuracy on SST-2.

I am still not sure that everything is really fine, but it seems to converge at least now!