nlpodyssey / spago

Self-contained Machine Learning and Natural Language Processing library in Go

Multi-label BERT classifier from PyTorch #100

Closed jimidle closed 3 years ago

jimidle commented 3 years ago

So I can convert and then load my BERT model, but I am having trouble working out how to operate it from Spago.

It is a multi-label model and to use it in Python I do this:

    text_enc = bert_tokenizer.encode_plus(
            texttoclassify,
            None,
            add_special_tokens=True,
            max_length=MAX_LEN,
            padding='max_length',
            return_token_type_ids=False,
            return_attention_mask=True,
            truncation=True,
            return_tensors='pt'
    )

    # mymodel implements pl.LightningModule
    #
    outputs = mymodel(text_enc['input_ids'], text_enc['attention_mask'])
    pred_out = outputs[0].detach().numpy()

And then process the pred_out array. This model has 5 outputs, and everything works as expected in Python.

So, how do I perform the equivalent in Spago? Borrowing code from the classifier server, I have got this far, but it just isn't obvious what I need to modify to cater for a 5-label output layer.


func getTokenized(vocab *vocabulary.Vocabulary, text string) []string {
    cls := wordpiecetokenizer.DefaultClassToken
    sep := wordpiecetokenizer.DefaultSequenceSeparator
    tokenizer := wordpiecetokenizer.New(vocab)
    tokenized := append([]string{cls}, append(tokenizers.GetStrings(tokenizer.Tokenize(text)), sep)...)
    return tokenized
}

// ....
    model, err := bert.LoadModel(dir)
    if err != nil {
        log.Fatalf("error during model loading (%v)\n", err)
    }
    defer model.Close()

    // We need a classifier that matches the output layer of our model.
    //
    var bc = bert.ClassifierConfig{
        InputSize: 768,
        Labels:    []string{"A", "B", "C", "D", "E"},
    }
    model.Classifier = bert.NewTokenClassifier(bc)

    tokenized := getTokenized(model.Vocabulary, s)

    g := ag.NewGraph(ag.ConcurrentComputations(runtime.NumCPU()))
    proc := nn.ReifyForInference(model, g).(*bert.Model)
    encoded := proc.Encode(tokenized)

    logits := proc.SequenceClassification(encoded)
    probs := floatutils.SoftMax(logits.Value().Data())

However, this just gives me 0.2 for each label, so I seem to be miles off. Is there an example, or can a short code sequence be provided? Is the wordpiecetokenizer even the correct thing to use?

matteo-grella commented 3 years ago

Hey @jimidle,

here is an application of BERT to text classification.

In the example I used ProsusAI/finbert, a BERT model fine-tuned to analyze the sentiment of financial text. It is a multi-label classifier with 3 classes (positive, negative, neutral), but the example should scale to "any" number of classes.
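The flow itself is the same as in your snippet; the key point is that the fine-tuned labels should already be in the loaded model's Classifier.Config (taken from config.json), so there is no need to build a new classifier after loading. Roughly, it looks like this (a minimal sketch reusing your identifiers; the import paths follow the v0.x layout and the element type returned by Data() varies across versions, so adjust as needed):

    import (
        "fmt"
        "math"
        "runtime"

        "github.com/nlpodyssey/spago/pkg/ml/ag"
        "github.com/nlpodyssey/spago/pkg/ml/nn"
        "github.com/nlpodyssey/spago/pkg/nlp/transformers/bert"
    )

    // classify runs sequence classification over a single text, reusing
    // getTokenized from above. It assumes the labels were picked up from
    // config.json at load time, so the Classifier is used as-is.
    func classify(model *bert.Model, text string) {
        g := ag.NewGraph(ag.ConcurrentComputations(runtime.NumCPU()))
        defer g.Clear()
        proc := nn.ReifyForInference(model, g).(*bert.Model)

        tokenized := getTokenized(model.Vocabulary, text)
        encoded := proc.Encode(tokenized)
        logits := proc.SequenceClassification(encoded).Value().Data()

        probs := softMax(logits)
        for i, label := range model.Classifier.Config.Labels {
            fmt.Printf("%-10s %.4f\n", label, probs[i])
        }
    }

    // softMax is a plain numerically-stable softmax over the raw logits;
    // swap the float type if your spago build returns []float32 from Data().
    func softMax(xs []float64) []float64 {
        max := xs[0]
        for _, x := range xs[1:] {
            if x > max {
                max = x
            }
        }
        out := make([]float64, len(xs))
        sum := 0.0
        for i, x := range xs {
            out[i] = math.Exp(x - max)
            sum += out[i]
        }
        for i := range out {
            out[i] /= sum
        }
        return out
    }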

Let me know if you get the desired results with this code or if you are still having problems. I'm glad to help you :)

jimidle commented 3 years ago

Thank you for the pointer Matteo. This code at least shows me that I was doing the right thing Spago-wise.

However, after loading my model, the Classifier.Config only contains the two default labels LABEL_0 and LABEL_1. So, I suspect that my previous question about saving my fine-tuned Lightning model in a form that Spago can load is the key here. I must not be saving it correctly.
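For reference, all I am doing to see that is printing the config straight after loading, reusing the loading code from my first comment:

    model, err := bert.LoadModel(dir)
    if err != nil {
        log.Fatalf("error during model loading (%v)\n", err)
    }
    defer model.Close()

    // With a correct export these should be my five task labels,
    // not the defaults.
    fmt.Println(model.Classifier.Config.Labels) // currently prints [LABEL_0 LABEL_1]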

I wonder if my Python code is saving the starting model (bert-base-cased) and not my fine-tuned version. I am not a big fan of Python, and it is difficult to trace through what it is doing. I will dig into this - the documentation for the Python code is all over the place, and I can't find anything that says "how to save a Lightning model after you have trained it, in the usual BERT model form" - the code just seems to expect that once you have used Lightning, you will load it back in its checkpoint format.

I will keep looking at that aspect. If you have any clues on that, then I am happy to receive them.

jimidle commented 3 years ago

Thanks Matteo - that hint got me on the right path, and it was indeed an issue with not exporting the right model from the Python code, specifically from PyTorch Lightning. Then I had to fix the config.json output. I now get the same labels from my Spago code as from the Python code, though the processing differs a little, which is fine, as the Spago output is more what I want.
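For anyone following along: the converter appears to take the classifier labels from the Hugging Face-style id2label map in config.json, so after the fix mine contains entries along these lines (label names as in my earlier snippet):

    "id2label": {
        "0": "A",
        "1": "B",
        "2": "C",
        "3": "D",
        "4": "E"
    }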

Anyway, I know what I am doing now, and I think I am in a position to start helping with the documentation.

My only concern is that the proc.Encode(tokenized) call takes a long time. I will do some code investigation. Perhaps there are optimizations to be made, or perhaps that is just how long this operation takes.
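As a first step I will just time the call with the standard library to confirm that Encode is where the time goes:

    start := time.Now()
    encoded := proc.Encode(tokenized)
    log.Printf("Encode took %v for %d tokens", time.Since(start), len(tokenized))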