stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Inference with self-trained langid model #1072

Open · paulthemagno opened 2 years ago

paulthemagno commented 2 years ago

Hi, I trained a langid model on my dataset following these steps, ending with this command:

python -m stanza.models.lang_identifier --data-dir data  --eval-length 10 --randomize --save-name model.pt --num-epochs 100

At the end, the .pt file is saved in the directory.

How can I test this new model by running inference on some inputs? I saw in the docs how to do that with the standard model, but not with newly trained ones. Thank you!

paulthemagno commented 2 years ago

Probably the best way would be:

import stanza
nlp = stanza.Pipeline("multilingual", langid_model_path="model_stanza.pt")

am I right?

AngledLuffa commented 2 years ago

If it's loading, then I think that must be right...

Let us know if that works, and we'll update the docs!


J38 commented 2 years ago

Yes, that is the proper way to load a custom model in Python code.
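
For reference, a minimal end-to-end sketch (assuming model.pt is the checkpoint saved by your training command, and reading the predicted language off each Document's lang attribute):

import stanza
from stanza.models.common.doc import Document

# load the multilingual pipeline with the custom langid checkpoint
nlp = stanza.Pipeline("multilingual", langid_model_path="model.pt")

# wrap raw strings in Documents so the pipeline annotates them in place
docs = [Document([], text=t) for t in ["Hello world.", "Hallo Welt."]]
nlp(docs)

for doc in docs:
    print(doc.text, "->", doc.lang)  # predicted language code per document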

J38 commented 2 years ago

You can do a more comprehensive eval with this command:

python -m stanza.models.lang_identifier --data-dir data --load-model model.pt --mode eval --eval-length 50 --save-name model-results.jsonl

Full documentation here: https://stanfordnlp.github.io/stanza/langid.html
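
If you need a per-language breakdown, the saved results file can be post-processed. A rough sketch, assuming each line of model-results.jsonl is a JSON record holding a gold label and a predicted label (the field names "gold" and "prediction" below are hypothetical, so check the actual keys in the file):

import json
from collections import defaultdict

correct = defaultdict(int)
total = defaultdict(int)

with open("model-results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # "gold" and "prediction" are hypothetical key names; adjust them
        # to whatever the eval script actually writes per record
        gold, pred = record["gold"], record["prediction"]
        total[gold] += 1
        correct[gold] += int(pred == gold)

per_language_accuracy = {lang: correct[lang] / total[lang] for lang in total}
print(per_language_accuracy)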

paulthemagno commented 2 years ago

It works! Is there also an easy way to extract the accuracy for each language, in order to get something like {"en": 0.8, "de": 0.1, ...}?

paulthemagno commented 2 years ago

I've found a way to achieve this. I wanted to get at the _model output and have percentages that sum to 1 via a softmax layer.

To do that I overrode _process_list of the LangIDProcessor class and prediction_scores of the LangIDBiLSTM class. I'm not sure this is the cleanest way to do it. Do you agree @AngledLuffa?

Also, regarding the config I have to pass to LangIDProcessor, I'm not sure I built the object correctly.

import stanza
import torch
from stanza.pipeline.langid_processor import LangIDProcessor
from stanza.models.common.doc import Document

#taken from https://discuss.pytorch.org/t/apply-mask-softmax/14212/13
def masked_softmax(vec, mask, dim=1):
    masked_vec = vec * mask.float()
    max_vec = torch.max(masked_vec, dim=dim, keepdim=True)[0]
    exps = torch.exp(masked_vec-max_vec)
    masked_exps = exps * mask.float()
    masked_sums = masked_exps.sum(dim, keepdim=True)
    zeros=(masked_sums == 0)
    masked_sums += zeros.float()
    return masked_exps/masked_sums

def get_predictions_scores(text, pipeline, k):
    print(f"Text: {text}")
    print(f"Output of the pipeline directly: {pipeline(text).lang}")

    # keep langid_* options but strip the "langid_" prefix so the keys
    # match what LangIDProcessor expects in its config
    config = {}
    for key in pipeline.config:
        if key.startswith("langid_"):
            config[key.split("langid_")[1]] = pipeline.config[key]
        else:
            config[key] = pipeline.config[key]
    processor = LangIDProcessor(config = config, pipeline = pipeline, use_gpu = True)

    docs = [text]

    #override of _process_list of LangIDProcessor
    if isinstance(docs[0], str):
        docs = [Document([], text) for text in docs]
    docs_by_length = {}
    for doc in docs:
        text = processor.clean_text(doc.text) if processor._clean_text else doc.text
        doc_length = len(text)
        if doc_length not in docs_by_length:
            docs_by_length[doc_length] = []
        docs_by_length[doc_length].append((doc, text))

    for doc_length in docs_by_length:
        inputs = [doc[1] for doc in docs_by_length[doc_length]]
        #override of prediction_scores of LangIDBiLSTM to get the predictions
        x = processor._text_to_tensor(inputs)
        prediction_probs = processor._model(x)
        if processor._model.lang_subset:
            prediction_batch_size = prediction_probs.size()[0]
            batch_mask = torch.stack([processor._model.lang_mask for _ in range(prediction_batch_size)])
            prediction_probs = prediction_probs * batch_mask
            prediction_probs = masked_softmax(vec = prediction_probs, mask = batch_mask)
        else:
            softmax = torch.nn.Softmax(dim = 1)
            prediction_probs = softmax(prediction_probs)
        topk = torch.topk(prediction_probs, k)
        pred_scores = {}
        for i,pred in enumerate(topk.indices[0]):
            print(f"Language: {processor._model.idx_to_tag[pred]}: {topk.values[0][i]}")
            pred_scores[processor._model.idx_to_tag[pred]] = topk.values[0][i].item()

    return pred_scores

model_name = "my_model"
pipeline = stanza.Pipeline("multilingual", langid_model_path=model_name+".pt")
pred_scores = get_predictions_scores(text = "hello how are you?", pipeline = pipeline, k=3)
pred_scores

Anyway now I have this output:

{'en': 0.9999321699142456,
 'nn': 2.8831263989559375e-05,
 'nl': 1.2516763490566518e-05}

EDIT: I used a masked_softmax in case processor._model.lang_subset is set, taking it from here: https://discuss.pytorch.org/t/apply-mask-softmax/14212/13

Without it, the percentages seemed wrong. I don't know if this is the most correct way to do it. This also seems to solve another issue I found: https://github.com/stanfordnlp/stanza/issues/1076
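
To see why this matters: multiplying the logits by the mask sends excluded entries to 0 rather than to -inf, and exp(0) = 1, so a plain softmax still assigns the excluded languages probability mass. A small self-contained demo (the tensors here are made up for illustration):

import torch

logits = torch.tensor([[2.0, -1.0, 0.5]])
mask = torch.tensor([[1.0, 1.0, 0.0]])  # pretend the third language is outside lang_subset

# plain softmax over mask-multiplied logits: the excluded entry becomes 0, not -inf,
# so it still contributes exp(0) = 1 of unnormalized mass (~11% here)
naive = torch.softmax(logits * mask, dim=1)

# masked softmax: zero the exponentials of excluded entries before normalizing,
# so only the allowed languages share the probability mass
masked_logits = logits * mask
exps = torch.exp(masked_logits - masked_logits.max(dim=1, keepdim=True)[0]) * mask
masked = exps / exps.sum(dim=1, keepdim=True)

print(naive)   # excluded entry gets nonzero probability -- wrong
print(masked)  # excluded entry is exactly 0 and the rest sum to 1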

AngledLuffa commented 2 years ago

Ah, I misunderstood your previous question. You had said you wanted the accuracy for all languages, but what you want is the predictions for all languages. I should be able to add that functionality to the processor. Even better, would you be up for turning the code in this message into a pull request against the dev branch, touching stanza/models/langid/model.py and/or stanza/pipeline/langid_processor.py?


stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AngledLuffa commented 1 year ago

Ping regarding this - are you interested in making this block of code into a PR?