gmihaila / ml_things

This is where I put things I find useful that speed up my work with Machine Learning. Ever looked through your old projects to reuse those cool functions you created before? Well, this repo is designed to be a Python library of reusable functions I created in previous projects. I also share some Notebook Tutorials and Python Code Snippets.
https://gmihaila.github.io
Apache License 2.0

using a fine-tuned model for prediction on unlabeled data #19

Closed: azespinoza closed this issue 2 years ago

azespinoza commented 2 years ago

Hi,

Thanks for creating these helpers and finetuning tutorials. I've found this repo to be very helpful and appreciate all the work you've put into it :).

One question I have regards the following validation function from your gpt2-finetune-classification notebook. Here is the code snippet:

    # speeding up validation
    with torch.no_grad():
        # Forward pass, calculate logit predictions.
        # This will return the logits rather than the loss because we have
        # not provided labels.
        # token_type_ids is the same as the "segment ids", which 
        # differentiates sentence 1 and 2 in 2-sentence tasks.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(**batch)

        # The call to `model` always returns a tuple, so we need to pull the
        # loss value out of the tuple along with the logits. We will use logits
        # later to calculate training accuracy.
        loss, logits = outputs[:2]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # get predictions as a list
        predict_content = logits.argmax(axis=-1).flatten().tolist()

        # update list
        predictions_labels += predict_content

I'm trying to repurpose this to predict on unlabeled data. Although the comments in the code suggest that labels have not been provided, I get errors when I feed in a label dictionary with one extra "other" category (indicating that no class has been assigned):

-> 2264         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2265     elif dim == 4:
   2266         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target 13 is out of bounds.
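
As far as I can tell, the loss computation is seeing a label id that falls outside the number of classes the model was configured with. A hypothetical minimal snippet (not from the notebook) that reproduces the same error:

    import torch
    import torch.nn.functional as F

    # With 13 classes, valid target ids are 0..12; the extra "other" id (13)
    # falls outside that range and triggers the same IndexError.
    log_probs = F.log_softmax(torch.randn(1, 13), dim=-1)
    target = torch.tensor([13])
    F.nll_loss(log_probs, target)  # IndexError: Target 13 is out of bounds.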

Is there a good way to change this so that the labels and label dictionary are not fed into the model, and the loss is not calculated? Sorry if there is an obvious solution; it seems to be evading me right now.

Thank you for the help!

gmihaila commented 2 years ago

@azespinoza Thank you for the nice comments! 😃

If I understand correctly, you want to get only the model's predictions, without providing any labels. If that is the case:

The batch that goes into the model is a dictionary that contains the key `labels`. You can simply remove that key, and also drop the `true_labels` list, so you only collect the model's predictions.
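
Here is a minimal sketch of what the prediction loop could look like (assuming `model`, `device`, and `dataloader` come from your own setup, and that `dataloader` yields the same batch dictionaries as in the notebook):

    import torch

    model.eval()
    predictions_labels = []

    for batch in dataloader:
        # Drop the labels (if present) so the model skips the loss computation.
        batch.pop('labels', None)
        # Move the remaining tensors to the same device as the model.
        batch = {k: v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            # Without labels, the first element of the output is the logits
            # (on newer transformers versions you can also use `outputs.logits`).
            outputs = model(**batch)
            logits = outputs[0]

        # Move logits to CPU and convert to predicted label ids.
        logits = logits.detach().cpu().numpy()
        predictions_labels += logits.argmax(axis=-1).flatten().tolist()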

If my understanding is off, please provide an example or more explanation so I can understand what it is you are trying to do.

azespinoza commented 2 years ago

Yes, this helps a lot! Thank you for the detailed answer!