huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Customize pretrained model for model hub #11905

Closed. Matthieu-Tinycoaching closed this issue 3 years ago.

Matthieu-Tinycoaching commented 3 years ago

Hi community,

I would like to add a mean-pooling step inside a custom SentenceTransformer class derived from the model sentence-transformers/stsb-xlm-r-multilingual, so that I don't have to apply this extra step after getting the token embeddings.

My aim is to push this custom model to the model hub. Without this custom step, it is trivial, as below:

from transformers import AutoTokenizer, AutoModel

#### Simple export ####
## Instantiate the model
model_name = "sentence-transformers/stsb-xlm-r-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

## Save the model and tokenizer files into cloned repository
model.save_pretrained("path/to/repo/clone/your-model-name")
tokenizer.save_pretrained("path/to/repo/clone/your-model-name")
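For context, the supplementary step I am trying to fold into the model is the usual mean pooling over the token embeddings. A minimal sketch of how it currently runs outside the model (the input sentence and variable names are only illustrative):

import torch

## Illustrative input
sentences = ["This is an example sentence."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

## Token embeddings from the vanilla model
with torch.no_grad():
    token_embeddings = model(**encoded)[0]  # (batch_size, seq_len, hidden_size)

## Mean pooling: mask-weighted sum divided by the number of real (non-padding) tokens
mask = encoded["attention_mask"].unsqueeze(-1).expand(token_embeddings.size()).float()
sentence_embeddings = torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)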

However, after defining my custom class SentenceTransformerCustom, I can't manage to push the definition of this class to the model hub:

import transformers
import torch

#### Custom export ####
## 1. Load feature-extraction pipeline with specific sts model
model_name = "sentence-transformers/stsb-xlm-r-multilingual"
pipeline_name = "feature-extraction"
nlp = transformers.pipeline(pipeline_name, model=model_name, tokenizer=model_name)
tokenizer = nlp.tokenizer

## 2. Set up a simple torch model that inherits from XLMRobertaModel. The only thing we add
##    is mean pooling over the token embeddings: a mask-weighted sum divided by the number of
##    real tokens, with a clamp to prevent zero-division errors.
class SentenceTransformerCustom(transformers.XLMRobertaModel):
    def __init__(self, config):
        super().__init__(config)
        # Naming alias for ONNX output specification
        # Makes it easier to identify the layer
        self.sentence_embedding = torch.nn.Identity()

    def forward(self, input_ids, attention_mask):
        # Get the token embeddings from the base model
        token_embeddings = super().forward(
            input_ids, 
            attention_mask=attention_mask
        )[0]
        # Stack the pooling layer on top of it
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return self.sentence_embedding(sum_embeddings / sum_mask)

## 3. Create the custom model from the pretrained weights (from_pretrained is a classmethod,
##    so it can be called directly on the custom class)
model = SentenceTransformerCustom.from_pretrained(model_name)

## 4. Save the model and tokenizer files into cloned repository
model.save_pretrained("/home/matthieu/Deployment/HF/stsb-xlm-r-multilingual")
tokenizer.save_pretrained("/home/matthieu/Deployment/HF/stsb-xlm-r-multilingual")
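With the custom class, the sentence embeddings then come straight out of the forward pass. A minimal sketch of the intended usage (again with an illustrative input):

## Illustrative usage of the custom model
sentences = ["This is an example sentence."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    sentence_embeddings = model(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
    )
## sentence_embeddings has shape (batch_size, hidden_size); no extra pooling step is needed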

Do I need to place this custom class definition inside a specific .py file? Or is there anything else to do in order to correctly import this custom class from the model hub?

Thanks!

LysandreJik commented 3 years ago

Maybe of interest to @nreimers

nreimers commented 3 years ago

Hi @Matthieu-Tinycoaching

I was sadly not able to reproduce your error. Have you uploaded such a model to the hub? Could you post the link here?

And what does the code you use to load the model look like?

Matthieu-Tinycoaching commented 3 years ago

Hi @nreimers

I retried, this time including the custom class definition when loading the model, and it worked.
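In other words, re-declaring the SentenceTransformerCustom class in the loading script and then calling from_pretrained on it did the trick. A minimal sketch, with a placeholder repository name:

import transformers
import torch
from transformers import AutoTokenizer

## The custom class definition from above must be available in the loading script
class SentenceTransformerCustom(transformers.XLMRobertaModel):
    def __init__(self, config):
        super().__init__(config)
        self.sentence_embedding = torch.nn.Identity()

    def forward(self, input_ids, attention_mask):
        token_embeddings = super().forward(input_ids, attention_mask=attention_mask)[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return self.sentence_embedding(sum_embeddings / sum_mask)

## "your-username/your-model-name" is a placeholder for the hub repository
model = SentenceTransformerCustom.from_pretrained("your-username/your-model-name")
tokenizer = AutoTokenizer.from_pretrained("your-username/your-model-name")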