huggingface / api-inference-community

Apache License 2.0

Adding special tokens in text2text generation task #165

Open techthiyanes opened 1 year ago

techthiyanes commented 1 year ago

Hi Team,

Could anyone please enable displaying special tokens for seq2seq models? Currently, seq2seq model outputs from the Inference API are displayed without special tokens, even though the special tokens are added as part of the tokenizer. How could we pass tokenizer arguments like add_special_tokens=True during Inference API calls? These parameters are not currently accepted when generating text on the decoder side.
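A minimal, self-contained sketch of the behavior being described (this is a toy mock, not the real transformers API, and the token names are hypothetical): generation pipelines typically decode with the equivalent of skip_special_tokens=True, which is why special tokens never reach the caller.

```python
# Toy illustration of why special tokens disappear from seq2seq output:
# decoding typically filters them out, mimicking
# tokenizer.decode(ids, skip_special_tokens=True) in transformers.

SPECIAL_TOKENS = {"<pad>", "</s>", "<triplet>"}  # hypothetical special tokens

def decode(tokens, skip_special_tokens=True):
    """Join generated tokens, optionally filtering out the special ones."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)

generated = ["<triplet>", "Paris", "capital", "of", "France", "</s>"]
print(decode(generated))                             # special tokens hidden
print(decode(generated, skip_special_tokens=False))  # special tokens kept
```

The request in this issue amounts to being able to reach the skip_special_tokens=False path through the API.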

Narsil commented 1 year ago

Hi, why do you want that option?

Sorry, but we try to limit the number of parameters available (for simplicity).

This is also not available in the transformers pipeline (which this API is derived from).

Could you maybe start an issue in transformers for that support, documenting as much as possible why, and in what context, you need this option? If we enable it in transformers, it will instantly become available in the API (albeit not necessarily documented).

Cheers.

techthiyanes commented 1 year ago

Hi, thanks for your response. Some seq2seq models have special tokens defined as part of config.json. While tokenizing an input phrase, we have the option to pass add_special_tokens=True to the tokenizer, and those special tokens will then appear in the beam/greedy search output. We don't have a way to enable this parameter in the Inference API: the only parameters I can pass are those related to the generate method, such as do_sample, num_beams, and so on. Let me know if you need any further details.
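To make the gap concrete, here is a sketch of a text2text-generation request body as described in this thread (the input string and parameter values are made up for illustration): generate()-style parameters pass through, but tokenizer arguments have no place in the schema.

```python
# Sketch of an Inference API request body for a text2text-generation model.
# The "parameters" object is forwarded to generation; tokenizer arguments
# such as add_special_tokens are not part of it, which is the point of
# this issue.
import json

payload = {
    "inputs": "Translate to German: Hello, world.",
    "parameters": {
        "do_sample": False,   # generate() args like these are accepted
        "num_beams": 4,
        # "add_special_tokens": True  # tokenizer arg; no way to pass this
    },
}
print(json.dumps(payload, indent=2))
```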

Narsil commented 1 year ago

Special tokens are meant to be non-readable, if you want to use readable tokens, couldn't you use regular added tokens ?

(tokenizer.add_tokens vs tokenizer.add_special_tokens IIRC)

Special tokens are special mostly because they are not shown. Tokens like [CLS] and [EOS] are generally not very interesting to read and do not correspond to what the model is saying, and that's why they are not displayed, right?

techthiyanes commented 1 year ago

Thanks for your response.

Some models require their special tokens to be displayed, because those tokens enable meaningful post-processing. For example, seq2seq entity-extraction models have special tokens added to them, and based on those special tokens a user can extract the entity results they need. Example model: https://huggingface.co/Babelscape/rebel-large
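To illustrate the use case, here is a hypothetical post-processing step for a REBEL-style output, assuming the generated text keeps marker tokens of the form `<triplet> head <subj> tail <obj> relation` (the exact token names and layout are assumptions based on the linked model's description). Without those markers in the decoded text, the output cannot be split back into triplets.

```python
# Hypothetical parser for a REBEL-style seq2seq output that retains its
# special tokens. Each <triplet> starts a new (head, relation, tail) record;
# <subj> and <obj> switch which field the following words belong to.

def extract_triplets(text):
    triplets = []
    head = tail = relation = ""
    current = None
    for token in text.split():
        if token == "<triplet>":
            if head and tail and relation:
                triplets.append({"head": head.strip(),
                                 "type": relation.strip(),
                                 "tail": tail.strip()})
            head, tail, relation = "", "", ""
            current = "head"
        elif token == "<subj>":
            current = "tail"
        elif token == "<obj>":
            current = "relation"
        elif current == "head":
            head += " " + token
        elif current == "tail":
            tail += " " + token
        elif current == "relation":
            relation += " " + token
    if head and tail and relation:
        triplets.append({"head": head.strip(),
                         "type": relation.strip(),
                         "tail": tail.strip()})
    return triplets

sample = "<triplet> Punta Cana <subj> Dominican Republic <obj> country"
print(extract_triplets(sample))
```

With skip_special_tokens applied by the API, only the bare words would survive, and this kind of parsing becomes impossible.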

Thanks