hooshvare / parsner

Pre-Trained NER models for Persian 🦁

Roberta Tokenizer #2

Open dehghanm opened 2 years ago

dehghanm commented 2 years ago

Hi

I want to use the RoBERTa tokenizer. The following example shows how I do this:

```python
from transformers import AutoTokenizer

model_name = "HooshvareLab/roberta-fa-zwnj-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

string = "این یک سند است"
tokenized_string = tokenizer.tokenize(string)
print(tokenized_string)
```

The result of the above code is:

['اÛĮÙĨ', 'ĠÛĮÚ©', 'ĠسÙĨد', 'Ġاست']

However, it should be:

["این", "یک", "سند", "است"]

How do you suggest solving this issue?
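
(For context: the output above is not corrupted text but the byte-level BPE representation that RoBERTa-style tokenizers use internally; the `Ġ` prefix marks a leading space. Below is a minimal sketch, assuming the standard Hugging Face `transformers` API, of one way to map the pieces back to readable Persian; the variable names are only illustrative.)

```python
from transformers import AutoTokenizer

model_name = "HooshvareLab/roberta-fa-zwnj-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

string = "این یک سند است"

# tokenize() returns byte-level BPE pieces, which look garbled for non-ASCII text.
pieces = tokenizer.tokenize(string)

# Converting each piece back through the tokenizer decodes the byte-level
# characters into normal UTF-8, as long as each piece covers complete
# UTF-8 characters; word-initial pieces keep their leading space.
readable = [tokenizer.convert_tokens_to_string([p]) for p in pieces]
print(readable)  # expected: ['این', ' یک', ' سند', ' است']
```

An equivalent route is to decode each token id individually, e.g. `[tokenizer.decode([i]) for i in tokenizer(string, add_special_tokens=False)["input_ids"]]`.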