Open dehghanm opened 2 years ago
Hi
I want to use the RoBERTa tokenizer. The following example shows how I am calling it.
from transformers import AutoTokenizer
model_name = "HooshvareLab/roberta-fa-zwnj-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
string = "این یک سند است"
tokenized_string = tokenizer.tokenize(string)
print(tokenized_string)
The result of the above code is as follows:
['اÛĮÙĨ', 'ĠÛĮÚ©', 'ĠسÙĨد', 'Ġاست']
However, it should be: ["این", "یک", "سند", "است"]
How can I get readable tokens like these?
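A note on what is probably happening, assuming this model uses GPT-2-style byte-level BPE as RoBERTa tokenizers normally do: the tokens are not corrupted. Each UTF-8 byte of the input is mapped to a printable stand-in character, and `Ġ` stands for a leading space. The sketch below rebuilds that byte table in plain Python to show the mapping is reversible. The helper names (`bytes_to_unicode`, `to_byte_level`, `from_byte_level`) are illustrative and not part of the `transformers` API, and the word list assumes each word is a single token, as in the output above.

```python
def bytes_to_unicode():
    """Build a byte -> printable character table in the style of GPT-2's byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Bytes without a printable form are moved to code points 256 and up.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Encode text as UTF-8 and replace every byte with its stand-in character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Map stand-in characters back to bytes and decode them as UTF-8."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token).decode("utf-8")


# Words from the example sentence; the leading space marks a word boundary.
words = ["این", " یک", " سند", " است"]
tokens = [to_byte_level(w) for w in words]
decoded = [from_byte_level(t).strip() for t in tokens]

print(tokens)   # byte-level form, similar to what tokenizer.tokenize() prints
print(decoded)  # readable words again
```

With the real tokenizer, calling `tokenizer.convert_tokens_to_string([token])` on each token and stripping the leading space should give the same readable form without a hand-written table. The byte-level tokens are still what the model expects as input, so this conversion is only needed for display.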