meta-llama / llama3

The official Meta Llama 3 GitHub site

Can no longer extend the vocab of LLaMA-3 using SentencePiece (unlike LLaMA-2)?! #67

Closed thusinh1969 closed 5 months ago

thusinh1969 commented 6 months ago

I usually extend the vocab to make the model better suited to the Vietnamese language; the code is below. However, it seems the LLaMA-3 tokenizer no longer works with SentencePiece, and even LlamaTokenizer is no longer compatible with LLaMA-3. Any hints, please?

In the meantime, the standard AutoTokenizer can no longer load LLaMA-3's new tokenizer.model. Any help is highly appreciated.

import os
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model

def extendVocab(tokenizer, source_tokenizer_file,
                extra_vocab_model_files, output_path, reload=True, verbose=False):

  # load current tokenizer proto
  print('Create current vocab proto...')
  source_tokenizer = tokenizer.from_pretrained(source_tokenizer_file, trust_remote_code=True)
  try:
      base_spm = sp_pb2_model.ModelProto()
      base_spm.ParseFromString(source_tokenizer.sp_model.serialized_model_proto())  ### <---- error here !
  except AttributeError:
      # fast tokenizers (e.g. LLaMA-3) have no sp_model; fall back to the plain vocab dict
      base_spm = source_tokenizer.get_vocab()

  for new_vocab in extra_vocab_model_files:
      # load the extra SentencePiece model
      print('Loading extra vocab file...', new_vocab)
      VN_sp_model = spm.SentencePieceProcessor()
      VN_sp_model.Load(new_vocab)
      print(len(VN_sp_model))
      # load the extra tokenizer proto
      print('Create extra vocab proto...')
      VN_spm = sp_pb2_model.ModelProto()
      VN_spm.ParseFromString(VN_sp_model.serialized_model_proto())

      # print number of tokens
      print("Source tokenizer len:", len(source_tokenizer))
      print("Extra tokenizer len:", len(VN_sp_model))
      print(source_tokenizer.all_special_tokens)
      print(source_tokenizer.all_special_ids)
      print(source_tokenizer.special_tokens_map)

      print('Adding extra vocab into current vocab ...')

      ## Add extra tokens to the current tokenizer
      spm_tokens_set = set(p.piece for p in base_spm.pieces)
      print(f"Before: {len(spm_tokens_set)}")

      for p in VN_spm.pieces:
          piece = p.piece
          if piece not in spm_tokens_set:
              if verbose:
                  print(piece)
              new_p = sp_pb2_model.ModelProto().SentencePiece()
              new_p.piece = piece
              new_p.score = 0
              base_spm.pieces.append(new_p)

      print(f"New model pieces: {len(base_spm.pieces)}")

  target_path_sp = "/".join(output_path.split('/')[:-1]) + "/sp"
  target_file = output_path.split('/')[-1]
  os.makedirs(target_path_sp, exist_ok=True)
  print('Saving new tokenizer sp model:', target_path_sp + "/" + target_file)
  with open(target_path_sp + "/" + target_file, 'wb') as f:
      f.write(base_spm.SerializeToString())

  print('Reloading sp model..')
  reload_extended_tokenizer = tokenizer(target_path_sp + "/" + target_file)
  hf_output_path = "/".join(output_path.split('/')[:-1]) + "/hf"
  os.makedirs(hf_output_path, exist_ok=True)
  print('Saving new tokenizer hf model ...', hf_output_path)
  reload_extended_tokenizer.save_pretrained(hf_output_path)

  text = '''Những công trình vĩ đại của bác Hồ Chí minh đã ghi dấu ấn lớn cho toàn thế giới và nhân loại. Bác là người đáng yêu.
  The primary use of LLaMA is research on large language models, including'''

  print(f"Tokenized by origin tokenizer: {source_tokenizer.tokenize(text)}")
  print(f"Tokenized by new tokenizer: {reload_extended_tokenizer.tokenize(text)}")

  print('Reloading completely new HF tokenizer ...')

  reloaded_tokenizer = tokenizer.from_pretrained(hf_output_path, trust_remote_code=True)
  print(reloaded_tokenizer)
  return reloaded_tokenizer

Thanks, Steve

StephennFernandes commented 6 months ago

cc @ArthurZucker

Is there a way this could be handled in HF tokenizers?

A few pointers and/or some code would really help a lot of folks.

amitsangani commented 6 months ago

Llama 3 has an improved tokenizer based on Tiktoken, versus Llama 2, which was based on SentencePiece. The Llama 3 tokenizer expands the vocabulary size to 128k (from 32k tokens in the previous version).

https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py

Can you try AutoTokenizer instead of LlamaTokenizer?
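For reference, a minimal sketch of what that looks like, assuming access to the gated meta-llama/Meta-Llama-3-8B checkpoint on the Hub (the repo id is only an example):

from transformers import AutoTokenizer

# Llama 3 loads as a fast, tiktoken-style BPE tokenizer, not a SentencePiece-backed one
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(type(tok).__name__)  # PreTrainedTokenizerFast
print(len(tok))            # 128256 (128k BPE vocab plus 256 special tokens)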

StephennFernandes commented 6 months ago

@amitsangani AutoTokenizer doesn't work.

Ideally, the following was the go-to script for extending the tokenizer in LLaMA-2:

import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]="python"
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
import sentencepiece as spm
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--llama_tokenizer_dir', default="meta-llama/Llama-2-7b-hf", type=str)
parser.add_argument('--chinese_sp_model_file', default='./chinese_sp.model', type=str)
args = parser.parse_args()

llama_tokenizer_dir = args.llama_tokenizer_dir
chinese_sp_model_file = args.chinese_sp_model_file

# load
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_sp_model = spm.SentencePieceProcessor()
chinese_sp_model.Load(chinese_sp_model_file)

llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_spm = sp_pb2_model.ModelProto()
chinese_spm.ParseFromString(chinese_sp_model.serialized_model_proto())

# print number of tokens
print(len(llama_tokenizer),len(chinese_sp_model))
print(llama_tokenizer.all_special_tokens)
print(llama_tokenizer.all_special_ids)
print(llama_tokenizer.special_tokens_map)

## Add Chinese tokens to LLaMA tokenizer
llama_spm_tokens_set=set(p.piece for p in llama_spm.pieces)
print(len(llama_spm_tokens_set))
print(f"Before:{len(llama_spm_tokens_set)}")
for p in chinese_spm.pieces:
    piece = p.piece
    if piece not in llama_spm_tokens_set:
        new_p = sp_pb2_model.ModelProto().SentencePiece()
        new_p.piece = piece
        new_p.score = 0
        llama_spm.pieces.append(new_p)
print(f"New model pieces: {len(llama_spm.pieces)}")

## Save
output_sp_dir = 'merged_tokenizer_sp'
output_hf_dir = 'merged_tokenizer_hf' # the path to save Chinese-LLaMA tokenizer
os.makedirs(output_sp_dir,exist_ok=True)
with open(output_sp_dir+'/chinese_llama.model', 'wb') as f:
    f.write(llama_spm.SerializeToString())
tokenizer = LlamaTokenizer(vocab_file=output_sp_dir+'/chinese_llama.model')

tokenizer.save_pretrained(output_hf_dir)
print(f"Chinese-LLaMA tokenizer has been saved to {output_hf_dir}")

# Test
llama_tokenizer = LlamaTokenizer.from_pretrained(llama_tokenizer_dir)
chinese_llama_tokenizer = LlamaTokenizer.from_pretrained(output_hf_dir)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)
print(tokenizer.special_tokens_map)
text='''白日依山尽,黄河入海流。欲穷千里目,更上一层楼。
The primary use of LLaMA is research on large language models, including'''
print("Test text:\n",text)
print(f"Tokenized by LLaMA tokenizer:{llama_tokenizer.tokenize(text)}")
print(f"Tokenized by Chinese-LLaMA tokenizer:{chinese_llama_tokenizer.tokenize(text)}")

Upon changing LlamaTokenizer to AutoTokenizer and trying to extend the tokenizer on LLaMA-3, the following error occurs:

  File "/media/user/drive_2/tokenizer_extension/merge_tokenizer.py", line 21, in <module>
    llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
                              ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'sp_model'

cc @ArthurZucker, does this look like an HF issue? Currently running transformers version 4.33.1.
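For context, a hedged sketch of why the attribute is missing: the Llama 3 tokenizer is a PreTrainedTokenizerFast backed by a tokenizers.Tokenizer, so there is no SentencePiece proto to parse; the vocabulary can be read directly instead (the repo id below is only an example):

from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
vocab = llama_tokenizer.get_vocab()          # token string -> id; no .sp_model involved
print(len(vocab))                            # 128256
backend = llama_tokenizer.backend_tokenizer  # the underlying tokenizers.Tokenizer object
print(type(backend))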

thusinh1969 commented 6 months ago

Llama 3 has an improved tokenizer based on Tiktoken, versus Llama 2, which was based on SentencePiece. The Llama 3 tokenizer expands the vocabulary size to 128k (from 32k tokens in the previous version).

https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py

Can you try AutoTokenizer instead of LlamaTokenizer?

I tried, but no luck. A quick code example would help. Even the 128k vocab still does not cover the basic vocabulary of Vietnamese.

Thanks in advance. Steve

StephennFernandes commented 6 months ago

Despite setting use_fast=False when loading the Llama tokenizer with AutoTokenizer, I still get the same error.

(see the LLaMA-2 extension script and the resulting 'PreTrainedTokenizerFast' object has no attribute 'sp_model' error in the earlier comment above)

thusinh1969 commented 6 months ago

Any help please...!

amitsangani commented 6 months ago

@osanseviero @HamidShojanazeri - any ideas on how to resolve this?

VishnuPJ commented 6 months ago

@StephennFernandes, any update? I am also trying to do the same.

thusinh1969 commented 6 months ago

I did it like this, and I am not very sure whether this destroys the LLaMA-3 tokenizer or not!!! Please comment.

model_name = "/home/steve/data02/LLaMA/LLaMA-3/models/llama-3-8b-instruct/"

from transformers import AutoTokenizer
model_name = "/home/steve/data02/LLaMA/LLaMA-3/models/llama-3-8b-instruct/"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Check length of LLaMA-3 tokenizer
len(tokenizer)
>>> 128256

# Check tokenizing Vietnamese
tokenizer.tokenize("Tôi nhớ lắm Bác Hồ kính yêu của đạo phật")
>>> ['Tôi',
 'ĠnhỼ',
 'Ġl',
 'ắm',
 'ĠB',
 'ác',
 'ĠHá»ĵ',
 'ĠkÃŃnh',
 'Ġyêu',
 'Ġcủa',
 'ĠÄijạo',
 'Ġph',
 'áºŃt']

# Check tokenizing English
tokenizer.tokenize("My English class will open in June 2024")
>>> ['My', 'ĠEnglish', 'Ġclass', 'Ġwill', 'Ġopen', 'Ġin', 'ĠJune', 'Ġ', '202', '4']

# Add all 4 new vocabs 
all_vocabs = ["/home/steve/data01/VN-vocab-model/poem_dataset/PRE-TRAINING-200G/VOCAB_COMBINED/VN-KINH-30k_unigram.model",
              "/home/steve/data01/VN-vocab-model/poem_dataset/PRE-TRAINING-200G/VOCAB_COMBINED/CN-KINH-30k_unigram.model",
              "/home/steve/data01/VN-vocab-model/VN-LLama-tokenizer_40k_36m_sp/VN40kHUGE_unigram_36m.model",
              "/home/steve/data01/VN-vocab-model/Ancient-Vocab-4157/Ancient-Vocab-4157.model"]

import sentencepiece as spm
VN_sp_model = spm.SentencePieceProcessor()
for v in all_vocabs:
    VN_sp_model.Load(v)
    vocab = [str(VN_sp_model.decode(i)) for i in range(len(VN_sp_model))]
    tokenizer.add_tokens(vocab)

# Check new length of LLaMA-3 tokenizer
len(tokenizer)
>>> 197453

# Test new tokenizer with Vietnamese
tokenizer.tokenize("Tôi nhớ lắm Bác Hồ kính yêu của đạo phật từ ngày 12/4/2019")
>>> ['Tôi',
 'Ġ',
 'nhớ',
 'Ġ',
 'lắm',
 'Ġ',
 'Bác',
 'Ġ',
 'Hồ',
 'Ġ',
 'kính',
 'Ġ',
 'yêu',
 'Ġ',
 'của',
 'Ġ',
 'đạo',
 'Ġ',
 'phật',
 'Ġ',
 'từ',
 'Ġ',
 'ngày',
 'Ġ',
 '12/4',
 '/2019']

# Test new tokenizer with same English statement
tokenizer.tokenize("My English class will open in June 2024") # Tôi nhớ lắm Bác Hồ kính yêu của đạo Phật
>>> ['My',
 'Ġ',
 'English',
 'Ġ',
 'class',
 'Ġ',
 'will',
 'Ġ',
 'open',
 'Ġ',
 'in',
 'Ġ',
 'June',
 'Ġ',
 '2024']

I can save the tokenizer, but reloading takes forever because the new tokens are not standard tokens but added ones. Also, I am NOT very sure that adding tokens/words from SentencePiece training into LLaMA-3's tiktoken tokenizer is the correct approach either.

Please comment and share hints if any. We need a solid solution from Meta. Steve
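A hedged way to see how many of the merged tokens landed on the added-token (slow) path, using the tokenizer object from the snippet above:

# base vocab vs. added tokens after the merge
print(len(tokenizer))                    # 197453 after adding the four vocab files
print(len(tokenizer.get_added_vocab()))  # the newly added tokens (plus Llama-3's original special tokens)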

ArthurZucker commented 6 months ago

Hi all! This is not an HF bug. For any tokenizer that is in transformers and that you load using AutoTokenizer.from_pretrained, you can add any token using tokenizer.add_tokens(["token1", "token2"]), etc. There is no need for complex logic, and @thusinh1969's proposal works as expected. Reload should not be super slow, however; that might be a bug. One fix could be:

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("my-new-tokenizer")

StephennFernandes commented 6 months ago

Okay, that pretty much solves this. @ArthurZucker, could you please confirm the correct way to check whether a new token from the extension tokenizer already exists in the original Llama tokenizer?

I currently do this:

from tqdm import tqdm

for p in tqdm(chinese_spm.pieces, desc="merging tokenizers"):
    piece = p.piece
    if piece not in llama_tokenizer.vocab.keys():
        llama_tokenizer.add_tokens(piece)
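A hedged alternative sketch, assuming the same llama_tokenizer and chinese_spm objects as above: checking membership against get_vocab() and adding all new pieces in one add_tokens call, which is typically much faster than adding them one at a time.

from tqdm import tqdm

existing = set(llama_tokenizer.get_vocab().keys())
new_pieces = [p.piece for p in tqdm(chinese_spm.pieces, desc="collecting new pieces")
              if p.piece not in existing]
llama_tokenizer.add_tokens(new_pieces)  # one batched call instead of one call per piece
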
StephennFernandes commented 6 months ago

@amitsangani, could you also share the steps for training a tiktoken tokenizer from scratch? Given that you have found better tokenizer efficiency, it would be great to train the extension tokenizer with tiktoken and then extend the Llama tokenizer with it.

VishnuPJ commented 6 months ago

(quoting @thusinh1969's add_tokens walkthrough above)

I did it the way @thusinh1969 suggested. I modified the tokenizer and resized the token embeddings using model.resize_token_embeddings(len(tokenizer)). But when I try to run training I get: "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn".

thusinh1969 commented 6 months ago

Hi all! This is not an HF bug. For any tokenizer that is in transformers and that you load using AutoTokenizer.from_pretrained, you can add any token using tokenizer.add_tokens(["token1", "token2"]), etc. There is no need for complex logic, and @thusinh1969's proposal works as expected. Reload should not be super slow, however; that might be a bug. One fix could be:

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("my-new-tokenizer")

That gives a completely different Tokenizer object. You have to do it like this:

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)

Now you can use tokenizer_new_fast as the tokenizer as usual.
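As a quick hedged usage check (the Hub repo id above is the author's; the local save path here is only an example):

enc = tokenizer_new_fast("Tôi nhớ lắm Bác Hồ kính yêu", add_special_tokens=False)
print(enc["input_ids"])
tokenizer_new_fast.save_pretrained("./llama3-extended-tokenizer-hf")  # example output directory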

StephennFernandes commented 6 months ago

(quoting @thusinh1969's add_tokens walkthrough above)

I did it the way @thusinh1969 suggested. I modified the tokenizer and resized the token embeddings using model.resize_token_embeddings(len(tokenizer)). But when I try to run training I get: "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn".

@thusinh1969, are you getting this issue as well when expanding the token embeddings during continual pre-training?

thusinh1969 commented 6 months ago

(quoting @thusinh1969's add_tokens walkthrough above)

I did it the way @thusinh1969 suggested. I modified the tokenizer and resized the token embeddings using model.resize_token_embeddings(len(tokenizer)). But when I try to run training I get: "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn".

@thusinh1969, are you getting this issue as well when expanding the token embeddings during continual pre-training?

No. That is a different error related to your model setup, probably to do with gradients.

import torch
from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
from tokenizers import Tokenizer

# model_name is the local Llama-3 checkpoint path used earlier in the thread
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16,
                                             device_map='auto',
                                             low_cpu_mem_usage=True)
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)
model.resize_token_embeddings(len(tokenizer_new_fast))

That should do it.
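A hedged follow-up sketch for the "does not require grad" error mentioned above: one common cause (not a confirmed diagnosis) is gradient checkpointing combined with frozen or adapter-wrapped input embeddings, in which case something like the following helps. The calls are standard transformers/PyTorch APIs, but whether they apply depends on the training setup.

model.resize_token_embeddings(len(tokenizer_new_fast))
model.get_input_embeddings().weight.requires_grad_(True)  # make sure the resized embeddings are trainable
model.gradient_checkpointing_enable()
model.enable_input_require_grads()  # commonly needed when checkpointing with frozen input embeddings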

thusinh1969 commented 6 months ago

FYI. In order to further finetune a LLaMA-3 model with this new extended tokenizer while keeping the proper LLaMA-3 chat format, you have to change the ChatFormat class as follows:

from typing import List, Literal, Sequence, TypedDict

from transformers import PreTrainedTokenizerFast

# Message/Dialog mirror the typed dicts in llama/tokenizer.py of the llama3 repo
Role = Literal["system", "user", "assistant"]

class Message(TypedDict):
    role: Role
    content: str

Dialog = Sequence[Message]

class ChatFormat:
    def __init__(self, tokenizer: PreTrainedTokenizerFast):
        self.tokenizer = tokenizer

    def encode_header(self, message: Message) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.added_tokens_encoder["<|start_header_id|>"])
        tokens.extend(self.tokenizer.encode(message["role"], add_special_tokens=False))
        tokens.append(self.tokenizer.added_tokens_encoder["<|end_header_id|>"])
        tokens.extend(self.tokenizer.encode("\n\n", add_special_tokens=False))
        return tokens

    def encode_message(self, message: Message) -> List[int]:
        tokens = self.encode_header(message)
        tokens.extend(
            self.tokenizer.encode(message["content"].strip(), add_special_tokens=False)
        )
        tokens.append(self.tokenizer.added_tokens_encoder["<|eot_id|>"])
        return tokens

    def encode_dialog_prompt(self, dialog: Dialog) -> List[int]:
        tokens = []
        tokens.append(self.tokenizer.added_tokens_encoder["<|begin_of_text|>"])
        for message in dialog:
            tokens.extend(self.encode_message(message))
        # Add the start of an assistant message for the model to complete.
        tokens.extend(self.encode_header({"role": "assistant", "content": ""}))
        return tokens
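
A hedged usage sketch with the extended tokenizer from earlier in the thread (tokenizer_new_fast is the name used above; the dialog content is only an example):

chat_format = ChatFormat(tokenizer_new_fast)
dialog = [{"role": "user", "content": "Xin chào! Bạn khỏe không?"}]
prompt_ids = chat_format.encode_dialog_prompt(dialog)
print(tokenizer_new_fast.decode(prompt_ids))
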
ArthurZucker commented 6 months ago

Regarding efficiency, I'll check as well; the ignore_merges option should improve it anyway.

thusinh1969 commented 6 months ago

Something is WRONG. The decoding of PreTrainedTokenizerFast (which LLaMA-3 uses) produces weird output once you add a token to the vocab using the .add_tokens(word) function.

I used the standard tokenizer from the LLaMA-3 repo and added only ONE word to the original tokenizer, and...:

import tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.add_tokens(tokenizers.AddedToken("Bác"))
tokenizer
>>>PreTrainedTokenizerFast(name_or_path='/home/steve/data02/LLaMA/LLaMA-3/models/llama-3-8b-instruct/', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|end_of_text|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128004: AddedToken("<|reserved_special_token_2|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128005: AddedToken("<|reserved_special_token_3|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128006: AddedToken("<|start_header_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128007: AddedToken("<|end_header_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128008: AddedToken("<|reserved_special_token_4|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128009: AddedToken("<|eot_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128010: AddedToken("<|reserved_special_token_5|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128011: AddedToken("<|reserved_special_token_6|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128012: AddedToken("<|reserved_special_token_7|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128013: AddedToken("<|reserved_special_token_8|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128014: AddedToken("<|reserved_special_token_9|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128015: AddedToken("<|reserved_special_token_10|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128016: AddedToken("<|reserved_special_token_11|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128017: AddedToken("<|reserved_special_token_12|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128018: AddedToken("<|reserved_special_token_13|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128019: AddedToken("<|reserved_special_token_14|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128020: AddedToken("<|reserved_special_token_15|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128021: AddedToken("<|reserved_special_token_16|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128022: AddedToken("<|reserved_special_token_17|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128023: AddedToken("<|reserved_special_token_18|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128024: AddedToken("<|reserved_special_token_19|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128025: AddedToken("<|reserved_special_token_20|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128026: AddedToken("<|reserved_special_token_21|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128027: AddedToken("<|reserved_special_token_22|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128028: AddedToken("<|reserved_special_token_23|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128029: AddedToken("<|reserved_special_token_24|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128030: AddedToken("<|reserved_special_token_25|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128031: AddedToken("<|reserved_special_token_26|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128032: AddedToken("<|reserved_special_token_27|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128033: AddedToken("<|reserved_special_token_28|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128034: AddedToken("<|reserved_special_token_29|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128035: AddedToken("<|reserved_special_token_30|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128036: AddedToken("<|reserved_special_token_31|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128037: AddedToken("<|reserved_special_token_32|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128038: AddedToken("<|reserved_special_token_33|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128039: AddedToken("<|reserved_special_token_34|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128040: AddedToken("<|reserved_special_token_35|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128041: AddedToken("<|reserved_special_token_36|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128042: AddedToken("<|reserved_special_token_37|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128043: AddedToken("<|reserved_special_token_38|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128044: AddedToken("<|reserved_special_token_39|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128045: AddedToken("<|reserved_special_token_40|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128046: AddedToken("<|reserved_special_token_41|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128047: AddedToken("<|reserved_special_token_42|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128048: AddedToken("<|reserved_special_token_43|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128049: AddedToken("<|reserved_special_token_44|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128050: AddedToken("<|reserved_special_token_45|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128051: AddedToken("<|reserved_special_token_46|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128052: AddedToken("<|reserved_special_token_47|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128053: AddedToken("<|reserved_special_token_48|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128054: AddedToken("<|reserved_special_token_49|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128055: AddedToken("<|reserved_special_token_50|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128056: AddedToken("<|reserved_special_token_51|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128057: AddedToken("<|reserved_special_token_52|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128058: AddedToken("<|reserved_special_token_53|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128059: AddedToken("<|reserved_special_token_54|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128060: AddedToken("<|reserved_special_token_55|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128061: AddedToken("<|reserved_special_token_56|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128062: AddedToken("<|reserved_special_token_57|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128063: AddedToken("<|reserved_special_token_58|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128064: AddedToken("<|reserved_special_token_59|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128065: AddedToken("<|reserved_special_token_60|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128066: AddedToken("<|reserved_special_token_61|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128067: AddedToken("<|reserved_special_token_62|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128068: AddedToken("<|reserved_special_token_63|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128069: AddedToken("<|reserved_special_token_64|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128070: AddedToken("<|reserved_special_token_65|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128071: AddedToken("<|reserved_special_token_66|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128072: AddedToken("<|reserved_special_token_67|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128073: AddedToken("<|reserved_special_token_68|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128074: AddedToken("<|reserved_special_token_69|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128075: AddedToken("<|reserved_special_token_70|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128076: AddedToken("<|reserved_special_token_71|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128077: AddedToken("<|reserved_special_token_72|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128078: AddedToken("<|reserved_special_token_73|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128079: AddedToken("<|reserved_special_token_74|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128080: AddedToken("<|reserved_special_token_75|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128081: AddedToken("<|reserved_special_token_76|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128082: AddedToken("<|reserved_special_token_77|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128083: AddedToken("<|reserved_special_token_78|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128084: AddedToken("<|reserved_special_token_79|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128085: AddedToken("<|reserved_special_token_80|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128086: AddedToken("<|reserved_special_token_81|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128087: AddedToken("<|reserved_special_token_82|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128088: AddedToken("<|reserved_special_token_83|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128089: AddedToken("<|reserved_special_token_84|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128090: AddedToken("<|reserved_special_token_85|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128091: AddedToken("<|reserved_special_token_86|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128092: AddedToken("<|reserved_special_token_87|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128093: AddedToken("<|reserved_special_token_88|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128094: AddedToken("<|reserved_special_token_89|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128095: AddedToken("<|reserved_special_token_90|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128096: AddedToken("<|reserved_special_token_91|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128097: AddedToken("<|reserved_special_token_92|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128098: AddedToken("<|reserved_special_token_93|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128099: AddedToken("<|reserved_special_token_94|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128100: AddedToken("<|reserved_special_token_95|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128101: AddedToken("<|reserved_special_token_96|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128102: AddedToken("<|reserved_special_token_97|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128103: AddedToken("<|reserved_special_token_98|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128104: AddedToken("<|reserved_special_token_99|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128105: AddedToken("<|reserved_special_token_100|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128106: AddedToken("<|reserved_special_token_101|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128107: AddedToken("<|reserved_special_token_102|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128108: AddedToken("<|reserved_special_token_103|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128109: AddedToken("<|reserved_special_token_104|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128110: AddedToken("<|reserved_special_token_105|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128111: AddedToken("<|reserved_special_token_106|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128112: AddedToken("<|reserved_special_token_107|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128113: AddedToken("<|reserved_special_token_108|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128114: AddedToken("<|reserved_special_token_109|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128115: AddedToken("<|reserved_special_token_110|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128116: AddedToken("<|reserved_special_token_111|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128117: AddedToken("<|reserved_special_token_112|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128118: AddedToken("<|reserved_special_token_113|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128119: AddedToken("<|reserved_special_token_114|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128120: AddedToken("<|reserved_special_token_115|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128121: AddedToken("<|reserved_special_token_116|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128122: AddedToken("<|reserved_special_token_117|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128123: AddedToken("<|reserved_special_token_118|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128124: AddedToken("<|reserved_special_token_119|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128125: AddedToken("<|reserved_special_token_120|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128126: AddedToken("<|reserved_special_token_121|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128127: AddedToken("<|reserved_special_token_122|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128128: AddedToken("<|reserved_special_token_123|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128129: AddedToken("<|reserved_special_token_124|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128130: AddedToken("<|reserved_special_token_125|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128131: AddedToken("<|reserved_special_token_126|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128132: AddedToken("<|reserved_special_token_127|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128133: AddedToken("<|reserved_special_token_128|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128134: AddedToken("<|reserved_special_token_129|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128135: AddedToken("<|reserved_special_token_130|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128136: AddedToken("<|reserved_special_token_131|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128137: AddedToken("<|reserved_special_token_132|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128138: AddedToken("<|reserved_special_token_133|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128139: AddedToken("<|reserved_special_token_134|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128140: AddedToken("<|reserved_special_token_135|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128141: AddedToken("<|reserved_special_token_136|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128142: AddedToken("<|reserved_special_token_137|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128143: AddedToken("<|reserved_special_token_138|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128144: AddedToken("<|reserved_special_token_139|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128145: AddedToken("<|reserved_special_token_140|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128146: AddedToken("<|reserved_special_token_141|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128147: AddedToken("<|reserved_special_token_142|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128148: AddedToken("<|reserved_special_token_143|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128149: AddedToken("<|reserved_special_token_144|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128150: AddedToken("<|reserved_special_token_145|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128151: AddedToken("<|reserved_special_token_146|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128152: AddedToken("<|reserved_special_token_147|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128153: AddedToken("<|reserved_special_token_148|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128154: AddedToken("<|reserved_special_token_149|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128155: AddedToken("<|reserved_special_token_150|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128156: AddedToken("<|reserved_special_token_151|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128157: AddedToken("<|reserved_special_token_152|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128158: AddedToken("<|reserved_special_token_153|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128159: AddedToken("<|reserved_special_token_154|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128160: AddedToken("<|reserved_special_token_155|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128161: AddedToken("<|reserved_special_token_156|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128162: AddedToken("<|reserved_special_token_157|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128163: AddedToken("<|reserved_special_token_158|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128164: AddedToken("<|reserved_special_token_159|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128165: AddedToken("<|reserved_special_token_160|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128166: AddedToken("<|reserved_special_token_161|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128167: AddedToken("<|reserved_special_token_162|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128168: AddedToken("<|reserved_special_token_163|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128169: AddedToken("<|reserved_special_token_164|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128170: AddedToken("<|reserved_special_token_165|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128171: AddedToken("<|reserved_special_token_166|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128172: AddedToken("<|reserved_special_token_167|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128173: AddedToken("<|reserved_special_token_168|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128174: AddedToken("<|reserved_special_token_169|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128175: AddedToken("<|reserved_special_token_170|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128176: AddedToken("<|reserved_special_token_171|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128177: AddedToken("<|reserved_special_token_172|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128178: AddedToken("<|reserved_special_token_173|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128179: AddedToken("<|reserved_special_token_174|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128180: AddedToken("<|reserved_special_token_175|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128181: AddedToken("<|reserved_special_token_176|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128182: AddedToken("<|reserved_special_token_177|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128183: AddedToken("<|reserved_special_token_178|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128184: AddedToken("<|reserved_special_token_179|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128185: AddedToken("<|reserved_special_token_180|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128186: AddedToken("<|reserved_special_token_181|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128187: AddedToken("<|reserved_special_token_182|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128188: AddedToken("<|reserved_special_token_183|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128189: AddedToken("<|reserved_special_token_184|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128190: AddedToken("<|reserved_special_token_185|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128191: AddedToken("<|reserved_special_token_186|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128192: AddedToken("<|reserved_special_token_187|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128193: AddedToken("<|reserved_special_token_188|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128194: AddedToken("<|reserved_special_token_189|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128195: AddedToken("<|reserved_special_token_190|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128196: AddedToken("<|reserved_special_token_191|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128197: AddedToken("<|reserved_special_token_192|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128198: AddedToken("<|reserved_special_token_193|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128199: AddedToken("<|reserved_special_token_194|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128200: AddedToken("<|reserved_special_token_195|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128201: AddedToken("<|reserved_special_token_196|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128202: AddedToken("<|reserved_special_token_197|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128203: AddedToken("<|reserved_special_token_198|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128204: AddedToken("<|reserved_special_token_199|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128205: AddedToken("<|reserved_special_token_200|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128206: AddedToken("<|reserved_special_token_201|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128207: AddedToken("<|reserved_special_token_202|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128208: AddedToken("<|reserved_special_token_203|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128209: AddedToken("<|reserved_special_token_204|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128210: AddedToken("<|reserved_special_token_205|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128211: AddedToken("<|reserved_special_token_206|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128212: AddedToken("<|reserved_special_token_207|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128213: AddedToken("<|reserved_special_token_208|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128214: AddedToken("<|reserved_special_token_209|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128215: AddedToken("<|reserved_special_token_210|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128216: AddedToken("<|reserved_special_token_211|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128217: AddedToken("<|reserved_special_token_212|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128218: AddedToken("<|reserved_special_token_213|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128219: AddedToken("<|reserved_special_token_214|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128220: AddedToken("<|reserved_special_token_215|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128221: AddedToken("<|reserved_special_token_216|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128222: AddedToken("<|reserved_special_token_217|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128223: AddedToken("<|reserved_special_token_218|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128224: AddedToken("<|reserved_special_token_219|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128225: AddedToken("<|reserved_special_token_220|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128226: AddedToken("<|reserved_special_token_221|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128227: AddedToken("<|reserved_special_token_222|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128228: AddedToken("<|reserved_special_token_223|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128229: AddedToken("<|reserved_special_token_224|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128230: AddedToken("<|reserved_special_token_225|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128231: AddedToken("<|reserved_special_token_226|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128232: AddedToken("<|reserved_special_token_227|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128233: AddedToken("<|reserved_special_token_228|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128234: AddedToken("<|reserved_special_token_229|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128235: AddedToken("<|reserved_special_token_230|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128236: AddedToken("<|reserved_special_token_231|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128237: AddedToken("<|reserved_special_token_232|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128238: AddedToken("<|reserved_special_token_233|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128239: AddedToken("<|reserved_special_token_234|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128240: AddedToken("<|reserved_special_token_235|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128241: AddedToken("<|reserved_special_token_236|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128242: AddedToken("<|reserved_special_token_237|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128243: AddedToken("<|reserved_special_token_238|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128244: AddedToken("<|reserved_special_token_239|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128245: AddedToken("<|reserved_special_token_240|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128246: AddedToken("<|reserved_special_token_241|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128247: AddedToken("<|reserved_special_token_242|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128248: AddedToken("<|reserved_special_token_243|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128249: AddedToken("<|reserved_special_token_244|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128250: AddedToken("<|reserved_special_token_245|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128251: AddedToken("<|reserved_special_token_246|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128252: AddedToken("<|reserved_special_token_247|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128253: AddedToken("<|reserved_special_token_248|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128254: AddedToken("<|reserved_special_token_249|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128255: AddedToken("<|reserved_special_token_250|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    128256: AddedToken("Bác", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
}

tokenizer.decode(tokenizer.encode("Bác"))
>>>B�c

It does NOT use the newly added token at all?! Why? Any help please; something must be missing. Steve

VishnuPJ commented 6 months ago

When adding a new token with tokenizer.add_tokens(['ininin']) and resizing with model.resize_token_embeddings(len(tokenizer)), I am getting the error "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn". But when doing tokenizer.add_tokens(['inin']) there is no error.

Why is that? @ArthurZucker @thusinh1969

StephennFernandes commented 6 months ago

@VishnuPJ are you saving the tokenizer and then expanding the token embeddings after loading the tokenizer freshly?

I don't understand your error clearly; can you elaborate more?

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)

Try doing this, then save the fast tokenizer, load it freshly as usual, and try to expand the token embeddings.

VishnuPJ commented 6 months ago

@VishnuPJ are you saving the tokenizer and then expanding the token embeddings after loading the tokenizer freshly?

I don't understand your error clearly; can you elaborate more?

from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast
tokenizer_new = Tokenizer.from_pretrained("thusinh1969/llama-3-VN-CN-Ancient-tokenizer")
tokenizer_new_fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer_new)

Try doing this, then save the fast tokenizer, load it freshly as usual, and try to expand the token embeddings.

Sorry for the confusion. I was able to add the tokens and the tokenizer works as expected. But while running trainer.train() I am getting the above error.

StephennFernandes commented 6 months ago

@VishnuPJ OK, seems like a trainer issue.

@thusinh1969 can you check what this issue could actually be?

I'd recommend cross-checking your code with Chinese-LLaMA-Alpaca-2 in case you haven't already.

Besides this, I feel only @ArthurZucker and/or @osanseviero could help us out here.

ArthurZucker commented 6 months ago

Regarding the new added token, the "issue" is that you need to make sure you add the correct representation of the string:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("BÃ¡c", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'
ArthurZucker commented 6 months ago

Since the strings are pre-tokenized into their byte-level representation (it's not a normalization), you need to add the token using pre_tokenizers.ByteLevel(False,False).pre_tokenize_str.

StephennFernandes commented 6 months ago

Thanks a lot @ArthurZucker 😊

it really means a ton !!

thusinh1969 commented 6 months ago

Regarding the new added token, the "issue" is that you need to make sure you add the correct representation of the string:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("BÃ¡c", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'

That does not help. It creates 3 tokens for the single word "Bác", which is exactly what we want to avoid. It should be only 1 token.

tokenizer.encode("Bác", add_special_tokens=False)
>>>[33, 1995, 66]

This is very inefficient. Steve

ArthurZucker commented 6 months ago

Mmm, no, then it's not added properly. Let me try again; sorry, I forgot to check the ids.

ArthurZucker commented 6 months ago

Ok:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.encode("Bác")
128256 # a new token

this is alright, the only issue is the decoding. Let me find a fix and if needed update tokenizers to support this

VishnuPJ commented 6 months ago

@VishnuPJ OK, seems like a trainer issue.

@thusinh1969 can you check what this issue could actually be?

I'd recommend cross-checking your code with Chinese-LLaMA-Alpaca-2 in case you haven't already.

Besides this, I feel only @ArthurZucker and/or @osanseviero could help us out here.

This issue is resolved. We need to add the lines below before calling get_peft_model(model, lora_config).

tokenizer.add_tokens(["NEW_TOKEN", "NEW_TOKEN_2"])  
model.resize_token_embeddings(len(tokenizer))  

Previously I added those lines after get_peft_model(), which somehow messes up the model and tokenizer, I guess.
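A minimal ordering sketch of the fix described above (the model name, new tokens, and LoRA settings are placeholders, not a verified recipe):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 1) extend the vocab and resize the embeddings on the *base* model first
tokenizer.add_tokens(["NEW_TOKEN", "NEW_TOKEN_2"])
model.resize_token_embeddings(len(tokenizer))

# 2) only then wrap the model with PEFT/LoRA
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)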

StephennFernandes commented 6 months ago

Ok:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.encode("Bác")
128256 # a new token

this is alright, the only issue is the decoding. Let me find a fix and if needed update tokenizers to support this

@ArthurZucker so just for clarification, does the decoder produce a char/byte-based representation while decoding?

ArthurZucker commented 6 months ago

Yep. Overall, the token that was added is Bác; it gets encoded to the new id, but the ByteLevel decoder then tries to decode the string Bác as if it were a byte-level representation (mapping each character back to a byte), and thus fails.

ArthurZucker commented 6 months ago

I think the easiest solution is to simply make sure the ByteLevel decoder does not process the added tokens.
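Until that lands, a stopgap sketch (a workaround idea, not the tokenizers fix itself): decode the regular ids normally and splice the added-token strings back in by hand, so the ByteLevel decoder never sees them.

def safe_decode(tokenizer, ids):
    """Decode ids, emitting added tokens verbatim instead of byte-decoding them."""
    added = {v: k for k, v in tokenizer.get_added_vocab().items()}  # id -> token string
    out, buf = [], []
    for i in ids:
        if i in added:
            if buf:
                out.append(tokenizer.decode(buf))
                buf = []
            out.append(added[i])
        else:
            buf.append(i)
    if buf:
        out.append(tokenizer.decode(buf))
    return "".join(out)

# e.g. safe_decode(tokenizer, tokenizer.encode("Bác")) should give '<|begin_of_text|>Bác'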

hpsun1109 commented 6 months ago

Regarding the new added token, the "issue" is that you need to make sure you add the correct representation of the string:

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('BÃ¡c', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("BÃ¡c", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác'

@ArthurZucker How do I output the bos_token? It doesn't work when I set tokenizer.add_bos_token = True. Thanks

ArthurZucker commented 6 months ago

https://github.com/huggingface/tokenizers/pull/1513 will fix the issue for the new tokens

thusinh1969 commented 6 months ago

huggingface/tokenizers#1513 will fix the issue for the new tokens

Wonderful. When will it be merged, and into which repo, so we can come back and test?

Cheers, Steve

dengxiaotian123 commented 6 months ago

@ArthurZucker I am confused about the tokenizer used for tiktoken training. What encoding is used for the corpus (such as cl100k_base or p50k_base) when training the tokenizer? And what is the encoding of these characters? For example, ['åIJ¦', 'ãĢĤ']

['Tôi', 'ĠnhỼ', 'Ġl', 'ắm', 'ĠB', 'ác', 'ĠHá»ĵ', 'ĠkÃŃnh', 'Ġyêu', 'Ġcủa', 'ĠÄijạo', 'Ġph', 'áºŃt']

When I input Chinese characters, the output looks like this:

word = "否。"
print('word', word)
print(tokenizer.tokenize(word))
print(tokenizer(word).input_ids)
print('decode: ', tokenizer.decode(tokenizer(word).input_ids))
word 否。
['åIJ¦', 'ãĢĤ']  
[33476, 1811]
ArthurZucker commented 6 months ago

It is a unicode representation of the bytes! https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py#L1478 should give you an idea of how we obtain the tokens! What you do is take word = "否。" and encode it to bytes:

>>> word = "否。"
>>> bword = b'\xe5\x90\xa6'
>>> decoded = b'\xe5\x90\xa6'.decode("latin-1")
>>> [ord(char) for char in decoded.decode("latin-1")] 
[229, 144, 166]

then you fetch the unicode representation of the bytes (which are supposed to come from utf-8):

{33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 174: '®', 175: '¯', 176: '°', 177: '±', 178: '²', 179: '³', 180: '´', 181: 'µ', 182: '¶', 183: '·', 184: '¸', 185: '¹', 186: 'º', 187: '»', 188: '¼', 189: '½', 190: '¾', 191: '¿', 192: 'À', 193: 'Á', 194: 'Â', 195: 'Ã', 196: 'Ä', 197: 'Å', 198: 'Æ', 199: 'Ç', 200: 'È', 201: 'É', 202: 'Ê', 203: 'Ë', 204: 'Ì', 205: 'Í', 206: 'Î', 207: 'Ï', 208: 'Ð', 209: 'Ñ', 210: 'Ò', 211: 'Ó', 212: 'Ô', 213: 'Õ', 214: 'Ö', 215: '×', 216: 'Ø', 217: 'Ù', 218: 'Ú', 219: 'Û', 220: 'Ü', 221: 'Ý', 222: 'Þ', 223: 'ß', 224: 'à', 225: 'á', 226: 'â', 227: 'ã', 228: 'ä', 229: 'å', 230: 'æ', 231: 'ç', 232: 'è', 233: 'é', 234: 'ê', 235: 'ë', 236: 'ì', 237: 'í', 238: 'î', 239: 'ï', 240: 'ð', 241: 'ñ', 242: 'ò', 243: 'ó', 244: 'ô', 245: 'õ', 246: 'ö', 247: '÷', 248: 'ø', 249: 'ù', 250: 'ú', 251: 'û', 252: 'ü', 253: 'ý', 254: 'þ', 255: 'ÿ', 0: 'Ā', 1: 'ā', 2: 'Ă', 3: 'ă', 4: 'Ą', 5: 'ą', 6: 'Ć', 7: 'ć', 8: 'Ĉ', 9: 'ĉ', 10: 'Ċ', 11: 'ċ', 12: 'Č', 13: 'č', 14: 'Ď', 15: 'ď', 16: 'Đ', 17: 'đ', 18: 'Ē', 19: 'ē', 20: 'Ĕ', 21: 'ĕ', 22: 'Ė', 23: 'ė', 24: 'Ę', 25: 'ę', 26: 'Ě', 27: 'ě', 28: 'Ĝ', 29: 'ĝ', 30: 'Ğ', 31: 'ğ', 32: 'Ġ', 127: 'ġ', 128: 'Ģ', 129: 'ģ', 130: 'Ĥ', 131: 'ĥ', 132: 'Ħ', 133: 'ħ', 134: 'Ĩ', 135: 'ĩ', 136: 'Ī', 137: 'ī', 138: 'Ĭ', 139: 'ĭ', 140: 'Į', 141: 'į', 142: 'İ', 143: 'ı', 144: 'IJ', 145: 'ij', 146: 'Ĵ', 147: 'ĵ', 148: 'Ķ', 149: 'ķ', 150: 'ĸ', 151: 'Ĺ', 152: 'ĺ', 153: 'Ļ', 154: 'ļ', 155: 'Ľ', 156: 'ľ', 157: 'Ŀ', 158: 'ŀ', 159: 'Ł', 160: 'ł', 173: 'Ń'}

this basically allows you to represent any byte array as unicode characters, simplifying the tokenization process. The idea is to show 'åIJ¦' as a token instead of showing b'\xe5\x90\xa6'.

ArthurZucker commented 6 months ago

(\xe5 gives 229: 'å', \x90 gives 144: 'IJ', etc.)
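To make the mapping above concrete, here is a minimal re-implementation sketch of a GPT-2-style bytes-to-unicode table (an illustration of the idea, not the exact transformers helper):

def bytes_to_unicode():
    # printable bytes keep their own codepoint; the remaining bytes are shifted above 255
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_map = bytes_to_unicode()
word = "否。"
print("".join(byte_map[b] for b in word.encode("utf-8")))  # -> åIJ¦ãĢĤ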

thusinh1969 commented 6 months ago

Gents and @ArthurZucker, is the decoder fix already merged somewhere?

Thanks, Steve

ArthurZucker commented 6 months ago

https://github.com/huggingface/tokenizers/pull/1513 can be used; I'm going to merge it today and prepare the update for the transformers + tokenizers release.

StephennFernandes commented 6 months ago

@amitsangani @ArthurZucker

how do I train a tiktoken tokenizer from scratch? I see that even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train one.

All help would be greatly appreciated.

thusinh1969 commented 6 months ago

@amitsangani @ArthurZucker

how do I train a tiktoken tokenizer from scratch? I see that even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train one.

All help would be greatly appreciated.

Train a SentencePiece model and merge it; see the code above (and the sketch below). But its decoder is buggy, hence we have to wait for the change to be merged into HF's tokenizers package.
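For reference, a minimal sketch of the "train an extra SentencePiece model" step (corpus path, vocab size, and other settings are placeholders):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="vietnamese_corpus.txt",   # placeholder: plain-text corpus, one sentence per line
    model_prefix="vi_extra",         # writes vi_extra.model / vi_extra.vocab
    vocab_size=20000,
    model_type="bpe",
    character_coverage=0.9995,
)
# vi_extra.model can then be merged into the base tokenizer as described earlier in this thread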

@ArthurZucker when should we expect the change to be part of the official tokenizers package?

Thanks, Steve

StephennFernandes commented 6 months ago

@amitsangani @ArthurZucker

how do I train a tiktoken tokenizer from scratch? I see that even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train one.

All help would be greatly appreciated.

Train a SentencePiece model and merge it; see the code above. But its decoder is buggy, hence we have to wait for the change to be merged into HF's tokenizers package.

@ArthurZucker when should we expect the change to be part of the official tokenizers package?

Thanks, Steve

I know that we could train an SPM model and merge it, but that's not the point; whether there is a way to train tiktoken from scratch was my actual query.

As I see it, even other orgs use their own custom-trained versions of tiktoken, as the Phi-3 model did.

thusinh1969 commented 5 months ago

Gents,

I installed tokenizers from source (tokenizers-0.19.1.dev0) from main branch. It is now working.

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác")), tokenizer.encode("Bác")
('Bác', [128256])

I am closing the issue; we can now extend the vocab and continue pretraining LLaMA-3 further.

Thanks @ArthurZucker et al., Steve

ArthurZucker commented 5 months ago

🤗 Glad I was of help! @StephennFernandes I don't know tiktoken; I can help you train from scratch using tokenizers, but otherwise it's outside my domain of knowledge!
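For completeness, a minimal sketch of what "train from scratch using tokenizers" could look like (corpus file, vocab size, and special tokens are placeholders, not Meta's actual recipe):

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # placeholder size
    special_tokens=["<|begin_of_text|>", "<|end_of_text|>"],
)
tokenizer.train(["my_corpus.txt"], trainer)  # placeholder corpus file
tokenizer.save("my_tokenizer.json")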

woohwan commented 5 months ago

Gents,

I installed tokenizers from source (tokenizers-0.19.1.dev0) from main branch. It is now working.

>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác")), tokenizer.encode("Bác")
('Bác', [128256])

I am closing the issue; we can now extend the vocab and continue pretraining LLaMA-3 further.

Thanks @ArthurZucker et al., Steve

I'm a newbie in the LLM field.

I want to extend the Llama 3 tokenizer with a Korean corpus. Can you tell me what to modify when following https://huggingface.co/learn/nlp-course/chapter6/2? The tokenization result does not change when I do the same.

Anyone, please help. Thanks.

Yuhuajoe commented 5 months ago
>>> word = "否。"
>>> bword = b'\xe5\x90\xa6'
>>> decoded = b'\xe5\x90\xa6'.decode("latin-1")
>>> [ord(char) for char in decoded.decode("latin-1")] 
[229, 144, 166]

@ArthurZucker when I run the code above, I get the error below:

          2 bword = b'\xe5\x90\xa6'
          3 decoded = b'\xe5\x90\xa6'.decode("latin-1")
----> 4 [ord(char) for char in decoded.decode("latin-1")] 

AttributeError: 'str' object has no attribute 'decode'
ArthurZucker commented 4 months ago

Hey! Decoded is already a string, you probably wanted to do [ord(char) for char in decoded] 😉
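In other words, the earlier snippet works once the last line iterates over the already-decoded string:

>>> bword = b'\xe5\x90\xa6'            # UTF-8 bytes of "否"
>>> decoded = bword.decode("latin-1")  # one character per byte
>>> [ord(char) for char in decoded]
[229, 144, 166]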

StephennFernandes commented 4 months ago

@amitsangani hey Amit, could you please tell us how to pretrain the tokenizer from scratch using tiktoken, like you did for training Llama 3?