huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

tokenizer.train_new_from_iterator() takes time #1434

Closed asphytheghoul closed 7 months ago

asphytheghoul commented 8 months ago

Hi, I was just trying to train a new tokenizer from the original llama-2 tokenizer. My dataset has around 5 million samples (a 1.2 GB txt file). I pre-processed the text separately, wrote it to the file, loaded that with a DataLoader, and passed it to the function as suggested by @ArthurZucker in #1345. I wanted to understand how long this process should take. This is the code I am using:

import torch
from transformers import AutoTokenizer

model = "meta-llama/Llama-2-7b-hf"
old_tokenizer = AutoTokenizer.from_pretrained(model)

# processed_dataset comes from an earlier preprocessing step (not shown here)
with open("./baarat-hi-sml.txt", "w", encoding="utf-8") as f_hi:
    for it in processed_dataset:
        f_hi.write(it["tgt"] + "\n")

file_path = "./baarat-hi-sml.txt"
print("Data written successfully")
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, file_path, batch_size):
        self.batch_size = batch_size
        self.lines = []
        # Read the whole file into memory, one stripped line per entry
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                self.lines.append(line.strip())

    def __len__(self):
        # Number of *batches*, not lines. Returning len(self.lines) here
        # makes the DataLoader yield an overlapping window per line, so
        # every line gets fed to the trainer up to batch_size times.
        return (len(self.lines) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.lines[start:start + self.batch_size]

# Create the dataset over the full file; each item is one batch of lines
dataset = TextDataset(file_path, batch_size=1024)

# batch_size=None because the dataset already returns batches
dataloader = torch.utils.data.DataLoader(dataset, batch_size=None)
tokenizer = old_tokenizer.train_new_from_iterator(
    dataloader, vocab_size=52000, new_special_tokens=["<pad>", "[INST]", "[/INST]"]
)
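
For comparison, this is the plain generator pattern the tokenizers docs use instead of a Dataset/DataLoader; it yields non-overlapping batches by construction (the vocab size and special tokens below are just carried over from my call above):

def batch_iterator(lines, batch_size=1024):
    # Yield successive, non-overlapping batches of raw text lines
    for i in range(0, len(lines), batch_size):
        yield lines[i:i + batch_size]

tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(dataset.lines),
    vocab_size=52000,
    length=len(dataset.lines),  # total number of texts, used for progress tracking
    new_special_tokens=["<pad>", "[INST]", "[/INST]"],
)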


transformers 4.36.2, tokenizers 0.15.0, Python 3.9.17

ArthurZucker commented 8 months ago

Hey, I have no idea about that. It depends on your data, your hardware, and your installation. I would just recommend making sure you are leveraging parallelism!
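
As a quick sanity check: the Rust-backed trainer parallelizes across cores on its own, but it backs off when the TOKENIZERS_PARALLELISM environment variable is set to "false" (the library may set this itself after detecting a fork). A minimal check before training:

import os

# tokenizers skips its thread pool when this is "false"; make sure it
# is unset or "true" before calling train_new_from_iterator
print(os.environ.get("TOKENIZERS_PARALLELISM", "<unset>"))
os.environ["TOKENIZERS_PARALLELISM"] = "true"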

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.