Closed: asphytheghoul closed this issue 7 months ago
Hey, I have no idea about that. It depends on your data, your hardware, and your installation. I'd just recommend making sure you are leveraging parallelism!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hi, I was just trying to train a new tokenizer from the original Llama-2 tokenizer. My dataset is a ~1.2 GB text file with around 5 million samples. I pre-processed the text separately and wrote it to that file, which I load with the dataloader and pass to the function as suggested by @ArthurZucker in #1345. I wanted to understand how long this process should take. This is the code I am using:
transformers: 4.36.2
tokenizers: 0.15.0
python: 3.9.17