sambanova / generative_data_prep

Apache License 2.0

Create FAQ section #85

Open snova-zoltanc opened 7 months ago

snova-zoltanc commented 7 months ago

Create an FAQ section in the README that answers common questions and edge cases:

  1. How to access the Llama tokenizer. Maybe recommend the un-gated version, daryl149/llama-2-7b-chat-hf?
  2. Make sure that the number of sequences per dataset file is >= the batch size used during training.
  3. Make sure that the number of files is >= the number of data parallel workers.
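Items 2 and 3 amount to two simple inequalities, so an FAQ entry could show them as a small pre-flight check. The sketch below is only an illustration: `check_dataset_layout` and its parameters are hypothetical names, not part of generative_data_prep, and it assumes you already know how many sequences each output file holds.

```python
from typing import Dict, List


def check_dataset_layout(
    sequences_per_file: Dict[str, int],
    batch_size: int,
    num_data_parallel_workers: int,
) -> List[str]:
    """Return a list of problems found; an empty list means both checks pass.

    Hypothetical helper, not part of generative_data_prep.
    """
    problems: List[str] = []

    # Item 3: there must be at least one file per data parallel worker.
    num_files = len(sequences_per_file)
    if num_files < num_data_parallel_workers:
        problems.append(
            f"{num_files} file(s) < {num_data_parallel_workers} data parallel workers"
        )

    # Item 2: every file must hold at least one full batch of sequences.
    for name, count in sorted(sequences_per_file.items()):
        if count < batch_size:
            problems.append(
                f"{name}: {count} sequence(s) < batch size {batch_size}"
            )

    return problems


# Illustrative numbers only: two files, one of them too small for a batch of 8.
example_counts = {"train_1_of_2.hdf5": 16, "train_2_of_2.hdf5": 4}
example_problems = check_dataset_layout(
    example_counts, batch_size=8, num_data_parallel_workers=4
)
for problem in example_problems:
    print(problem)
```

With these example numbers the check reports two problems: too few files for four workers, and one file with fewer sequences than the batch size.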

Please feel free to add more common issues / FAQs.

snova-zoltanc commented 5 months ago

@snova-connorm Not sure whether we still want to do this now that the documentation has been improved. Maybe it could cover the most common questions and errors: gated HuggingFace tokenizers, how to load a tokenizer, which packing config to pick, and common errors such as JSON parse failures and how to fix them.

Let me know if you think this is worth it.
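For the JSON parse errors mentioned above, an FAQ entry could include a quick way to find the offending lines before running the pipeline. This is a minimal sketch, assuming the input is in JSON Lines form (one JSON object per line). `find_bad_json_lines` is a hypothetical helper, and the `prompt` / `completion` keys in the sample data are only illustrative.

```python
import json
from typing import Iterable, List, Tuple


def find_bad_json_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, error_message) for every line that is not valid JSON.

    Hypothetical helper; blank lines are skipped rather than reported.
    """
    bad: List[Tuple[int, str]] = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((line_number, err.msg))
    return bad


# Illustrative input: line 2 uses single quotes, line 4 is missing a closing brace.
sample_lines = [
    '{"prompt": "Hello", "completion": "World"}',
    "{'prompt': 'Hello', 'completion': 'World'}",
    "",
    '{"prompt": "Hello", "completion": "World"',
]
bad_lines = find_bad_json_lines(sample_lines)
for line_number, message in bad_lines:
    print(f"line {line_number}: {message}")
```

To check a real file, pass an open file handle instead of the list, for example `find_bad_json_lines(open(path, encoding="utf-8"))`. Typical fixes are replacing single quotes with double quotes and escaping embedded newlines or quotes.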