daehuikim opened 11 months ago
Hi @daehuikim we've just released all the training code for Zephyr, so it should now be possible to reproduce our models with a few lines of code :)
@lewtun Thanks for giving such a good example! I am still curious how to adapt my own dataset that is not on the Hub. For example, when I try to train with a config like the one below,
```yaml
dataset_mixer:
  dataset_1: 0.5    # Use 50% of the training examples
  dataset_2: 0.66   # Use 66% of the training examples
  dataset_3: 0.10   # Use 10% of the training examples
  localdataset_name: 0.001  # how can it be done?
dataset_splits:
- train_xxx  # The training splits to mix
- test_xxx   # The test splits to mix
```
```
FileNotFoundError: Couldn't find a dataset script at localdataset_name.py or any data file in the same directory.
```
I got an error like this. I would like to use a dataset that exists on my local disk with these models. I also checked utils.py in the alignment package, but I have no idea how to modify it. If you have any ideas or updates, I would be glad to hear them. :)
Hi @daehuikim. We did not consider this use case. Are you unable to push the dataset to the Hub, even as a private dataset?
Otherwise you would need to use Dataset.from_dict or similar to load your dataset as a custom dataset, or provide a custom dataset script file.
Hello, I am very impressed by your models. I tried fine-tuning them with my own data, but the evaluation loss is not optimized, as shown in the image above. In particular, the blue line is the llama-13b model, and you can see that the zephyr models perform worse than the llama models when fine-tuning, even though their MT-Bench performance is much better. Do you have any idea why this is? The script used in my job is based on the basic SFTTrainer example in the trl library: https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py
Thank you!