This is the data loading part of the example.
Right now, the easiest way to change it is the following:
1. Make a file mydata.py.
2. Make a class MyData which holds your data in whatever format. As long as it has __getitem__ (which returns a string sample) and __len__ (which returns the length of the dataset), it should work. See wikisql as an example.
3. Write a load function in mydata.py which returns the training, validation, and test sets, as in the wikisql example.
4. Change the import in lora.py to mydata.
Sorry it's a bit cumbersome right now, but it's conceptually pretty simple.
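Something like the following sketch, assuming a JSON-lines file with a "text" field (the file names, field name, and exact load() signature here are placeholders and should be adapted to what lora.py expects):

```python
# mydata.py -- minimal sketch of a custom dataset (names and file format are placeholders)
import json


class MyData:
    """Holds the data; only __getitem__ and __len__ are required."""

    def __init__(self, path):
        # Assumes one JSON object per line with a "text" field; adjust to your format.
        with open(path) as f:
            self._data = [json.loads(line)["text"] for line in f]

    def __getitem__(self, idx):
        # Must return a single string sample.
        return self._data[idx]

    def __len__(self):
        # Must return the number of samples in the dataset.
        return len(self._data)


def load():
    # Return the training, validation, and test sets, mirroring the wikisql load().
    return MyData("train.jsonl"), MyData("valid.jsonl"), MyData("test.jsonl")
```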
We can write a json loader or a text file loader for custom datasets if that would be helpful.
That would be very helpful! I would like to help if possible
Happy to take a PR for this if you would like to implement it!
Playing around with it now. Looks like the changes might be minimal if we wanted to use HF datasets. Can we use streaming datasets?
One thing that would be REALLY helpful from a user's point of view would be the ability to specify the path to a jsonl, csv, or json file for the training dataset, similar to how we do it in PyTorch or on LLM platforms such as OpenAI, for example.
Having a json format for example {"prompt":
I think it is easier that way than trying to implement a complicated dataset that extracts SQL queries.
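As a rough illustration, a loader for that kind of prompt/completion file could look like this (purely hypothetical; the field names, path, and function are not part of the current lora.py):

```python
import json


def load_prompt_completion(path):
    # Each line is assumed to be a JSON object like {"prompt": "...", "completion": "..."}.
    samples = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Join prompt and completion into a single training string.
            samples.append(record["prompt"] + record["completion"])
    return samples
```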
I just trained a 7B model on my workstation over my lunch break without having to rent the capability by the hour, which is very cool. A more generalized loader function for using alternative datasets would be extremely appreciated for further testing. I'm very impressed with what I've seen so far, and now I need to push it to its logical conclusion :) This framework is going to be a game changer for ML development.
@justinh-rahb That's very cool indeed. By "train", I'm assuming you meant finetune. Were you using the wikisql dataset that's provided or did you use a custom dataset?
Yes, sorry, I did mean fine-tune, and yes I just tried the WikiSQL example first. M2 Max 12 core, 38 GPU cores, 96GB RAM, completed in around 45 minutes. I've also tried following @awni's suggestion and while I think it's working, I don't have enough data in the custom dataset I'm using currently as it stops training well short of the requested iterations:
% python lora.py --model mistral-7B-v0.1-MLX/ --train --iters 1000 --learning_rate 1e-6 --steps_per_eval 20
Loading pretrained model
Total parameters 7243.436M
Trainable parameters 1.704M
Loading datasets
Training
Iter 1: Val loss 5.710, Val took 0.409s
Iter 10: Train loss 4.486, It/sec 2.941, Tokens/sec 187.370
Iter 20: Train loss 4.701, It/sec 2.619, Tokens/sec 159.998
Iter 20: Val loss 5.704, Val took 0.408s
Again, right now I'd put this down to my dataset being too sparse; there were no errors otherwise.
@justinh-rahb Right now the script only does one epoch over the data if the dataset can be processed in fewer than the maximum number of iterations. We could (and probably should) change that behavior so it just goes for the maximum number of iterations.
One way would be to change the iterate_batches function to take an infinite=True|False argument to loop over the training data indefinitely rather than stop when it ends.
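A rough sketch of that change (the real iterate_batches in lora.py takes more arguments; this only illustrates the looping behavior):

```python
import numpy as np


def iterate_batches(dataset, batch_size, infinite=False):
    while True:
        # Shuffle the sample order on every pass over the data.
        indices = np.random.permutation(len(dataset))
        for i in range(0, len(indices) - batch_size + 1, batch_size):
            yield [dataset[int(j)] for j in indices[i : i + batch_size]]
        if not infinite:
            # Default behavior: stop after a single epoch.
            break
```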
Cheers, that seems to have done the trick, though I still suspect I need more data :)
@justinh-rahb Well, when I fine-tune models on OpenAI and Google PaLM 2, I usually have a very small dataset with around 50 entries.
But I don't know if they use LoRA or some other fine-tuning technique.
@nimesttech Hmm, ya I did fine-tune it with gpt-3.5 also, just to sanity check my data. The dataset isn't a problem there, so I guess my hacky modifications either didn't work or I'm invoking generation incorrectly, because I'm not getting great results yet. But I'll most likely put that down to a skill issue. Going to keep fiddling with it and wait for more professional programmers to get up to speed on building out or extending tooling for this fantastic framework. The advantage of being able to do this locally is that I can afford to keep trying again and again; it isn't costing me anything, since the Mac's already paid for.
Thanks for your work, guys. I'm more on the solution architecture side and I code mostly in .NET, so I'm not much help with Python. But this will be very important, because I'm proposing a fine-tune-first approach for all the use cases we work on where a production application is to be built with foundation models. Having this tool will help a lot in explaining that to people.
Support for custom datasets in #115
I'm trying to fine-tune with LoRA, but I can't find where the training dataset is located so I can create my own. Has anyone managed to do this with their own dataset?