ml-explore / mlx-examples

Examples in the MLX framework
MIT License
6.18k stars 873 forks

How can I edit the training dataset for Lora? #68

Closed nimesttech closed 10 months ago

nimesttech commented 11 months ago

I'm trying to fine-tune with LoRA, but I can't find where the training dataset is located so I can create my own. Has anyone managed to run it on their own dataset?

awni commented 11 months ago

This is the data loading part of the example.

Right now, the easiest way to change it is the following:

  1. Make a new python file mydata.py
  2. Put a class in it like MyData which holds your data in whatever format. As long as it has __getitem__ (which returns a string sample) and __len__ (which returns the length of the dataset), it should work. See the wikisql dataset as an example
  3. Put a load function in mydata.py which returns the training, validation, and test sets, example here
  4. Change the references to wikisql in lora.py to mydata
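The steps above might look something like the following minimal sketch. Note this is an illustration, not the example's actual code: the JSONL file layout, the "text" field, and the file names are assumptions.

```python
# mydata.py -- hypothetical sketch of steps 1-3; only the MyData/load names
# and the __getitem__/__len__ interface come from the suggestion above.
import json


class MyData:
    """Holds text samples; lora.py only needs __getitem__ and __len__."""

    def __init__(self, path):
        # Assumes one JSON object per line with a "text" field.
        with open(path) as f:
            self._data = [json.loads(line)["text"] for line in f]

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)


def load():
    # Return training, validation, and test sets, mirroring the wikisql loader.
    return MyData("train.jsonl"), MyData("valid.jsonl"), MyData("test.jsonl")
```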

Sorry it's a bit cumbersome right now, but it's conceptually pretty simple.

We can write a JSON loader or a text file loader for custom datasets if that would be helpful.

shamikbose commented 11 months ago

We can write a JSON loader or a text file loader for custom datasets if that would be helpful.

That would be very helpful! I would like to help if possible

awni commented 11 months ago

Happy to take a PR for this if you would like to implement it!

shamikbose commented 11 months ago

Playing around with it now. Looks like the changes might be minimal if we wanted to use HF datasets. Can we use streaming datasets?

nimesttech commented 11 months ago

One thing that would be REALLY helpful from a user's point of view would be the ability to specify the path to a JSONL, CSV, or JSON file for the training dataset, similar to how we do it in PyTorch or on LLM platforms such as OpenAI.

For example, a JSON format like {"prompt":, "completion":}

I think that would be easier than trying to implement a complicated dataset that extracts SQL queries
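A loader for that kind of record could be a thin wrapper over the same __getitem__/__len__ interface the example expects. This is a hypothetical sketch: the class name and the way prompt and completion are joined into a single training string are assumptions.

```python
# Hypothetical loader for JSONL records like {"prompt": ..., "completion": ...};
# the field names and the simple concatenation below are assumptions.
import json


class PromptCompletionData:
    def __init__(self, path):
        self._samples = []
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                # Join prompt and completion into one training string.
                self._samples.append(record["prompt"] + record["completion"])

    def __getitem__(self, idx):
        return self._samples[idx]

    def __len__(self):
        return len(self._samples)
```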

justinh-rahb commented 11 months ago

I just trained a 7B model on my workstation over my lunch break without having to rent the capability by the hour, which is very cool. A more generalized loader function for using alternative datasets would be extremely appreciated for further testing. I'm very impressed with what I've seen so far, and now I need to push it to its logical conclusion :) This framework is going to be a game changer for ML development.

shamikbose commented 11 months ago

@justinh-rahb That's very cool indeed. By "train", I'm assuming you meant finetune. Were you using the wikisql dataset that's provided or did you use a custom dataset?

justinh-rahb commented 11 months ago

Yes, sorry, I did mean fine-tune, and yes I just tried the WikiSQL example first. M2 Max 12 core, 38 GPU cores, 96GB RAM, completed in around 45 minutes. I've also tried following @awni's suggestion and, while I think it's working, I don't have enough data in the custom dataset I'm using currently, as it stops training well short of the requested iterations:

% python lora.py --model mistral-7B-v0.1-MLX/ --train --iters 1000 --learning_rate 1e-6 --steps_per_eval 20
Loading pretrained model
Total parameters 7243.436M
Trainable parameters 1.704M
Loading datasets
Training
Iter 1: Val loss 5.710, Val took 0.409s
Iter 10: Train loss 4.486, It/sec 2.941, Tokens/sec 187.370
Iter 20: Train loss 4.701, It/sec 2.619, Tokens/sec 159.998
Iter 20: Val loss 5.704, Val took 0.408s

Again, right now I'd put this down to my dataset being too sparse, there were no errors otherwise.

awni commented 11 months ago

@justinh-rahb Right now the script only does one epoch over the data if the dataset can be processed in fewer than the maximum number of iterations. We could (and probably should) change that behavior so it just goes for the maximum number of iterations.

One way would be to change the iterate_batches function to take an infinite=True|False argument to infinitely loop over the training data rather than stop when it ends.
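A rough sketch of that change is below. This is simplified: the real iterate_batches in lora.py also handles tokenization, padding, and batching into arrays, all of which is omitted here.

```python
# Simplified sketch of the suggested infinite=True|False change; only the
# function name and the flag come from the suggestion above.
import random


def iterate_batches(dataset, batch_size, infinite=False):
    while True:
        # Reshuffle at the start of each pass over the data.
        indices = list(range(len(dataset)))
        random.shuffle(indices)
        for i in range(0, len(dataset) - batch_size + 1, batch_size):
            yield [dataset[j] for j in indices[i : i + batch_size]]
        if not infinite:
            # Default behavior: stop after one epoch.
            break
```

With infinite=True the generator keeps cycling, so the training loop can run for the full requested number of iterations even on a small dataset.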

justinh-rahb commented 11 months ago

One way would be to change the iterate_batches function to take an infinite=True|False argument to infinitely loop over the training data rather than stop when it ends.

Cheers, that seems to have done the trick, though I still suspect I need more data :)

nimesttech commented 11 months ago

@justinh-rahb Well, when I fine-tune models on OpenAI and Google PaLM 2, I usually have a very small dataset with around 50 entries.

nimesttech commented 11 months ago

But I don't know if they use LoRA or another fine-tuning technique

justinh-rahb commented 11 months ago

@justinh-rahb Well, when I fine-tune models on OpenAI and Google PaLM 2, I usually have a very small dataset with around 50 entries.

@nimesttech Hmm, ya I did fine-tune with GPT-3.5 also, just to sanity check my data. The dataset isn't a problem there, so I guess my hacky modifications either didn't work or I'm invoking generation incorrectly, because I'm not getting great results yet. But I'll attribute that to a skill issue, most likely. Going to keep fiddling with it, and wait for more professional programmers to get up to speed on building out or extending tooling for this fantastic framework. The advantage of being able to do this locally now is that I can afford to keep trying again and again; it isn't costing me anything, since the Mac's already paid for.

nimesttech commented 11 months ago

Thanks for your work, guys. I'm more on the Solution Architecture side and I code mostly in .NET, so I'm not much help with Python. But this will be very important, because I'm proposing a fine-tune-first approach for every use case we work on where a production application is to be built with foundation models. Having this tool will help a lot in explaining that to people.

awni commented 10 months ago

Support for custom datasets added in #115