e2e torchtune tutorial example

seyeint commented 4 months ago

It would be positive to get something a bit more hands on than the current end to end tutorial. It's not that the end to end "workflow" is incomplete, but personally, it gave me a good intuition about how your framework works but it didn't really feel like after that I could go ahead and play alone, without maybe waiting for people to build end to end projects on top of this.

basic programming steps (not only shell commands, something in the middle of functional programming and OOP) like:

load the base model;
load the dataset we want to fine tune upon, maybe showing how it would be if we wanted to do our own? meaning what the workflow would be starting from a regular pandas dataframe and not necessarily calling a 1 line script;
train/finetune the base model;
save the LoRA adapter;
reload the base model at x precision;
merge the LoRA weights with the base model (this is probably automatically done in the training phase, so ignore if thats the case);
Save the merged model and push to Hugging face Hub (this tbh is already enough what you have on your pages).

personally, i've used different tools for different projects and I think a finetune example mini project that's not shell commands on a-priori assembled files could help users. for example the concept of recipe is very nice and important, it could also enter the end of this possible future tutorial where we save it and maybe even load it into another total different model to try it, if thats even possible?

it doesn't have to have these exact steps, it's just a general idea :)

ebsmothers commented 4 months ago

Thanks @seyeint for creating the issue! This is very valuable feedback. I agree that our current tutorials are a little bit too restricted to CLI commands. We are looking to build out our documentation with meatier examples and I think your suggestion makes a lot of sense.

A couple comments on specific points you raised:

Re saving LoRA adapters and weight merging, we actually do handle both of these already. So when you run a fine-tune, we do the weight merge before saving the checkpoint. That way you get out a fine-tuned checkpoint in the exact same format you started with and you can immediately use it for inference or whatever else. We also separately save just the adapter weights. But as I read through our docs I think this is not super clear and we should definitely emphasize it more.
Re custom datasets, I think this is a really good point. We have a datasets tutorial (and are currently making updates to it to better emphasize preparation of a custom dataset), but I agree it'd be useful for us to actually demonstrate this as part of the end-to-end flow since it's something most people will have to deal with.
Re reloading the model at a different precision, we do give an example of quantizing the fine-tuned model towards the end of the tutorial. Do you feel this is unclear or could use more detail? Or is there something else entirely you have in mind here? Would love to hear your thoughts on how we can make this part clearer.

I have some other thoughts here but will leave it there for now. Let me know if this makes sense or you disagree with any of the points, thanks!

seyeint commented 4 months ago

so I would say that regarding the capabilities of your framework, I'm already predicting you have everything we need, but the first Re point you made is good to hear. Every one of my points was more focused on the ideas for the tutorial and not so much me asking if your amazing framework was capable of reaching it.

regarding your last point, you're right and I would also say that I was communicating more about what could be on that tutorial (which obviously would have a lot of overlap with your end to end workflow that is already in place, but you really nailed it in the first phrase when you said "restricted to CLI commands". I think it makes total sense for the docs to start this way and evolve to more meatier as they evolve and eventually I'm not really imagining anything else needed.

To me, and obviously this goes from person to person, there's an asymmetry between profiles who are really into the practical level of fine tuning through other frameworks that maybe don't even have that great of documentation, but just the fact that because they were the go-to earlier than torch.tune, it's really easy for me to just navigate implementations of individuals and even mediums end to end pipelines that really do justice to the simplicity that is in fine tuning (even under the hood with no CLI).

My "fitness" evaluation for your docs would be: if anyone with decent field experience (who maybe worked with DL but not so much the fine tuning of LLMs), whether more theoretical or practical is able to open their IDE and create a small project that involves tuning a network with a dataframe of their choice, evaluate (if possible in their case), messing with some mechanics of the LoRA class they applied, save the recipe and feel prepared to use that same recipe for other future models... then it means that the docs are immaculate.

AsymptoticBrain commented 4 months ago

Yeah, I'd love to see that aswell, I'm a little stumped now looking at the E2E example and having to somehow dockerize it and have this run in a CI pipeline.

ebsmothers commented 4 months ago

@AsymptoticBrain thanks for the feedback here. Admittedly we do not have anything on docker in our docs and this is something we're hoping to improve upon. Personally I think your suggestion would merit its own standalone documentation, as it is kind of disjoint from the E2E tutorial, which is more focused on the library's functionality rather than the engineering work needed to integrate this into a production workflow.

In the meantime, @tcapelle had put together a Docker and shared it here, this may be useful as a starting point. (Obviously pointing to a comment on an issue is not addressing the lack of proper documentation, but at least some stopgap to hopefully help you out in the short-term.) Regarding CI pipelines, can you give a bit more detail here? What sort of documentation would you find useful?

AsymptoticBrain commented 4 months ago

Oh no that's really helpful regardless 😊 when it comes to CI/CD pipelines yeah it's more of how we have our MLops setup, I think many are in a position where we are behind firewalls and use sensitive data so we need to have all that data and models local. I'll play around a little with it some if I can get the docker working and come back to you.

RdoubleA commented 1 month ago

@SalmanMohammadi will the custom recipe tutorial in #1196 address this?

SalmanMohammadi commented 1 month ago

@SalmanMohammadi will the custom recipe tutorial in #1196 address this?

Thanks for the ping, I hadn't seen this issue! This is some great feedback and I'll definitely be incorporating some of the ideas here.

pytorch / torchtune

e2e torchtune tutorial example #825