johko / computer-vision-course

This repo is the homebase of a community driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face discord: hf.co/join/discord

Multimodal Transfer Learning: Draft outline #56

Closed cfalholt closed 2 months ago

cfalholt commented 8 months ago

Hi CV course contributors! We would love to hear your feedback on the multimodal transfer learning section of the course. Here's the current general outline, along with some of the thoughts we've had within the team. What do you think of the following outline:

1. Introduction

2. Zero-shot multi-modality

3. Full fine-tuning

Discussion Point: Full fine-tuning may be impractical for multimodal models due to their size. Should we still include it for educational value or focus solely on PEFT?

4. Parameter-efficient fine-tuning (PEFT)

5. Final remarks

johko commented 8 months ago

Hey @cfalholt,

thank you for the draft outline :hugs:

1. Introduction

Keep the part about different fine-tuning methods really short, as the main description should be in the first transfer learning chapter of the course (on CNNs). It is good to have a reminder here for people, but you don't need to be exhaustive about it.

I like the idea of introducing a task here as an example already, and you can also connect that to loading and inspecting the dataset.

2. Zero-shot multi-modality

Good point, and something quite special to multimodal models. We will also have a dedicated chapter on that, https://github.com/johko/computer-vision-course/issues/43, but I also see it as an important point here. You can focus on being really hands-on and give a lot of examples.
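
To make that concrete, a zero-shot example for this section could be as small as the sketch below, which uses the transformers zero-shot image classification pipeline with a CLIP checkpoint; the image URL and candidate labels are just placeholders.

```python
from transformers import pipeline

# Zero-shot sketch: CLIP scores an image against free-form text labels,
# no fine-tuning involved. Checkpoint, image URL and labels are placeholders.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)
predictions = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",  # two cats on a couch
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a car"],
)
print(predictions[0])  # highest-scoring label with its score
```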

3. Full fine-tuning

Your question already states the biggest issue here - fine-tuning big multimodal models might be rather costly. But I would still not skip this section completely. Maybe don't use a multimodal model with an LLM connection as an example here, but rather something like LayoutLM or OWL-ViT, as they are potentially easier to train (I guess).
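
Just to illustrate what that could look like, here is a rough full fine-tuning sketch with LayoutLM, where every parameter is trainable. The dummy tensors only stand in for a properly processed document dataset (e.g. FUNSD), and the label count is an assumption.

```python
import torch
from transformers import LayoutLMForTokenClassification

# Full fine-tuning sketch: all LayoutLM parameters are updated.
# Dummy tensors stand in for a real processed dataset; num_labels=7 is assumed.
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=7
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # every weight is trainable

batch_size, seq_len = 2, 32
input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len))
bbox = torch.tensor([0, 0, 100, 100]).repeat(batch_size, seq_len, 1)  # word boxes on a 0-1000 scale
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
labels = torch.randint(0, 7, (batch_size, seq_len))

model.train()
outputs = model(input_ids=input_ids, bbox=bbox,
                attention_mask=attention_mask, labels=labels)
outputs.loss.backward()  # gradients flow through the whole backbone
optimizer.step()
```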

4. Parameter-efficient fine-tuning (PEFT)

Again, try to sync with the other Fine-Tuning teams (especially https://github.com/johko/computer-vision-course/issues/53) to avoid overlap, but apart from that the section sounds good
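
For reference, and with the exact setup left to the fine-tuning teams, a LoRA sketch with the peft library could look roughly like this. A plain ViT backbone is used here only to keep the example small; the checkpoint, rank and target module names are assumptions and would differ for a multimodal model.

```python
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

# PEFT sketch: wrap a pretrained backbone with LoRA adapters so only a small
# fraction of the parameters is trained. All names here are assumptions.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in the ViT blocks
    modules_to_save=["classifier"],     # keep the freshly initialized head trainable
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only a small percentage of weights is trainable
```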

5. Final remarks :+1:

In general I think the Transfer Learning chapter (as the other ones before that) should mainly focus on practical applications and not so much on theory. A bit like the Jupyter notebooks from @NielsRogge here: https://github.com/NielsRogge/Transformers-Tutorials

Hope that helps :slightly_smiling_face:

minemile commented 8 months ago

Hello @johko! Thank you very much for your comments!

Keep the part about different fine-tuning methods really short, as the main description should be in the first transfer learning chapter of the course (on CNNs).

Yeah, we will try to do this as a reminder for people who are only interested in this part of the course

We will also have a dedicated chapter on that, #43, but I also see it as an important point here. You can focus on being really hands-on and give a lot of examples.

I completely agree. We will need to make sure that our examples do not overlap with #43. However, judging by their outline, this should not be the case.

Your question already states the biggest issue here - fine-tuning big multimodal models might be rather costly. But I would still not skip this section completely. Maybe don't use a multimodal model with an LLM connection as an example here, but rather something like LayoutLM or OWL-ViT, as they are potentially easier to train (I guess).

A difficult question. On the one hand, full fine-tuning is the simplest way to train a model for a specific task, which is perfect for educational purposes. On the other hand, in practice it is usually more effective to use PEFT when training multimodal models.

What do you think about this? Should we add an example of full fine-tuning?

Again, try to sync with the other Fine-Tuning teams (especially #53) to avoid overlap, but apart from that the section sounds good

I think we will inevitably encounter overlaps, because there are not too many transfer learning methods that are specialized for specific types of models. The theoretical parts of our chapters will probably be quite similar, but the practical parts will differ due to differences in the models.

In general I think the Transfer Learning chapter (as the other ones before that) should mainly focus on practical applications and not so much on theory.

Absolutely agree!

I have also prepared a list of tasks and models that we can cover in this chapter:

Could you please give feedback on it? Maybe something there is not worth covering, or, vice versa, something would be great to add.

johko commented 7 months ago

Hey @minemile

sounds very good overall :slightly_smiling_face:

Regarding full fine-tuning, it really depends. If you can find a model and dataset that are cheap to run and good for educational purposes, this would be a great part and give the participants a nice feeling of success. If you don't find anything good, don't try too hard. Cover the theory of full fine-tuning, but mention the difficulties and obstacles, and maybe use that as an introduction to why things like PEFT are so helpful now.

I also like the tasks and the connected models; it feels like a good, expressive variety :hugs: