pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

[Multimodal] Adding OBELICS DataLoader #650

Open TJ-Solergibert opened 1 week ago

TJ-Solergibert commented 1 week ago

Hi!

I’ve started developing the Multimodal DataLoader. After taking a (deep) look at this whole multimodal universe, I would like to discuss a couple of things before continuing. I’m using the torchtune repo as a reference.

```python
tokenizer = build_tokenizer("tiktoken", "/workspace/mm/tokenizer.model")
data_loader = build_hf_data_loader(
    dataset_name="c4",
    dataset_path=None,
    tokenizer=tokenizer,
    batch_size=4,
    seq_len=32,
    world_size=4,
    rank=0,
)

batch = next(iter(data_loader))
input_ids, labels = batch

for idx, sample in enumerate(input_ids):
    print(f"| Sample {idx} | {tokenizer.decode(list(sample))}")
```

```text
| Sample 0 | <|begin_of_text|>Beginners BBQ Class Taking Place in Missoula! Do you want to get better at making delicious BBQ? You will have the opportunity, put this on
| Sample 1 | calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level
| Sample 2 | for everyone who wants to get better with their culinary skills. He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques
| Sample 3 | recipes, timelines, meat selection and trimming, plus smoker and fire information. The cost to be in the class is $35 per person, and for spectators it
```



- Packing: Just like in the SFT phase, the length of the samples is usually much shorter than the model's sequence length, so we usually pack multiple dataset samples into a single one. This is not straightforward, as we need to consider the following:
  - First, to pack correctly, it’s important to construct the attention masks and to pass the model the position of each token relative to its own sample, so that the position embeddings are applied correctly ([Nice torchtune explanation](https://github.com/pytorch/torchtune/blob/74139c9da0a0f92d4da5f9c918f48fd1680ba157/torchtune/datasets/_packed.py#L35-L63)). Currently, `torchtitan` doesn’t support per-sample position ids, as it directly uses a [precomputed one](https://github.com/pytorch/torchtitan/blob/ec337c3e1a1b1bb7c3ece3906e068e911169552e/torchtitan/models/llama_multimodal/model.py#L99). For images, `torchtitan` does consider the [image masks](https://github.com/pytorch/torchtitan/blob/ec337c3e1a1b1bb7c3ece3906e068e911169552e/torchtitan/models/llama_multimodal/model.py#L1128).
  - Next, we would need to establish a limit for the number of samples to pack. In the case of text, it’s relatively easy, as it packs samples until filling the sequence length. In this case, we would also need to consider the maximum number of images we want to have per sample.
  - Finally, if we want to use `batch size > 1` or SP, we will have to pad the samples. For the first case, it’s only necessary to pad to the longest sequence in the batch (and the longest number of images in the batch), while for the second case, we will have to pad the sequences to the model's sequence length, or else the SP `reduce_scatter` calls will fail.
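The packing scheme above (per-sample position ids, document boundaries for the attention mask, padding to the model's sequence length) can be sketched as follows. This is a minimal illustration under my own assumptions, not torchtitan's or torchtune's actual packing code; `pack_samples` and `document_ids` are made-up names.

```python
# Sketch: greedily pack tokenized samples into one fixed-length sequence,
# restarting position ids at 0 for each packed document so position
# embeddings are applied per sample, and tracking document ids from which
# a block-diagonal attention mask can be built.
import torch

def pack_samples(samples: list[list[int]], seq_len: int, pad_id: int = 0):
    input_ids, position_ids, document_ids = [], [], []
    for doc_idx, tokens in enumerate(samples):
        if len(input_ids) + len(tokens) > seq_len:
            break  # a real loader would buffer the leftover sample for the next pack
        input_ids.extend(tokens)
        position_ids.extend(range(len(tokens)))  # positions restart per sample
        document_ids.extend([doc_idx] * len(tokens))
    # Pad up to seq_len (needed for batch size > 1 or sequence parallelism).
    pad = seq_len - len(input_ids)
    input_ids += [pad_id] * pad
    position_ids += [0] * pad
    document_ids += [-1] * pad  # -1 marks padding
    return (
        torch.tensor(input_ids),
        torch.tensor(position_ids),
        torch.tensor(document_ids),
    )

ids, pos, docs = pack_samples([[1, 2, 3], [4, 5]], seq_len=8)
```

Masking attention so that token `i` can only attend to token `j` when `document_ids[i] == document_ids[j]` (and `j <= i`) is what keeps packed samples independent.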

I was surprised to see that torchtune doesn’t currently support this feature for MultiModal datasets, whereas it does for SFT ones. I think it’s necessary to develop a solution with packing to achieve maximum performance.

- Other comments:
  - In the `LearnableProjection` forward method, [this line](https://github.com/pytorch/torchtitan/blob/ec337c3e1a1b1bb7c3ece3906e068e911169552e/torchtitan/models/llama_multimodal/model.py#L937) is duplicated.
  - The MultiModal DataLoader will produce a different amount of elements than the [text one](https://github.com/pytorch/torchtitan/blob/1060feacc1b51cb6b339a04e53a5243b8466552b/train.py#L259). We need to study further whether it’s possible to maintain compatibility with `train.py`, but using [`TensorDict`](https://github.com/pytorch/tensordict) could be a good idea both for the model's forward pass (`model(**batch)`) and for device placement (`batch.cuda()`).

Without a doubt, this is a great (and fun) exercise to dive into multimodality! Let me know your thoughts!

Toni

cc: @tianyu-l @fduwjj
casper-hansen commented 1 week ago

A more general multimodal data solution might be to use the following library: https://github.com/mosaicml/streaming

fduwjj commented 1 week ago

@TJ-Solergibert thanks for your comments.

Regarding what you said here:

> Currently, torchtitan doesn’t support introducing different position ids for each sample, as it directly uses a precomputed one

This is an ongoing work and I plan to improve it as well. What you mentioned here is part of it.

> We would need to assess to what extent this could cause a bottleneck, but it’s clear that we could alleviate this issue if we could use num_workers > 1 in the DataLoader, something we can’t (easily) do with an Iterable one.

We can use a multiprocess dataloader, but maybe we can start with a really slow first version and then optimize it?

> Next, we would need to establish a limit for the number of samples to pack

Yes, this is common in MM models.

For the sequence length, can we make the longest sequence length the same as the model seq length? Also, for the trainer, ideally we want to reuse the current train.py. Or you can have your own prototype and we can then have another discussion.

TJ-Solergibert commented 1 week ago

Hi @casper-hansen, thanks for your suggestion, but it's not a matter of loading "lots of images efficiently at scale" but rather of how to prepare the inputs for the model.

TJ-Solergibert commented 1 week ago

Hi @fduwjj,

> This is an ongoing work and I plan to improve it as well. What you mentioned here is part of it.

Nice! So I'll prepare a `position_ids` tensor with the same shape as `input_ids`.

> We can use multiprocess dataloader but maybe we can start with a really slow first and then optimize it?

Setting `num_workers > 1` with an `IterableDataset` is not trivial. Let's begin with a first version using an `IterableDataset` with `num_workers < 2` and hope that we manage to hide the DataLoader work behind the training step.
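For context on why `num_workers > 1` with an `IterableDataset` is not trivial: each worker process receives a full copy of the iterator, so the same samples get yielded `num_workers` times unless the stream is explicitly sharded. A minimal sketch of the usual fix, using `torch.utils.data.get_worker_info()` (the class name here is made up):

```python
# Sketch: shard an IterableDataset's stream across DataLoader workers so
# each sample is seen by exactly one worker.
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedIterable(IterableDataset):
    def __init__(self, n: int):
        self.n = n

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        # Stride over the stream by worker id: worker 0 yields 0, k, 2k, ...
        for i in range(worker_id, self.n, num_workers):
            yield i

# num_workers=0 here for a self-contained run; with num_workers=2 each
# worker would yield only its own stride, and the union covers the stream.
loader = DataLoader(ShardedIterable(8), batch_size=None, num_workers=0)
items = sorted(int(x) for x in loader)
```

Without the stride (or an equivalent per-worker split of shards/files), every worker would replay the full dataset.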

> For the sequence length, can we make the longest sequence length same as model seq length?

Yes, usually you pack sequences until filling up the model's sequence length, BUT now you will also want to control the size of the `encoder_input`s in the fusion layers. Imagine you pack 10 samples which sum up to 6k tokens BUT contain 70 images: that can produce OOM errors. You will have to check against both the model sequence length AND a predefined limit on the number of images.
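The dual budget described above reduces to a simple admission check in the packing loop. A minimal sketch, with illustrative limits and a made-up helper name:

```python
# Sketch: a sample is admitted into the current pack only if BOTH the
# token budget and the image budget still hold afterwards.
def fits(cur_tokens: int, cur_images: int,
         sample_tokens: int, sample_images: int,
         seq_len: int = 8192, max_images: int = 16) -> bool:
    return (cur_tokens + sample_tokens <= seq_len
            and cur_images + sample_images <= max_images)

# The 6k-token / 70-image pack from the example is rejected by the image
# budget even though it fits comfortably within the sequence length:
ok_tokens_too_many_images = fits(0, 0, 6000, 70)
ok_both = fits(0, 0, 6000, 12)
```

The same check generalizes to any other budget the fusion layers impose (e.g. total number of tiles rather than images).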

> Also for the trainer, ideally we want to reuse the current train.py. Or you can have your own prototype and we can then have an another discussion.

Yes, my intention is to maintain the compatibility with train.py. I think that if we switch the batches from the DataLoader to TensorDicts everything will run smoothly!

I will continue working over the weekend on a first prototype. So far it's looking great; now I have to figure out the best way to pack multiple samples properly, respecting both the token masks and the `encoder_mask`s.

Toni

tianyu-l commented 1 week ago

On the necessity of shuffling:

> Also, as you mention in the text dataset, this option doesn’t allow shuffling the documents from the dataset. In fact, it even forces you to have multiple samples from the same document in the same batch if the document is long enough (I’m attaching an example). I’m not sure how relevant this may be, but I would expect to have multiple samples from different documents in each batch.

I'd assume that most of the time, the sample/document is shorter than the max sequence length of training, as you also mentioned:

> Just like in the SFT phase, the length of the samples is usually much shorter than the model's sequence length, so we usually pack multiple dataset samples into a single one.

If consecutive samples are all from the same source, then what needs to be done is either (1) (if training is still done at the sample level) data preprocess which falls outside the scope of this repo, or (2) (o/w) we should support longer sequence length to cover most full documents.

tianyu-l commented 6 days ago

@andrewkho wonder if the PyTorch dataloading solution would be a good fit here

andrewkho commented 6 days ago

Hi @tianyu-l yes definitely a good fit here. Hi @TJ-Solergibert and everyone, I'm coming from pytorch/data side of things and think we have some things up our sleeve we could propose that would help here. We're also in contact with the torchtune folks. Let's spend some time testing out some solutions and hopefully find some common ground.

TJ-Solergibert commented 5 days ago

Hi @tianyu-l & @andrewkho,

I've recently submitted #663 with a first prototype. Most of the code comes from torchtune. I also provide some evidence on why we should develop a solution that packs multiple samples from the dataset. In short, if we don't, we will need to pad every sample to the maximum number of images in the batch, where every image has shape [Number of tiles, Channels, Tile size, Tile size] --> [4, 3, 448, 448]. And there are samples with LOTS of images, so the majority of the inputs end up being useless padding tokens. Regardless of whether torchtitan is interested in incorporating a solution with packing or not, I will work on that feature nevertheless.

Toni