TJ-Solergibert opened this issue 1 month ago
A more general multimodal data solution might be to use the following library: https://github.com/mosaicml/streaming
@TJ-Solergibert thanks for your comments.
Regarding what you said here:
Currently, torchtitan doesn’t support introducing different position ids for each sample, as it directly uses a precomputed one
This is an ongoing work and I plan to improve it as well. What you mentioned here is part of it.
We would need to assess to what extent this could cause a bottleneck, but it’s clear that we could alleviate this issue if we could use num_workers > 1 in the DataLoader, something we can’t (easily) do with an Iterable one.
We can use a multiprocess dataloader, but maybe we can start with a really slow first version and then optimize it?
Next, we would need to establish a limit for the number of samples to pack

Yes, this is common in MM models.
For the sequence length, can we make the longest sequence length the same as the model seq length? Also for the trainer, ideally we want to reuse the current train.py. Or you can have your own prototype and we can then have another discussion.
Hi @casper-hansen, thanks for your suggestion, but it's not a matter of loading "lots of images efficiently at scale" but rather how to prepare the inputs for the model.
Hi @fduwjj,
This is an ongoing work and I plan to improve it as well. What you mentioned here is part of it.
Nice! So I'll prepare a position_ids tensor with the same shape as input_ids.
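As a minimal sketch (plain Python, no torch) of what such per-sample position ids could look like when several samples are packed into one row: positions restart at 0 at each sample boundary, so attention with per-sample masks sees the right relative offsets. `packed_position_ids` is a hypothetical helper, not torchtitan API:

```python
def packed_position_ids(sample_lengths):
    """Build position ids for a packed row: each packed sample's
    positions restart at 0 instead of continuing across samples."""
    pos = []
    for length in sample_lengths:
        pos.extend(range(length))
    return pos

# Three samples of lengths 3, 2 and 4 packed into a single row:
print(packed_position_ids([3, 2, 4]))
# [0, 1, 2, 0, 1, 0, 1, 2, 3]
```

In practice this list would be turned into a tensor with the same shape as input_ids and handed to the rotary/positional embedding alongside the packed tokens.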
We can use a multiprocess dataloader, but maybe we can start with a really slow first version and then optimize it?
Setting num_workers > 1 with an IterableDataset is not trivial. Let's begin with a first version using an IterableDataset with num_workers < 2 and hope that we manage to hide the DataLoader work behind the training step.
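For context on why num_workers > 1 is awkward here: each worker process re-runs the whole iterable, so without extra care every worker yields every sample. The usual fix is to shard the stream by worker id (with torch you would read it from torch.utils.data.get_worker_info()). A torch-free sketch of the idea; shard_for_worker is a hypothetical helper:

```python
def shard_for_worker(stream, worker_id, num_workers):
    """Round-robin split of an iterable stream across dataloader
    workers, so each worker yields a disjoint subset of samples."""
    for i, sample in enumerate(stream):
        if i % num_workers == worker_id:
            yield sample

# Worker 1 of 2 sees every second sample:
print(list(shard_for_worker(range(6), worker_id=1, num_workers=2)))
# [1, 3, 5]
```

The extra wrinkle with packing is that sharding by raw sample index changes which samples land in the same pack per worker, which is part of why starting with a single worker is the simpler first version.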
For the sequence length, can we make the longest sequence length same as model seq length?
Yes, usually you pack sequences until filling up the seq length of the model, but now you will also want to control the size of the encoder_inputs in the fusion layers. Imagine you pack 10 samples which sum up to only 6k tokens but contain 70 images; that can produce OOM errors. You will have to check that you surpass neither the model seq length nor a predefined limit on the number of images.
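A minimal sketch of packing under those two budgets, assuming a simple greedy strategy over a stream of (num_tokens, num_images) pairs; `pack_samples` and its signature are hypothetical, not code from torchtune or torchtitan:

```python
def pack_samples(samples, max_tokens, max_images):
    """Greedily group samples into packs, closing the current pack as
    soon as adding the next sample would exceed either the token
    budget or the image budget. Each sample is (num_tokens, num_images)."""
    packs, current, tokens, images = [], [], 0, 0
    for n_tok, n_img in samples:
        if current and (tokens + n_tok > max_tokens or images + n_img > max_images):
            packs.append(current)
            current, tokens, images = [], 0, 0
        current.append((n_tok, n_img))
        tokens += n_tok
        images += n_img
    if current:
        packs.append(current)
    return packs

# The third sample fits the token budget but would blow the image budget,
# so it starts a new pack:
samples = [(1000, 2), (3000, 40), (2000, 40), (500, 1)]
print(pack_samples(samples, max_tokens=8192, max_images=64))
# [[(1000, 2), (3000, 40)], [(2000, 40), (500, 1)]]
```

A real implementation would also need to carry the per-sample attention masks and encoder inputs along with the token counts, but the two-budget check is the core of it.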
Also for the trainer, ideally we want to reuse the current train.py. Or you can have your own prototype and we can then have another discussion.
Yes, my intention is to maintain compatibility with train.py. I think that if we switch the batches from the DataLoader to TensorDicts, everything will run smoothly!
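To illustrate the idea, here is a sketch of a collate step that returns a named batch instead of an (input_ids, labels) tuple. A plain dict stands in for tensordict.TensorDict (the real class also carries batch_size metadata and stacked torch.Tensor values); the field names and `collate_multimodal` helper are illustrative assumptions:

```python
def collate_multimodal(samples):
    """Collate packed samples into one named batch. A plain dict
    stands in for tensordict.TensorDict; in practice each value
    would be a stacked torch.Tensor."""
    return {
        "input_ids": [s["input_ids"] for s in samples],
        "labels": [s["labels"] for s in samples],
        "position_ids": [s["position_ids"] for s in samples],
        "encoder_input": [s["encoder_input"] for s in samples],
    }

samples = [
    {"input_ids": [1, 2], "labels": [2, 3],
     "position_ids": [0, 1], "encoder_input": ["img0"]},
]
batch = collate_multimodal(samples)
print(sorted(batch.keys()))
# ['encoder_input', 'input_ids', 'labels', 'position_ids']
```

The point is that train.py can then look fields up by name, so adding multimodal inputs doesn't change the trainer's unpacking logic.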
I will continue working over the weekend on a first prototype. So far it's looking great; now I have to figure out the best way to pack multiple samples properly, respecting both the masks from the tokens and the encoder_masks.
Toni
On the necessity of shuffling:
Also, as you mention in the text dataset, this option doesn’t allow shuffling the documents from the dataset. In fact, it even forces you to have multiple samples from the same document in the same batch if the document is long enough (I’m attaching an example). I’m not sure how relevant this may be, but I would expect to have multiple samples from different documents in each batch.
I'd assume that most of the time the sample/document is shorter than the max_seq_length of training, as you also mentioned:
Just like in the SFT phase, the length of the samples is usually much shorter than the model's sequence length, so we usually pack multiple dataset samples into a single one.
If consecutive samples are all from the same source, then what needs to be done is either (1) (if training is still done at the sample level) data preprocessing, which falls outside the scope of this repo, or (2) (otherwise) we should support longer sequence lengths to cover most full documents.
@andrewkho wonder if the PyTorch dataloading solution would be a good fit here
Hi @tianyu-l, yes, definitely a good fit here. Hi @TJ-Solergibert and everyone, I'm coming from the pytorch/data side of things and think we have some things up our sleeve that we could propose to help here. We're also in contact with the torchtune folks. Let's spend some time testing out some solutions and hopefully find some common ground.
Hi @tianyu-l & @andrewkho,
I've recently submitted #663 with a first prototype. Most of the code comes from torchtune. I also provide some evidence on why we should develop a solution that is able to pack multiple samples from the Dataset. In short, if we don't, we will need to pad every sample to the maximum number of images in the batch, where every image has shape [Number of tiles, Channels, Tile size, Tile size] --> [4, 3, 448, 448]. And there are samples with LOTS of images, so the majority of the inputs end up being useless padding. Whether or not torchtitan is interested in incorporating a solution with packing, I will work on that feature nevertheless.
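The arithmetic behind that padding waste can be sketched in a few lines. The image counts below are illustrative, and `padding_fraction` is a hypothetical helper, not code from #663:

```python
def padding_fraction(images_per_sample):
    """Fraction of per-image encoder inputs that are pure padding when
    every sample in the batch is padded to the batch maximum number of
    images (each image itself being a [4, 3, 448, 448] tensor)."""
    batch_max = max(images_per_sample)
    total_slots = batch_max * len(images_per_sample)
    real_images = sum(images_per_sample)
    return (total_slots - real_images) / total_slots

# A single outlier sample with 70 images forces heavy padding
# on the other three samples:
print(padding_fraction([1, 2, 3, 70]))
# ~0.73, i.e. roughly 73% of the encoder inputs would be padding
```

Packing sidesteps this because the image budget is enforced per pack instead of being dictated by the largest sample in the batch.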
Toni
Hi!
I’ve started developing the Multimodal DataLoader. After taking a (deep) look at this whole multimodal universe, I would like to discuss a couple of things before continuing. I’m using the torchtune repo as a reference.
As we have already mentioned, the DataLoader will only be compatible with the OBELICS dataset. It’s worth noting that this is a nice dataset since it not only contains (Image, Text) pair samples but also other patterns like (Image, Image, Text, Image, Text) or (Text, Image, Image, Text), among others.
Iterable dataset: I assume the solution must be an Iterable Dataset, like the one already available for text-only pretraining. However, I think it’s necessary to consider the following:
We would need to assess to what extent this could cause a bottleneck, but it's clear that we could alleviate this issue if we could use num_workers > 1 in the DataLoader, something we can't (easily) do with an Iterable one.

```python
tokenizer = build_tokenizer("tiktoken", "/workspace/mm/tokenizer.model")
data_loader = build_hf_data_loader(
    dataset_name="c4",
    dataset_path=None,
    tokenizer=tokenizer,
    batch_size=4,
    seq_len=32,
    world_size=4,
    rank=0,
)

batch = next(iter(data_loader))
input_ids, labels = batch

for idx, sample in enumerate(input_ids):
    print(f"| Sample {idx} | {tokenizer.decode(list(sample))}")
```

```
| Sample 0 | <|begin_of_text|>Beginners BBQ Class Taking Place in Missoula! Do you want to get better at making delicious BBQ? You will have the opportunity, put this on
| Sample 1 | calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level
| Sample 2 | for everyone who wants to get better with their culinary skills. He will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques
| Sample 3 | recipes, timelines, meat selection and trimming, plus smoker and fire information. The cost to be in the class is $35 per person, and for spectators it
```