mindspore-lab / mindone

one for all, Optimal generator with No Exception
https://mindspore-lab.github.io/mindone/
Apache License 2.0

improve data pipeline by adding MindData native support #301

Closed · hadipash closed 5 months ago

hadipash commented 5 months ago

This PR introduces native support for the MindData pipeline, building upon our successful experiments in the MindOCR project (https://github.com/mindspore-lab/mindocr/pull/416). Our experiments showed that this pipeline not only simplifies the development process through modularization but also improves training speed by up to 3 times.

However, as the native pipeline integration occurred during the late stages of MindOCR development, the process was quite challenging, and the PR has not yet been merged. To prevent similar problems in MindONE, I strongly recommend adopting the new pipeline as early as possible.

The idea is as follows: each dataset in MindONE should inherit from the BaseDataset class, define the required output_columns attribute, and implement the abstract methods. These include train_transforms, which build_dataloader later uses to map transformations onto the training data.
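The design above might look roughly like this. This is a minimal sketch, not the actual PR code: the method signatures and the shape of the transform dicts are assumptions; only the names BaseDataset, output_columns, train_transforms, and build_dataloader come from the discussion.

```python
# Illustrative sketch of the proposed dataset contract (assumed signatures).
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Base class every MindONE dataset would inherit from."""

    # each subclass must declare the column names it yields
    output_columns: list

    @abstractmethod
    def __getitem__(self, idx):
        """Return one sample as a tuple matching output_columns."""

    @abstractmethod
    def __len__(self):
        """Return the number of samples."""

    @staticmethod
    @abstractmethod
    def train_transforms(target_size):
        """Return per-column transform specs that build_dataloader
        maps onto the training data (shape of the dicts is assumed)."""


class ImageDataset(BaseDataset):
    output_columns = ["image", "caption"]

    def __init__(self, samples):
        self._samples = samples  # list of (image, caption) pairs

    def __getitem__(self, idx):
        return self._samples[idx]

    def __len__(self):
        return len(self._samples)

    @staticmethod
    def train_transforms(target_size):
        # placeholder identity op; a real dataset would return
        # mindspore.dataset.vision operations here
        return [{"operations": [lambda img: img], "input_columns": ["image"]}]
```

build_dataloader would then read output_columns and train_transforms from the dataset instance and wire them into a MindData pipeline.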

wtomin commented 5 months ago

Very good PR. I think it now works nicely with sdv2. Considering that SD-XL has batched_transforms (transforms applied after a batch is sampled; see collate_fn), how can the current PR be adjusted to support batched_transforms as well?

hadipash commented 5 months ago

@wtomin There are no true batch transformations in SDXL: BatchedResizedAndRandomCrop, BatchedRescaler, and BatchedTranspose are sample-based and use a for loop to iterate over the samples, e.g.: https://github.com/mindspore-lab/mindone/blob/86bea01f062c129715612ac823b033c022998538/examples/stable_diffusion_xl/gm/data/mappers/batched_mappers.py#L34-L41 Additionally, in the case of videos, the same transformation (e.g., flipping) can be applied to all frames extracted from a video at once to ensure temporal consistency before batching.

In rare cases when we truly need batch transformations, we can integrate per_batch_map support.
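To make the distinction concrete, here is a small Python sketch (the classes and signatures are illustrative, not the actual SDXL or MindData code): a "batched" mapper that is really just a per-sample loop, next to the general shape of a function one would hand to MindData's per_batch_map hook, which receives one list per input column and returns the transformed columns.

```python
# Sample-based "batched" mapper: it iterates sample by sample with no
# cross-sample dependency, so the same logic can run before batching.
class BatchedTranspose:
    """Simplified stand-in for the SDXL mapper, not the real implementation."""

    def __call__(self, batch):
        # reverse the fields of each sample independently
        return [tuple(reversed(sample)) for sample in batch]


# A true batch transformation needs the whole batch at once. In MindData it
# would be passed as the per_batch_map argument of dataset.batch(); the
# function below only shows the assumed calling convention.
def pad_batch(images, captions, batch_info=None):
    """Pad every image (here: a flat list of ints) to the longest in the batch."""
    max_len = max(len(img) for img in images)
    padded = [img + [0] * (max_len - len(img)) for img in images]
    return padded, captions
```

The padding example is exactly the kind of transform that cannot be expressed per sample, since each sample needs to know the size of the largest sample in its batch.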

SamitHuang commented 5 months ago

Nice. btw, how is the training performance changed with this new data pipeline?

zhtmike commented 5 months ago

Will it break the old configuration files of the training scripts (like train_text_to_image)?

hadipash commented 5 months ago

Nice. btw, how is the training performance changed with this new data pipeline?

@SamitHuang There is no speed improvement because the network propagation time exceeds the data fetching time. A speed improvement will be noticeable if the data preprocessing stage is more resource intensive or if the network is lighter.

Currently, the main advantages of the new data pipeline are simplicity (see ImageDataset here: https://github.com/mindspore-lab/mindone/pull/301/commits/51d7d333f720a6c9ae0e1d387b01630de6da3bb9) and better compatibility with MindSpore.

Will it break the old configuration files of the training scripts (like train_text_to_image)?

@zhtmike No changes to the configuration files are required. Only slight modifications to the training script (e.g., train_text_to_image.py) are needed.