salu133445 / muspy

A toolkit for symbolic music generation
https://salu133445.github.io/muspy/
MIT License

Transforming datasets #31

Open cifkao opened 3 years ago

cifkao commented 3 years ago

MusPy seems to have a really smooth pipeline for (down)loading a dataset and then iterating over it or converting it to a PyTorch or TensorFlow dataset using one of the pre-defined representations. What I would like to do, and what doesn't currently seem easy, is to create a new dataset object by transforming an existing dataset. An example use case would be to download the Lakh dataset, filter it by some criteria, split it into short segments, apply some data augmentation, and then use the result to train a PyTorch model. Maybe something like this:

# Proposed API: `lmd` is an existing MusPy dataset (here, the Lakh dataset).
# transform() applies the function and saves the result, or reuses an existing result.
lmd_split = lmd.transform(filter_and_split_fn, "data/lmd_split")
lmd_aug = lmd_split.transform(aug_fn, "data/lmd_aug")
lmd_aug.to_pytorch_dataset(representation="pianoroll")

where each transform function would be a function taking a single Music object and returning a list of Music objects.
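
To make the idea concrete, here is a rough sketch of how such a transform step could behave. It is only an illustration: plain dicts and JSON files stand in for Music objects, and `transform_to_folder`, `filter_and_split`, and the `.done` marker file are hypothetical names, not part of MusPy.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List


def transform_to_folder(
    items: Iterable[dict],
    fn: Callable[[dict], List[dict]],
    out_dir: str,
) -> List[Path]:
    """Apply fn to every item and save each result as a JSON file in out_dir.

    If out_dir already holds a completed run (marked by a ".done" file),
    the existing files are returned and fn is not called again.
    """
    out = Path(out_dir)
    marker = out / ".done"
    if marker.exists():
        return sorted(out.glob("*.json"))

    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, item in enumerate(items):
        # One input item may produce zero, one, or many output items.
        for j, result in enumerate(fn(item)):
            path = out / f"{i:06d}_{j:04d}.json"
            path.write_text(json.dumps(result))
            paths.append(path)
    marker.touch()  # written last, so an interrupted run is not reused
    return paths


def filter_and_split(music: dict, segment_len: int = 4) -> List[dict]:
    """Hypothetical transform: drop short pieces, split the rest into segments."""
    notes = music["notes"]
    if len(notes) < segment_len:
        return []  # filtered out
    # Non-overlapping segments; a trailing partial segment is dropped.
    return [
        {"notes": notes[k:k + segment_len]}
        for k in range(0, len(notes) - segment_len + 1, segment_len)
    ]
```

Returning a list from the transform function lets a single signature cover filtering (empty list), one-to-one augmentation (one element), and splitting (many elements). With real Music objects, the JSON calls would presumably be replaced by MusPy's own load and save routines.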

Another (even more general, but maybe less efficient) possibility would be to be able to create a new dataset from a generator, e.g.:

def g():
    for music in lmd:
        yield music.transpose(1)

lmd_aug = FolderDataset.from_generator(g(), "data/lmd_aug")
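
For illustration, a `from_generator` constructor along these lines might look roughly like the sketch below. Again, this is an assumption-laden stand-in rather than MusPy code: `JsonFolderDataset` and `transposed` are hypothetical names, and items are plain dicts saved as JSON instead of Music objects.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


class JsonFolderDataset:
    """Hypothetical folder-backed dataset (a stand-in, not MusPy's FolderDataset)."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.paths = sorted(self.root.glob("*.json"))

    @classmethod
    def from_generator(cls, items: Iterable[dict], root: str) -> "JsonFolderDataset":
        """Write every yielded item to root as JSON, then open root as a dataset."""
        out = Path(root)
        out.mkdir(parents=True, exist_ok=True)
        for i, item in enumerate(items):
            (out / f"{i:06d}.json").write_text(json.dumps(item))
        return cls(root)

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> dict:
        # Items are loaded lazily, one file per access.
        return json.loads(self.paths[index].read_text())


def transposed(items: Iterable[dict], semitones: int) -> Iterator[dict]:
    """Yield a copy of each item with every pitch shifted by `semitones`."""
    for item in items:
        yield {"pitches": [p + semitones for p in item["pitches"]]}
```

Because the generator is consumed one item at a time, the source dataset never has to fit in memory; the cost is that the whole output is written to disk before the new dataset can be used.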