pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

A more powerful Mapper that can restrict function application to only part of the datapipe items? #754

Open NicolasHug opened 2 years ago

NicolasHug commented 2 years ago

We often have datapipes that return tuples (img, target) where we just want to call transformations on the img, but not the target. Sometimes it's the opposite: I want to apply a function to the target, and not to the img. This usually forces us to write wrappers that "passthrough" either the img or the target. For example:


def decode_img_only(data):  # boilerplate wrapper
    img, target = data
    img = decode(img)
    return img, target

def resize_img_only(data):  # boilerplate wrapper
    img, target = data
    img = resize(img)
    return img, target

def add_label_noise(data):  # boilerplate wrapper
    img, target = data
    target = make_noisy_label(target)
    return img, target

dp = ...
dp = dp.map(decode_img_only).map(resize_img_only).map(add_label_noise)

Perhaps a more convenient way of doing this would be to implement something similar to WebDataset's map_dict and map_tuple? This would avoid all the boilerplate wrappers. For example we could imagine the code above to simply be:

dp = ...
dp = dp.map_tuple(decode, None).map_tuple(resize, None).map_tuple(None, make_noisy_label)
# or even
dp = dp.map_tuple(decode, None).map_tuple(resize, make_noisy_label)

# if the datapipe was returning a dict with "img" and "target" keys this could also be

dp = dp.map_dict(img=decode).map_dict(img=resize, target=make_noisy_label)
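To make the proposal concrete, here is a minimal pure-Python sketch of what such helpers could do. It is not torchdata or WebDataset code: the free functions `map_tuple` and `map_dict`, their signatures, and the toy data below are all assumptions made for illustration, with `None` meaning "leave this element untouched".

```python
from typing import Any, Callable, Iterable, Iterator, Optional


def map_tuple(items: Iterable[tuple], *fns: Optional[Callable[[Any], Any]]) -> Iterator[tuple]:
    """Apply fns[i] to element i of each tuple; None leaves that element unchanged."""
    for item in items:
        if len(item) != len(fns):
            raise ValueError(f"expected {len(fns)} elements, got {len(item)}")
        yield tuple(x if fn is None else fn(x) for fn, x in zip(fns, item))


def map_dict(items: Iterable[dict], **fns: Callable[[Any], Any]) -> Iterator[dict]:
    """Apply fns[key] to item[key]; keys without a function pass through."""
    for item in items:
        out = dict(item)  # shallow copy so the input dict is not mutated
        for key, fn in fns.items():
            out[key] = fn(out[key])
        yield out


# Toy stand-ins for decode / make_noisy_label (hypothetical data and functions).
samples = [("cat.png", 0), ("dog.png", 1)]
tuple_result = list(map_tuple(samples, str.upper, None))

records = [{"img": "cat.png", "target": 0}]
dict_result = list(map_dict(records, img=str.upper, target=lambda t: t + 1))
```

Here `tuple_result` only has its first element transformed, while `dict_result` has both keys transformed in a single pass.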

I even think it might be possible to implement all of map_dict() and map_tuple() functionalities within the .map() function:

CC @pmeier and @msaroufim to whom this might be of interest

NivekT commented 2 years ago

The argument input_col should allow you to do that with map. You need to download the latest version though.

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper([("a", 1), ("b", 2)]).map(fn=lambda char: char + char, input_col=0)
print(list(dp))  # [('aa', 1), ('bb', 2)]

dp = IterableWrapper([("a", 1), ("b", 2)]).map(fn=lambda i: i + 10, input_col=1)
print(list(dp))  # [('a', 11), ('b', 12)]

NicolasHug commented 2 years ago

Thanks @NivekT , I missed that.

Maybe it's a matter of personal preference but I tend to find something like

map_dict(img=decode, target=make_noisy_label)

more natural than

map(decode, input_col="img").map(make_noisy_label, input_col="target")

and similarly on tuples. I think what unsettles me is the use of "col" which isn't common in the vision domain. There are no columns in the dict, or in the tuple. Perhaps this comes from compatibility concerns with torcharrow?

> I even think it might be possible to implement all of map_dict() and map_tuple() functionalities within the .map() function:

I understand this isn't possible now because .map() already has other parameters (map_tuple() and map_dict() should still be doable though, should we want them).

NivekT commented 2 years ago

Adding map_tuple and map_dict should be easy. It will mostly just be wrappers around the existing map implementation (i.e. passing key of dict to input_col and idx of tuple to input_col).
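As a rough sketch of that wrapper idea, the snippet below chains one column-restricted map per function. To stay self-contained it does not import torchdata: `map_col` is a hypothetical pure-Python stand-in that mimics what map(fn, input_col=...) does in the example earlier in this thread, and the `map_tuple` / `map_dict` signatures are assumptions.

```python
def map_col(items, fn, input_col):
    """Stand-in for map(fn, input_col=...): apply fn to one index or key only."""
    for item in items:
        if isinstance(item, tuple):
            as_list = list(item)
            as_list[input_col] = fn(as_list[input_col])
            yield tuple(as_list)
        else:
            out = item.copy()  # works for dicts and lists
            out[input_col] = fn(out[input_col])
            yield out


def map_tuple(items, *fns):
    """Wrapper: pass each tuple index as input_col; None skips that position."""
    for idx, fn in enumerate(fns):
        if fn is not None:
            items = map_col(items, fn, input_col=idx)
    return items


def map_dict(items, **fns):
    """Wrapper: pass each dict key as input_col."""
    for key, fn in fns.items():
        items = map_col(items, fn, input_col=key)
    return items


pairs = [("a", 1), ("b", 2)]
tuple_out = list(map_tuple(pairs, lambda s: s + s, lambda i: i + 10))

rows = [{"img": "a", "target": 1}]
dict_out = list(map_dict(rows, target=lambda i: i + 10))
```

With the same toy data as the input_col example, `tuple_out` combines both of its results in one call. A real implementation would presumably return a DataPipe rather than a plain generator.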

It is a tradeoff between the cost of adding more DataPipes and names and the benefit of slightly more visibility and perhaps more intuitive naming. I would lean towards adding more examples and tutorials to expose users to what map can do instead. Users and others should definitely weigh in to let us know what they prefer.

cc: @VitalyFedyunin @ejguan

msaroufim commented 2 years ago

A more left field comment: It feels like we're slowly reinventing https://github.com/more-itertools/more-itertools

Ideally I think we should try to leverage the same API as itertools so users could do something like from torchdata import itertools and use those well-known APIs. It feels similar to the compatibility discussion between PyTorch and numpy https://github.com/pytorch/pytorch/issues/50344

NivekT commented 2 years ago

@msaroufim I think the API/functionality discussed here is different from itertools, but nonetheless I see your point. I just opened #756 to discuss.