Open · NicolasHug opened this issue 2 years ago
The argument `input_col` should allow you to do that with `map`. You need to download the latest version though.
```python
dp = IterableWrapper([("a", 1), ("b", 2)]).map(fn=lambda char: char + char, input_col=0)
print(list(dp))  # [('aa', 1), ('bb', 2)]
dp = IterableWrapper([("a", 1), ("b", 2)]).map(fn=lambda i: i + 10, input_col=1)
print(list(dp))  # [('a', 11), ('b', 12)]
```
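The same `input_col` idea extends to dict-shaped samples, where the column selector would be a key rather than a tuple index. Below is a minimal plain-Python sketch of that behavior (torchdata is not imported; `map_with_input_col` is a hypothetical helper mimicking what `.map(fn, input_col=key)` does, not the library's actual implementation):

```python
def map_with_input_col(samples, fn, input_col):
    """Apply fn to samples[i][input_col], passing other keys through untouched."""
    out = []
    for sample in samples:
        s = dict(sample)  # shallow copy so the input samples are not mutated
        s[input_col] = fn(s[input_col])
        out.append(s)
    return out

samples = [{"img": "a", "target": 1}, {"img": "b", "target": 2}]
print(map_with_input_col(samples, lambda img: img + img, "img"))
# [{'img': 'aa', 'target': 1}, {'img': 'bb', 'target': 2}]
```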
Thanks @NivekT , I missed that.
Maybe it's a matter of personal preference, but I tend to find something like `map_dict(img=decode, target=make_noisy_label)` more natural than `map(decode, input_col="img").map(make_noisy_label, input_col="target")`, and similarly on tuples. I think what unsettles me is the use of "col", which isn't common in the vision domain: there are no columns in a dict, or in a tuple. Perhaps this comes from compatibility concerns with torcharrow?
> I even think it might be possible to implement all of `map_dict()` and `map_tuple()` functionalities within the `.map()` function

I understand this isn't possible now due to the existence of the other parameters (`map_tuple()` and `map_dict()` should still be doable, though, should we want them).
Adding `map_tuple` and `map_dict` should be easy. They will mostly just be wrappers around the existing `map` implementation (i.e. passing the `key` of a dict or the `idx` of a tuple to `input_col`).
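To make the wrapper idea concrete, here is a hedged plain-Python sketch of what such `map_dict`/`map_tuple` wrappers could look like. These are hypothetical stand-ins operating on plain lists, not torchdata DataPipes; the point is only that each key or index forwards to the same per-column mapping logic that `input_col` already provides:

```python
def _map_one(sample, fn, col):
    """Apply fn to one column of a sample; col is a dict key or tuple index."""
    if isinstance(sample, dict):
        out = dict(sample)
        out[col] = fn(out[col])
        return out
    out = list(sample)
    out[col] = fn(out[col])
    return type(sample)(out)

def map_dict(samples, **fns):
    """Apply fns[key] to sample[key]; other keys pass through untouched."""
    for key, fn in fns.items():
        samples = [_map_one(s, fn, key) for s in samples]
    return samples

def map_tuple(samples, *fns):
    """Apply fns[i] to sample[i], positionally."""
    for i, fn in enumerate(fns):
        samples = [_map_one(s, fn, i) for s in samples]
    return samples

print(map_tuple([("a", 1), ("b", 2)], lambda c: c + c, lambda i: i + 10))
# [('aa', 11), ('bb', 12)]
```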
It is a tradeoff between the cost of adding more DataPipes and names, for the benefit of slightly more visibility and perhaps more intuitive naming. I would lean towards adding more examples and tutorials to expose users to what `map` can do instead. Users and others should definitely weigh in to let us know what you prefer.
cc: @VitalyFedyunin @ejguan
A more left-field comment: it feels like we're slowly reinventing https://github.com/more-itertools/more-itertools
Ideally I think we should try to leverage the same API as itertools, so users could do something like `from torchdata import itertools` and use those well-known APIs. Feels similar to the compatibility discussion between PyTorch and NumPy: https://github.com/pytorch/pytorch/issues/50344
@msaroufim I think the API/functionality discussed here is different from `itertools`, but nonetheless I see your point. I just opened #756 to discuss.
We often have datapipes that return tuples `(img, target)` where we just want to call transformations on the img, but not the target. Sometimes it's the opposite: I want to apply a function to the target, and not to the img. This usually forces us to write wrappers that "passthrough" either the img or the target. Perhaps a more convenient way of doing this would be to implement something similar to WebDataset's `map_dict` and `map_tuple`? This would avoid all the boilerplate wrappers. I even think it might be possible to implement all of `map_dict()` and `map_tuple()` functionalities within the `.map()` function.

CC @pmeier and @msaroufim to whom this might be of interest
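The "passthrough" boilerplate described above can be sketched as follows (a hypothetical plain-Python illustration, not the actual code from the issue; `apply_to_img_only` and the sample data are made up for demonstration):

```python
def apply_to_img_only(fn):
    """Wrap fn so it transforms only the img half of an (img, target) pair."""
    def wrapper(sample):
        img, target = sample
        return fn(img), target  # target passes through unchanged
    return wrapper

samples = [("raw_a", 0), ("raw_b", 1)]
decoded = [apply_to_img_only(str.upper)(s) for s in samples]
print(decoded)  # [('RAW_A', 0), ('RAW_B', 1)]
```

Every new per-column transform requires another such wrapper, which is the repetition that `input_col` (or a `map_tuple`/`map_dict` convenience) is meant to eliminate.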