pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

Additional basic functions beyond .map to allow for more functional programming #1145

Open hhoeflin opened 1 year ago

hhoeflin commented 1 year ago

🚀 The feature

For IterDataPipe, .map applies a function to each item of an iterable, where the function has the form

f: Any -> Any

Other basic building blocks could be .pipe, .iter_map and .consume.
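To make the idea concrete, here is a minimal pure-Python sketch of how the four building blocks might fit together. The Stream class is hypothetical and not part of torchdata, and the signatures for pipe, iter_map and consume are assumptions inferred from the examples in this issue (whole iterable to iterable, item to iterable, whole iterable to any value).

```python
from typing import Any, Callable, Iterable, Iterator


class Stream:
    """Hypothetical chainable wrapper around an iterable (not a torchdata class)."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source

    def __iter__(self) -> Iterator[Any]:
        return iter(self._source)

    def map(self, func: Callable[[Any], Any]) -> "Stream":
        # Assumed form f: Any -> Any, applied lazily to each item.
        return Stream(func(item) for item in self)

    def pipe(self, func: Callable[[Iterable[Any]], Iterable[Any]]) -> "Stream":
        # Assumed form f: Iterable -> Iterable, applied to the stream as a whole.
        return Stream(func(self))

    def iter_map(self, func: Callable[[Any], Iterable[Any]]) -> "Stream":
        # Assumed form f: Any -> Iterable; the per-item results are flattened.
        return Stream(out for item in self for out in func(item))

    def consume(self, func: Callable[[Iterable[Any]], Any]) -> Any:
        # Assumed form f: Iterable -> Any; this ends the chain.
        return func(self)


words = Stream(["ab", "cd"])
result = words.iter_map(list).map(str.upper).pipe(enumerate).consume(list)
print(result)
```

Streams built from generators here are single-pass, which is a simplification compared with a real DataPipe that can be iterated again.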

Motivation, pitch

Such an approach would allow for more flexible functional programming and would reduce most of the currently provided IterDataPipe classes to a simple function call. For example:

The Enumerator class would become

dp.pipe(enumerate)

This would immediately make all itertools functions usable in this context.
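As a rough illustration of the itertools point, the sketch below uses a free function named pipe as a hypothetical stand-in for the proposed dp.pipe method, applied to plain Python iterables rather than real DataPipes.

```python
import itertools
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def pipe(source: Iterable[T], func: Callable[[Iterable[T]], Iterable[U]]) -> Iterable[U]:
    """Hypothetical stand-in for the proposed dp.pipe(func): call func on the whole stream."""
    return func(source)


numbers = range(1, 6)

# Any function that takes an iterable as its first argument works unchanged.
running_totals = list(pipe(numbers, itertools.accumulate))

# Functions that need extra arguments can be wrapped in a lambda.
first_three = list(pipe(numbers, lambda it: itertools.islice(it, 3)))

# The Enumerator example from above.
indexed = list(pipe("abc", enumerate))
```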

The TarArchiveLoader could become

def iter_from_tar_archive(fd):
    ...  # code to yield files from the tar archive

dp.iter_map(iter_from_tar_archive)  # .iter_map is the proposed method
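One possible way to fill in such a generator with the standard library's tarfile module is sketched below. The iter_map function is a hypothetical stand-in for the proposed method, make_tar is a helper invented only to build test data, and yielding (name, bytes) pairs is a simplifying assumption that does not necessarily mirror what TarArchiveLoader returns.

```python
import io
import tarfile
from typing import BinaryIO, Callable, Dict, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def iter_from_tar_archive(fd: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, contents) for each regular file in a tar stream."""
    with tarfile.open(fileobj=fd, mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                extracted = archive.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()


def iter_map(source: Iterable[T], func: Callable[[T], Iterable[U]]) -> Iterator[U]:
    """Hypothetical stand-in for dp.iter_map(func): flatten func(item) over all items."""
    for item in source:
        yield from func(item)


def make_tar(files: Dict[str, bytes]) -> io.BytesIO:
    """Illustrative helper: build an in-memory tar archive from a name -> bytes mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


archives = [
    make_tar({"a.txt": b"alpha"}),
    make_tar({"b.txt": b"beta", "c.txt": b"gamma"}),
]
files = list(iter_map(archives, iter_from_tar_archive))
```

Contents are read eagerly here because the archive is closed once the generator finishes; a streaming variant would need to keep the archive open for as long as the yielded file objects are in use.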

I believe that with this approach almost all provided classes could be written with less boilerplate as generator functions: essentially, the code inside __iter__ becomes a standalone generator function, possibly curried for convenience if other parameters are used.
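The currying idea could look roughly like the following sketch, where a batching step is written as a standalone generator and its parameters are bound with functools.partial. The names batched and batch_by_two are illustrative choices, not torchdata APIs.

```python
from functools import partial
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(source: Iterable[T], batch_size: int, drop_last: bool = False) -> Iterator[List[T]]:
    """Group items into lists of batch_size; this is what the body of __iter__ would contain."""
    batch: List[T] = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, shorter batch unless the caller asked to drop it.
    if batch and not drop_last:
        yield batch


# Bind the extra parameter so that only the iterable argument remains.
batch_by_two = partial(batched, batch_size=2)

batches = list(batch_by_two(range(5)))
strict_batches = list(partial(batched, batch_size=2, drop_last=True)(range(5)))
```

With the proposed method, the curried function could then be passed directly, as in dp.pipe(batch_by_two).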

It would be great to hear whether this was considered. Thanks!

Alternatives

The effect of .pipe can already be achieved with

dp2 = IterableWrapper(enumerate(dp))

but I believe this is a lot less nice than the form proposed above:

dp.pipe(enumerate)

hhoeflin commented 1 year ago

Just wanted to ping about this issue. It would be great to hear the development team's perspective. Even after looking into it more, it still appears to me that most of the functionality provided could be exposed as individual functions.

It would be great to know whether I am missing or misunderstanding something about the functionality of torchdata.

Thanks