mkuchnik opened this issue 10 months ago
@mkuchnik Hey Michael, that sounds very nice! Another way to look at it is to care about performance in some isolated cases only:
Indeed, this means that the data is already prepared for data-intensive ML workflows. In other cases:
So when both 1. and 2. hold, we could adapt `tfds.data_source` and `datasets.Dataset` to work smoothly with `torch.utils.data.DataLoader` and Croissant.
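For instance, a minimal sketch of that fast path (assuming the dataset is already prepared in a random-access format; per the TFDS docs, `tfds.data_source` exposes `__getitem__`/`__len__`, which `DataLoader` accepts as a map-style dataset):

```python
import tensorflow_datasets as tfds
import torch

# Random-access source: no tf.data graph, no intermediate serialization.
source = tfds.data_source("mnist", split="train")

# DataLoader accepts any object with __getitem__/__len__ as a map-style dataset.
loader = torch.utils.data.DataLoader(source, batch_size=32)
batch = next(iter(loader))  # collated dict of tensors, read straight from the source
```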
What do you think? Also, what do you mean when you say "each respective dataloader variety"?
@marcenacp This is roughly what I had in mind, and indeed that criterion seems like an appropriate fast path. It would be great to avoid the intermediate serialization/copies in such cases.
For "each respective dataloader variety", you likely want a closure that is most compatible with the backend. For example, if the backend supports native operators, it may be more efficient to use those than plain Python.
For peak performance, each loader may have its own way of achieving certain operations. It would be useful to offer an intermediate representation that can be "lowered" to each respective dataloader variety. For example, I can imagine something like (loosely):
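A rough sketch (every name below is made up for illustration; only `tf.data`'s `filter`/`map` are real APIs):

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

# Backend-agnostic IR: a flat list of ops describing the pipeline.
@dataclass
class MapOp:
    fn: Callable[[Any], Any]

@dataclass
class FilterOp:
    predicate: Callable[[Any], bool]

def lower_to_python(pipeline: list, examples: Iterable[Any]) -> Iterable[Any]:
    """Fallback lowering: interpret the IR lazily with plain Python."""
    for op in pipeline:
        if isinstance(op, FilterOp):
            examples = filter(op.predicate, examples)
        elif isinstance(op, MapOp):
            examples = map(op.fn, examples)
    return examples

def lower_to_tf_data(pipeline: list, ds):
    """Native lowering: the same IR becomes fused, parallel tf.data ops."""
    import tensorflow as tf
    for op in pipeline:
        if isinstance(op, FilterOp):
            ds = ds.filter(op.predicate)
        elif isinstance(op, MapOp):
            ds = ds.map(op.fn, num_parallel_calls=tf.data.AUTOTUNE)
    return ds

pipeline = [FilterOp(lambda ex: ex % 2 == 0), MapOp(lambda ex: ex * 10)]
assert list(lower_to_python(pipeline, range(5))) == [0, 20, 40]
```

The IR itself stays backend-neutral; only the lowering function knows about the backend, so each loader gets native operators where they exist and plain Python otherwise.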
Croissant also has the entire graph of computations, so it is suitable as well. I wonder if it is worth exporting such an internal representation so that dataloaders can implement their own visitor pattern to "compile" down to native code without having to use intermediate data representations.
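To make that concrete, a hypothetical visitor over an exported op graph (none of these class names exist in Croissant today; this is just the shape of the idea):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Node:
    """One operation in the exported computation graph."""
    kind: str                  # e.g. "read", "map", "filter"
    fn: Callable[..., Any]
    parent: Optional["Node"] = None

class GraphVisitor:
    """Backends subclass this and emit their native pipeline per op kind."""
    def visit(self, node: Node) -> Any:
        upstream = self.visit(node.parent) if node.parent else None
        return getattr(self, f"visit_{node.kind}")(node, upstream)

class PlainPythonVisitor(GraphVisitor):
    """Reference backend: "compiles" the graph into chained iterators."""
    def visit_read(self, node: Node, upstream: Any):
        return iter(node.fn())            # fn yields raw examples
    def visit_map(self, node: Node, upstream: Any):
        return map(node.fn, upstream)
    def visit_filter(self, node: Node, upstream: Any):
        return filter(node.fn, upstream)

graph = Node("filter", lambda ex: ex > 0,
             parent=Node("map", lambda ex: ex - 1,
                         parent=Node("read", lambda: range(4))))
assert list(PlainPythonVisitor().visit(graph)) == [1, 2]
```

A tf.data or torch visitor would implement the same methods but emit `ds.map`/`ds.filter` or DataLoader-side transforms instead, so the graph compiles straight to native code without materializing intermediate data.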