mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

Native Loader Integration IR #451

Open mkuchnik opened 10 months ago

mkuchnik commented 10 months ago

For peak performance, each loader may have its own way of achieving certain operations. It would be useful to offer an intermediate representation that can be "lowered" to each respective dataloader variety. For example, I can imagine something like (loosely):

READ_FILES(["a.txt", "b.csv"]),
MAP(UNPACK_FILES),
SORT_VALUES("column_1"),
...

Croissant already holds the entire graph of computations, so that graph could serve this purpose as well. I wonder if it is worth exporting such an internal representation so that dataloaders can implement their own visitor pattern to "compile" down to native code without having to go through intermediate data representations.
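A minimal sketch of how such an IR and a visitor-based lowering could look, mirroring the loose pseudo-ops above. Every name here (`ReadFiles`, `Map`, `SortValues`, `Lowering`, `PythonLowering`, `compile_pipeline`) is hypothetical and not part of Croissant's API, and file access is replaced by an in-memory stand-in so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Hypothetical IR nodes; none of these exist in Croissant.
@dataclass(frozen=True)
class ReadFiles:
    paths: Sequence[str]

@dataclass(frozen=True)
class Map:
    fn: Callable[[Any], Any]

@dataclass(frozen=True)
class SortValues:
    key: str

class Lowering:
    """Visitor base: dispatches each node to a visit_<NodeName> method."""

    def visit(self, node):
        method = getattr(self, f"visit_{type(node).__name__}", None)
        if method is None:
            raise NotImplementedError(
                f"{type(self).__name__} cannot lower {type(node).__name__}"
            )
        return method(node)

class PythonLowering(Lowering):
    """Lowers each node to a plain-Python stage (iterable in, iterable out)."""

    def __init__(self, open_file):
        # Assumption: the backend supplies path -> iterable of records.
        self.open_file = open_file

    def visit_ReadFiles(self, node):
        def stage(_unused):
            for path in node.paths:
                yield from self.open_file(path)
        return stage

    def visit_Map(self, node):
        return lambda records: (node.fn(r) for r in records)

    def visit_SortValues(self, node):
        return lambda records: iter(sorted(records, key=lambda r: r[node.key]))

def compile_pipeline(ops, lowering):
    """Lower every op once, then return a callable that runs the stages in order."""
    stages = [lowering.visit(op) for op in ops]

    def run():
        stream = None
        for stage in stages:
            stream = stage(stream)
        return stream

    return run

# In-memory stand-in for real files, so the sketch is runnable as-is.
fake_files = {
    "a.txt": [{"column_1": 3}],
    "b.csv": [{"column_1": 1}, {"column_1": 2}],
}
ops = [
    ReadFiles(["a.txt", "b.csv"]),
    Map(lambda r: {**r, "seen": True}),
    SortValues("column_1"),
]
rows = list(compile_pipeline(ops, PythonLowering(fake_files.__getitem__))())
```

A loader with native operators would provide its own `Lowering` subclass and return native pipeline stages from the same `visit_*` hooks, so the op list itself never changes.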

marcenacp commented 10 months ago

@mkuchnik Hey Michael, that sounds very nice! Another way to look at it is to care about performance in some isolated cases only:

  1. If the graph of computations is sequential (e.g., no join)
  2. If the underlying data is "ML-optimized" (Parquet, ArrayRecord, TFRecord)

Indeed, this means that the data is already prepared for data-intensive ML workflows. In other cases:

So when cases 1 and 2 both hold, we could adapt tfds.data_source and datasets.Dataset to work smoothly with torch.utils.data.DataLoader and Croissant.
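The two conditions could be checked with something like the sketch below. It assumes the computation graph is available as a mapping from each node to its list of input nodes and that the storage format is known as a string; both are illustrative assumptions, not Croissant's actual data structures, and `can_use_fast_path` is a hypothetical name.

```python
from collections import Counter

# Formats named in the discussion as "ML-optimized".
ML_OPTIMIZED_FORMATS = {"parquet", "arrayrecord", "tfrecord"}

def is_sequential(predecessors):
    """True if the graph is a simple chain: no node joins or forks.

    `predecessors` maps each node name to the list of nodes feeding it
    (a hypothetical representation chosen for this sketch).
    """
    no_join = all(len(inputs) <= 1 for inputs in predecessors.values())
    fan_out = Counter(p for inputs in predecessors.values() for p in inputs)
    no_fork = all(count <= 1 for count in fan_out.values())
    return no_join and no_fork

def can_use_fast_path(predecessors, source_format):
    """Condition 1 (sequential graph) and condition 2 (ML-optimized data)."""
    return is_sequential(predecessors) and source_format.lower() in ML_OPTIMIZED_FORMATS

chain = {"read": [], "map": ["read"], "sort": ["map"]}
joined = {"a": [], "b": [], "join": ["a", "b"]}

fast_chain = can_use_fast_path(chain, "Parquet")
fast_joined = can_use_fast_path(joined, "Parquet")
fast_csv = can_use_fast_path(chain, "csv")
```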

What do you think? Also, what do you mean when you say "each respective dataloader variety"?

mkuchnik commented 9 months ago

@marcenacp This is roughly what I had in mind, and indeed those criteria seem like an appropriate fast path. It would be great to avoid the intermediate serialization/copies in such cases.

For "each respective dataloader variety", you likely want a closure that is most compatible with the backend. For example, if the backend supports native operators, it may be more efficient to use those than plain Python.
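As a rough illustration of picking the most compatible closure, the sketch below prefers a backend's native operator when one is registered and otherwise falls back to plain Python. `Backend`, `make_closure`, and the `"sort_values"` op name are hypothetical, and the "native" operator is only a Python stand-in for whatever a real backend would provide.

```python
class Backend:
    """Hypothetical backend description: op name -> native implementation."""

    def __init__(self, name, native_ops=None):
        self.name = name
        self.native_ops = dict(native_ops or {})

def python_sort(rows, key):
    # Plain-Python fallback used when the backend has no native operator.
    return sorted(rows, key=lambda r: r[key])

PYTHON_FALLBACKS = {"sort_values": python_sort}

def make_closure(backend, op_name, **params):
    """Return a closure bound to the native op if available, else to Python."""
    impl = backend.native_ops.get(op_name)
    origin = "native"
    if impl is None:
        impl = PYTHON_FALLBACKS[op_name]
        origin = "python"

    def closure(rows):
        return impl(rows, **params)

    closure.origin = origin  # recorded only so callers can inspect the choice
    return closure

plain = Backend("plain")
# Stand-in for a backend that ships its own sort operator.
fast = Backend("fast", {"sort_values": lambda rows, key: sorted(rows, key=lambda r: r[key])})

data = [{"column_1": 2}, {"column_1": 1}]
plain_sort = make_closure(plain, "sort_values", key="column_1")
fast_sort = make_closure(fast, "sort_values", key="column_1")
sorted_plain = plain_sort(data)
sorted_fast = fast_sort(data)
```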