mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Per-stream processing #392

Open lorabit110 opened 10 months ago

lorabit110 commented 10 months ago

🚀 Feature Request

When I use multiple Streams to create a StreamingDataset, I want to be able to use a different pre-processing function to process the data in each Stream. For example, Stream A needs special label masking while Stream B doesn't.

Motivation

This is commonly needed for multi-task training, for example UL2 training. Currently, my workaround is to add a task / source column to those streams and use my own StreamingDataset subclass to produce labels differently based on that column. However, this requires changes to the materialized datasets.
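
For readers following along, the workaround described above can be sketched roughly as below. This is only an illustration in plain Python, not code from the streaming library: the column name "source", the "prompt_len" field, the ignore value -100, and both transform functions are assumptions made up for the example. In a real StreamingDataset subclass, apply_source_transform would presumably be called on the sample returned by the parent class's __getitem__().

```python
from typing import Any, Callable, Dict

Sample = Dict[str, Any]

# Assumption: label value that the loss function is configured to skip.
IGNORE_INDEX = -100


def mask_prompt_labels(sample: Sample) -> Sample:
    """Hypothetical transform: hide the prompt tokens from the loss."""
    labels = list(sample["tokens"])
    n = min(sample["prompt_len"], len(labels))
    labels[:n] = [IGNORE_INDEX] * n
    return {**sample, "labels": labels}


def copy_labels(sample: Sample) -> Sample:
    """Hypothetical transform: use the tokens as labels unchanged."""
    return {**sample, "labels": list(sample["tokens"])}


# One transform per value of the extra column written into each stream.
TRANSFORMS: Dict[str, Callable[[Sample], Sample]] = {
    "A": mask_prompt_labels,
    "B": copy_labels,
}


def apply_source_transform(
    sample: Sample,
    transforms: Dict[str, Callable[[Sample], Sample]] = TRANSFORMS,
    column: str = "source",
) -> Sample:
    """Pick the transform based on the sample's source column and apply it."""
    return transforms[sample[column]](sample)
```

The drawback is the one already mentioned: every sample has to carry the extra column, so the datasets must be rewritten.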

karan6181 commented 10 months ago

Hi @lorabit110, are you planning to apply the pre-processing function inside the __getitem__() method, something like this?

lorabit110 commented 10 months ago

Yes. But in your example, it's a StreamingDataset-specific pre-processing function. What I need is to provide a Stream-specific pre-processing function. Or is there a way to create a mixture with multiple StreamingDatasets?

karan6181 commented 9 months ago

@lorabit110, have you tried ChainDataset? You can pass it a sequence of StreamingDataset instances, and each StreamingDataset class can have its own pre-processing logic.
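
A rough sketch of that idea, assuming ChainDataset here means PyTorch's torch.utils.data.ChainDataset. To keep it runnable without torch or streaming installed, itertools.chain stands in for ChainDataset and a small hypothetical TransformedDataset class stands in for a StreamingDataset subclass with its own pre-processing; neither stand-in is part of either library.

```python
from itertools import chain
from typing import Any, Callable, Iterator, Sequence


class TransformedDataset:
    """Stand-in for a dataset class that applies its own pre-processing."""

    def __init__(self, samples: Sequence[Any], transform: Callable[[Any], Any]):
        self.samples = samples
        self.transform = transform

    def __iter__(self) -> Iterator[Any]:
        # Each dataset applies only its own transform to its own samples.
        for sample in self.samples:
            yield self.transform(sample)


# Dataset A gets special processing (upper-casing here), dataset B does not.
dataset_a = TransformedDataset(["a0", "a1"], str.upper)
dataset_b = TransformedDataset(["b0", "b1", "b2"], lambda s: s)

# Chaining iterates the datasets one after another, in the order given.
chained = list(chain(dataset_a, dataset_b))
```

Note that chaining yields everything from the first dataset before anything from the second, so any mixing across datasets has to come from somewhere else.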

karan6181 commented 8 months ago

@lorabit110, I am checking if you have had a chance to try the above solution.

siddk commented 2 months ago

Hey @karan6181 -- the ChainDataset solution means that I lose any proportional sampling behavior I'd get by loading multiple streams in a single StreamingDataset().

Is there no other way to apply a Stream-specific transform while keeping all the StreamingDataset() machinery?
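
To make the request concrete, here is a toy sketch of the behavior being asked for: samples drawn across streams in proportion to a per-stream weight, with a transform attached to each stream. Everything in it (ToyStream, its transform argument, mixed_samples) is hypothetical and written in plain Python; it does not reflect how StreamingDataset samples or shuffles internally.

```python
import random
from typing import Any, Callable, Iterator, Optional, Sequence


class ToyStream:
    """Hypothetical stream: samples, a sampling proportion, and a transform."""

    def __init__(
        self,
        samples: Sequence[Any],
        proportion: float,
        transform: Optional[Callable[[Any], Any]] = None,
    ):
        self.samples = samples
        self.proportion = proportion
        # Default to the identity so streams without a transform pass through.
        self.transform = transform if transform is not None else (lambda s: s)


def mixed_samples(
    streams: Sequence[ToyStream], num_samples: int, seed: int = 0
) -> Iterator[Any]:
    """Draw samples across streams by proportion, applying each stream's transform."""
    rng = random.Random(seed)
    weights = [stream.proportion for stream in streams]
    for _ in range(num_samples):
        # Choose the stream first, so the mix follows the proportions...
        stream = rng.choices(streams, weights=weights, k=1)[0]
        # ...then apply the transform that belongs to that stream only.
        yield stream.transform(rng.choice(stream.samples))


stream_a = ToyStream(["a0", "a1"], proportion=0.75, transform=lambda s: ("masked", s))
stream_b = ToyStream(["b0", "b1"], proportion=0.25)
batch = list(mixed_samples([stream_a, stream_b], num_samples=2000))
```

Here roughly three quarters of batch comes from stream A with its transform applied, and the rest comes from stream B untouched, which is the combination that chaining separate datasets does not give.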

snarayan21 commented 1 month ago

Hey @siddk, there currently isn't a per-stream processing function, but it's something we can add in the future!