mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Download optimal for device_per_stream batching method. #726

Open huxuan opened 4 months ago

huxuan commented 4 months ago

Background:

Our data is quite large and varies in size. With a size limit of 100 MB, there will only be 8 or 9 samples per shard. I have noticed that many duplicate shards are downloaded on different nodes even with shuffle disabled. I would like your suggestions on how to avoid duplicate shards.

Additional Information that may be related:

batch_size: 4
shuffle: False
sampling_granularity: 1
num_canonical_nodes: Defaults to the number of physical nodes, which is 4 in our current case
batching_method: device_per_stream
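For reference, the settings above could be expressed roughly as follows. This is a sketch, not a verbatim repro: the parameter names come from the streaming `StreamingDataset` API, but the remote/local paths are hypothetical.

```python
from streaming import StreamingDataset

# Sketch of the configuration described in this issue.
# remote/local paths are hypothetical placeholders.
dataset = StreamingDataset(
    remote='s3://my-bucket/dataset',       # hypothetical dataset location
    local='/tmp/dataset-cache',            # hypothetical local cache dir
    shuffle=False,
    batch_size=4,
    sampling_granularity=1,
    num_canonical_nodes=4,                 # matches the 4 physical nodes
    batching_method='device_per_stream',
)
```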

Thoughts

When shuffle is disabled, I assume the shards can be evenly divided among different nodes. Perhaps we could implement something like sample_limit instead of size_limit and achieve that with proper configuration?
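The divisibility intuition above can be sketched in plain Python (this ignores Streaming's actual canonical-node partitioning logic and just models a contiguous split of samples across nodes): whenever a node boundary falls in the middle of a shard, that boundary shard is needed by two nodes.

```python
# Toy model: 4 nodes, 9 samples per shard, contiguous sample ranges per node.
# A shard is downloaded twice whenever a node boundary lands inside it.

def shards_per_node(num_samples: int, num_nodes: int, samples_per_shard: int):
    """For each node, the set of shard indices its sample range touches."""
    per_node = num_samples // num_nodes
    result = []
    for node in range(num_nodes):
        start = node * per_node
        stop = start + per_node - 1  # inclusive last sample index
        result.append(set(range(start // samples_per_shard,
                                stop // samples_per_shard + 1)))
    return result

nodes = shards_per_node(num_samples=368, num_nodes=4, samples_per_shard=9)
# Shards shared by adjacent nodes (the boundary shards):
print([a & b for a, b in zip(nodes, nodes[1:])])  # → [{10}, {20}, {30}]
```

With 9-sample shards and 92 samples per node, each internal node boundary splits one shard, so three shards get fetched twice even with no shuffling.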

snarayan21 commented 4 months ago

Couple things:

While it's possible that device_per_stream batching is causing more duplicated shard downloads than necessary (it's a newly added batching method and may have some stuff to iron out), the rest of your settings seem pretty standard. Since one of Streaming's main features is that it partitions up your shard downloads among your nodes, I'd be very surprised if there were indeed many duplicate shards being downloaded.

huxuan commented 4 months ago
  • How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes so the degree of duplication should be pretty small

Yes, I can confirm that not all shards are downloaded on each node, but there are many duplicate ones. I even verified by checking the shard sizes (in bytes).
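A check like the one described can be done by listing the shard files cached on each node and intersecting the sets. A minimal sketch (the cache paths and `.mds` suffix filter are assumptions, not from this issue):

```python
import os

def cached_shards(local_dir: str) -> set:
    """Shard files (e.g. *.mds) present in one node's local cache directory."""
    return {f for f in os.listdir(local_dir) if f.endswith('.mds')}

def duplicated_shards(listings: list) -> set:
    """Shard filenames that appear in more than one node's listing."""
    seen, dups = set(), set()
    for listing in listings:
        dups |= seen & listing  # anything already seen on another node
        seen |= listing
    return dups
```

Usage would be to collect `cached_shards('/tmp/dataset-cache')` from every node, gather the resulting sets, and inspect `duplicated_shards(...)`; an empty result means no shard was downloaded twice.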

  • Is there a reason why you're using device_per_stream batching? Are all your samples homogeneous?

Not all our data is homogeneous, so we pack the same kind of data into separate streams. For the current test, however, only one stream is configured.

  • Related to the question above, are all your shards the same / similar size?

Yes, all the shards are of similar size, with size_limit configured to 100 MB.

  • To clarify, do you see duplication both with shuffle = True and shuffle = False? A high number of duplicate shard downloads should not happen, regardless of the shuffle setting.

We currently only use shuffle=False, for maximum performance.

  • While it's possible that device_per_stream batching is causing more duplicated shard downloads than necessary (it's a newly added batching method and may have some stuff to iron out), the rest of your settings seem pretty standard. Since one of Streaming's main features is that it partitions up your shard downloads among your nodes, I'd be very surprised if there were indeed many duplicate shards being downloaded.

I suspect it might be related to the number of samples per shard (8 or 9) not being evenly divisible by the batch_size (4). Or I misconfigured something else, but I could not find anything. I will try the default batching_method; please let me know if there is anything else I can try.
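The divisibility suspicion can be illustrated with a toy calculation (again ignoring Streaming's real partitioning; this just maps contiguous global sample indices to shard indices): with 9 samples per shard and a batch size of 4, some batches straddle a shard boundary and therefore touch two shards.

```python
samples_per_shard = 9
batch_size = 4

# Shard index touched by each sample of the first few global batches.
for b in range(3):
    idxs = range(b * batch_size, (b + 1) * batch_size)
    print(b, [i // samples_per_shard for i in idxs])
# batch 0 → [0, 0, 0, 0]
# batch 1 → [0, 0, 0, 0]
# batch 2 → [0, 1, 1, 1]   (straddles the shard 0 / shard 1 boundary)
```

Since 9 is not a multiple of 4, such boundary-straddling batches recur throughout the epoch; a device assigned such a batch needs both shards.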

huxuan commented 4 months ago

A quick update: after switching to the default batching method, there are no obvious duplicate shards anymore, so it seems the duplication is caused by the device_per_stream batching method. I may follow up when there is further progress on the investigation.

snarayan21 commented 4 months ago

That makes sense! Thanks for investigating. device_per_stream is a newer batching method, so it is not yet completely download-optimal. Some download optimization has been implemented to prevent massive levels of duplication, but as you're observing, it's not completely de-duplicated.

Will keep this issue open as we improve this in the future.