Open huxuan opened 4 months ago
Couple things:
- How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes, so the degree of duplication should be pretty small.
- Is there a reason why you're using device_per_stream batching? Are all your samples homogeneous?
- Related to the question above, are all your shards the same / similar size?
- To clarify, do you see duplication both with shuffle = True and shuffle = False? A high number of duplicate shard downloads should not happen, regardless of the shuffle setting.

While it's possible that device_per_stream batching is causing more duplicated shard downloads than necessary (it's a newly added batching method and may have some stuff to iron out), the rest of your settings seem pretty standard. Since one of Streaming's main features is that it partitions your shard downloads among your nodes, I'd be very surprised if there were indeed many duplicate shards being downloaded.
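For intuition, the partitioning goal described above can be illustrated with a toy contiguous split of shard indices across nodes. This is a simplified sketch, not Streaming's actual partitioning algorithm; the function name and scheme here are made up for illustration:

```python
def partition_shards(num_shards, num_nodes):
    """Toy contiguous split of shard indices across nodes. Streaming's
    real partitioning is more involved, but the goal is the same:
    each shard is assigned to (and downloaded by) as few nodes as possible."""
    per_node, extra = divmod(num_shards, num_nodes)
    assignments, start = [], 0
    for node in range(num_nodes):
        # Spread the remainder over the first `extra` nodes.
        count = per_node + (1 if node < extra else 0)
        assignments.append(list(range(start, start + count)))
        start += count
    return assignments

print(partition_shards(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Under a split like this, no shard appears in two nodes' assignments, which is why heavy duplication is unexpected.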
- How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes so the degree of duplication should be pretty small.

Yes, I can confirm that not all shards are downloaded on each node, but there are many duplicate ones. I even checked the shard sizes (in bytes).
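A check like the one described (matching shard filenames and byte sizes across nodes) can be sketched with the standard library alone. The node names, filenames, and sizes below are made up; in practice the listings would come from each node's local cache directory, e.g. via os.scandir():

```python
from collections import defaultdict

def find_duplicate_shards(cache_listings):
    """Given {node: {shard_filename: size_in_bytes}}, return the shards
    that appear in more than one node's local cache."""
    seen = defaultdict(set)
    for node, shards in cache_listings.items():
        for shard, size in shards.items():
            # Key on (filename, size) so two different files that merely
            # share a name are not counted as the same shard.
            seen[(shard, size)].add(node)
    return {key: nodes for key, nodes in seen.items() if len(nodes) > 1}

# Hypothetical listings gathered from two nodes' cache directories.
listings = {
    "node0": {"shard.00000.mds": 104857600, "shard.00001.mds": 99614720},
    "node1": {"shard.00001.mds": 99614720, "shard.00002.mds": 83886080},
}
print(find_duplicate_shards(listings))  # shard.00001.mds is on both nodes
```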
- Is there a reason why you're using device_per_stream batching? Are all your samples homogeneous?

Not all our data is homogeneous, so we pack each kind of data into its own stream. For the current test, however, only one stream is configured.
- Related to the question above, are all your shards the same / similar size?

Yes, all the shards are of similar size, with size_limit configured to 100 MB.
- To clarify, do you see duplication both with shuffle = True and shuffle = False? A high number of duplicate shard downloads should not happen, regardless of the shuffle setting.

We currently only use shuffle=False, for maximum performance.
While it's possible that device_per_stream batching is causing more duplicated shard downloads than necessary (it's a newly added batching method and may have some stuff to iron out), the rest of your settings seem pretty standard. Since one of Streaming's main features is that it partitions your shard downloads among your nodes, I'd be very surprised if there were indeed many duplicate shards being downloaded.
I suspect it might be related to the number of samples per shard (8 or 9), which the batch_size (4) does not evenly divide. Or I misconfigured something else, but I could not find what. I will try the default batching_method next; just let me know if there is anything else I can try.
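The suspicion above can be made concrete with a toy calculation (pure arithmetic, no Streaming API involved): under sequential reading with shuffle disabled, if each shard holds a number of samples that the batch size does not divide, batch boundaries regularly fall inside a shard, so a single batch can require two shards:

```python
def shards_touched_per_batch(samples_per_shard, batch_size, num_batches):
    """For sequential (shuffle=False) reading, list which shard indices
    each batch needs, assuming every shard holds a fixed sample count."""
    batches = []
    for b in range(num_batches):
        first = b * batch_size
        last = first + batch_size - 1
        # A batch needs every shard that its first and last samples fall in.
        batches.append(sorted({first // samples_per_shard,
                               last // samples_per_shard}))
    return batches

# With 8 samples per shard, every batch of 4 stays inside one shard...
print(shards_touched_per_batch(8, 4, 4))  # [[0], [0], [1], [1]]
# ...but with 9 samples per shard, some batches straddle a shard boundary.
print(shards_touched_per_batch(9, 4, 4))  # [[0], [0], [0, 1], [1]]
```

This only shows that misaligned batches span extra shards; whether that interacts with device_per_stream to cause cross-node duplication is exactly what the investigation in this thread is about.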
A quick update: after switching to the default batching method, there are no obvious duplicate shards anymore, so it seems the duplication is caused by the device_per_stream batching method. I may follow up when there is further progress on the investigation.
That makes sense, thanks for investigating! device_per_stream is a newer batching method, so it is not completely download-optimal. Some download optimization has been implemented to prevent massive levels of duplication, but as you're observing, it's not completely de-duplicated.
Will keep this issue open as we improve this in the future.
Background:
Our data is quite large and varies in size. With a size limit of 100 MB, there are only 8 or 9 samples per shard. I have noticed that many duplicate shards are downloaded on different nodes even with shuffle disabled. I would like your suggestions on how to avoid duplicate shard downloads.

Additional information that may be related:
Thoughts:
When shuffle is disabled, I assume the shards could be divided evenly among the nodes. Perhaps we could implement something like sample_limit instead of size_limit, so that an even split can be guaranteed with proper configuration?
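The sample_limit idea can be sketched without the Streaming API. Note that sample_limit is a hypothetical parameter proposed in this thread, not an existing Streaming option: instead of closing a shard when it reaches a byte limit, close it after a fixed sample count chosen as a multiple of the batch size, so every full shard holds a whole number of batches:

```python
def shard_by_sample_count(samples, sample_limit):
    """Group samples into shards of exactly sample_limit samples each
    (only the last shard may be short). sample_limit is the hypothetical
    knob suggested above, in place of a byte-based size_limit."""
    return [samples[i:i + sample_limit]
            for i in range(0, len(samples), sample_limit)]

batch_size = 4
# Pick a per-shard sample count that the batch size divides evenly.
sample_limit = 2 * batch_size  # 8 samples per shard
samples = list(range(20))      # stand-ins for real records
shards = shard_by_sample_count(samples, sample_limit)
print([len(s) for s in shards])  # [8, 8, 4]
# Every shard now holds a whole number of batches, so with shuffle
# disabled no batch ever spans two shards.
assert all(len(s) % batch_size == 0 for s in shards)
```

The trade-off is that with samples of very different sizes, fixing the sample count makes shard byte sizes uneven, which is presumably why size_limit exists in the first place.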