Open LWprogramming opened 7 months ago
One possibility that might work, provided a single data item never gets split across multiple shards, is to search the existing folder/cloud storage directory for shard names, pull the existing shards down to a temporary folder, and pick up where we left off using the index.json in that directory.

Alternatively (less efficient but maybe easier to work with), we could just start a new shard (e.g. if there are shards 0.mds.zstd through 17.mds.zstd, just create 18.mds.zstd when opening the second MDSWriter). These approaches seem plausible for Azure at least; I'm not super familiar with all the different types of uploaders in streaming/base/storage/upload.py, or with all the details of how the sample dict actually gets converted into what ends up in the .mds file. A rough sketch of the first approach is below.
@LWprogramming You can also start writing shard files to a different directory and use the merge_index
function to combine the index files from multiple directories! But to your point, starting from where you left off for shard writing would be nice. We can look into it in upcoming planning.
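Roughly, the pattern looks like this (a sketch only; the exact merge_index signature and the example schema/paths here may differ from what's in your version of streaming, so check the docs):

from streaming import MDSWriter
from streaming.base.util import merge_index

columns = {'id': 'int', 'text': 'str'}  # example schema

# Two batches of cleaned data arriving at different times (placeholder samples).
batches = [
    [{'id': i, 'text': f'sample {i}'} for i in range(0, 101)],
    [{'id': i, 'text': f'sample {i}'} for i in range(101, 201)],
]

# Write each batch to its own subdirectory so nothing gets overwritten.
for batch_idx, samples in enumerate(batches):
    with MDSWriter(out=f'/foo/out/batch_{batch_idx}', columns=columns,
                   compression='zstd') as writer:
        for sample in samples:
            writer.write(sample)

# Combine the per-subdirectory index.json files into a single index at the root,
# so /foo/out can then be used as one dataset.
merge_index('/foo/out', keep_local=True)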
Cool! Out of curiosity, when might we use something like merge_index vs. just splitting the dataset into a bunch of pieces, uploading each piece to a separate folder, and then creating a stream for each folder, e.g.:
from streaming import StreamingDataset, Stream

locals = [
    "/foo/bar1",
    "/foo/bar2",
]
remotes = [
    "azure://foo/bar1",
    "azure://foo/bar2",
]
streams = [
    Stream(local=local, remote=remote) for local, remote in zip(locals, remotes)
]
ds = StreamingDataset(streams=streams, shuffle=False)
Is the main difference in what shuffling algorithms we can use? It looks like even with multiple streams, it's possible to do dataset shuffling.
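(For context, this is roughly what I was picturing for shuffling across streams; I'm assuming shuffle_algo/shuffle_seed are the relevant knobs and that the algorithm names depend on the streaming version:)

from streaming import StreamingDataset, Stream

streams = [
    Stream(local='/foo/bar1', remote='azure://foo/bar1'),
    Stream(local='/foo/bar2', remote='azure://foo/bar2'),
]
# Shuffling appears to work even with multiple streams; shuffle_algo and
# shuffle_seed are the parameters I was wondering about.
ds = StreamingDataset(streams=streams, shuffle=True, shuffle_algo='py1b',
                      shuffle_seed=9176)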
🚀 Feature Request
Suppose we create a dataset by opening an MDSWriter and writing data points 0-100 to some output directory. If we then later create a second MDSWriter to write data points 101-200 to the same directory, the second MDSWriter overwrites the existing shards. The preferable behavior would be for it to continue writing new shards, as though we had looped through 0-200 with the original writer.
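For concreteness, a minimal sketch of the scenario (hypothetical columns and paths):

from streaming import MDSWriter

columns = {'id': 'int', 'text': 'str'}
out = 'azure://foo/cleaned'

# First pass: data points 0-100.
with MDSWriter(out=out, columns=columns, compression='zstd') as writer:
    for i in range(0, 101):
        writer.write({'id': i, 'text': f'sample {i}'})

# Later, more cleaned data arrives: data points 101-200.
# This second writer starts numbering shards from 0 again and clobbers the
# shards written above, rather than appending to the existing index.
with MDSWriter(out=out, columns=columns, compression='zstd') as writer:
    for i in range(101, 201):
        writer.write({'id': i, 'text': f'sample {i}'})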
Motivation
Cleaned data comes in piecemeal, and it would be nice to be able to just keep appending to an existing cleaned dataset that has already been converted to the MDS/StreamingDataset format. Not sure whether this would be particularly tricky or easy to do, or if it already exists and I'm missing a flag somewhere.