mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0
1.02k stars, 126 forks

Distributed Key Value Tensor Store #539

Open OrenLeung opened 6 months ago

OrenLeung commented 6 months ago

Is it possible to use a streaming dataset as a distributed key-value store?

I have a set of keys (strings like "xyz_123"), each of which corresponds to a NumPy array.

Ideally I could do something like:

np_array = dataset["xyz_123"]

But I see with MDSWriter.write that the keys of the dataset are just sequential integers, and I can't change them.

Is there a way to have a custom key for MDSWriter?

karan6181 commented 6 months ago

Hi @OrenLeung, what is the size of the dataset, and how many unique keys do you have?

OrenLeung commented 5 months ago

@karan6181 the size is about 1 TB, with about 100k unique keys.
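
One possible workaround, sketched here only as an illustration and not as a feature of the library: keep the sequential sample indices and put a small key-to-index map on top of them. With about 100k keys the map easily fits in memory, while the arrays themselves (roughly 10 MB each on average, given 1 TB across 100k keys) stay in the dataset. The `KeyedView` class below is hypothetical, and the plain Python list stands in for any dataset that supports random access by integer index; whether a given streaming dataset supports that access pattern efficiently is an assumption to check.

```python
from typing import Any, Dict, Iterable, Sequence


class KeyedView:
    """Hypothetical wrapper that maps string keys onto an integer-indexed dataset."""

    def __init__(self, dataset: Sequence[Any], keys: Iterable[str]) -> None:
        self._dataset = dataset
        self._index: Dict[str, int] = {}
        # The i-th key names the i-th sample, in the order the samples were written.
        for position, key in enumerate(keys):
            if key in self._index:
                raise ValueError(f"duplicate key: {key!r}")
            self._index[key] = position
        if len(self._index) != len(dataset):
            raise ValueError("need exactly one key per sample")

    def __getitem__(self, key: str) -> Any:
        # Raises KeyError for unknown keys, like a regular dict.
        return self._dataset[self._index[key]]

    def __contains__(self, key: object) -> bool:
        return key in self._index

    def __len__(self) -> int:
        return len(self._index)


# Stand-in for a dataset indexed by sequential integers (assumption for the demo).
samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
keys = ["xyz_123", "xyz_124", "abc_001"]

dataset = KeyedView(samples, keys)
np_array = dataset["xyz_123"]
```

The key list has to be recorded in write order, for example by storing the key as an extra column in each sample or in a sidecar file next to the shards; either choice is an assumption of this sketch.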