mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Support online de-compressing of shards on LocalDataset as it is already done for StreamingDataset #416

Closed sagnak closed 1 year ago

sagnak commented 1 year ago

Support online de-compressing of shards on LocalDataset as it is already done for StreamingDataset

When one creates a Mosaic dataset to be streamed from the cloud with StreamingDataset, the library can decompress each compressed shard into an .mds file on the fly. LocalDataset does not support this: if the same dataset is read from a local filesystem, the library does not decompress the shards automatically. Below is the error I get when I use a dataset with compressed shards:

In [1]: from streaming import LocalDataset

In [2]: ds = LocalDataset('/mnt/data-obsd/mosaic/dataset')

In [3]: ds[0]
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[3], line 1
----> 1 ds[0]

File ~/.local/lib/python3.8/site-packages/streaming/base/array.py:90, in Array.__getitem__(self, at)
     88     if -self.size <= at < 0:
     89         at += self.size
---> 90     return self.get_item(at)
     91 elif isinstance(at, slice):
     92     items = []

File ~/.local/lib/python3.8/site-packages/streaming/base/local.py:77, in LocalDataset.get_item(self, sample_id)
     75 shard_id, index_in_shard = self.spanner[sample_id]
     76 shard = self.shards[shard_id]
---> 77 return shard[index_in_shard]

File ~/.local/lib/python3.8/site-packages/streaming/base/array.py:90, in Array.__getitem__(self, at)
     88     if -self.size <= at < 0:
     89         at += self.size
---> 90     return self.get_item(at)
     91 elif isinstance(at, slice):
     92     items = []

File ~/.local/lib/python3.8/site-packages/streaming/base/format/base/reader.py:257, in Reader.get_item(self, idx)
    248 def get_item(self, idx: int) -> Dict[str, Any]:
    249     """Get the sample at the index.
    250 
    251     Args:
   (...)
    255         Dict[str, Any]: Sample dict.
    256     """
--> 257     data = self.get_sample_data(idx)
    258     return self.decode_sample(data)

File ~/.local/lib/python3.8/site-packages/streaming/base/format/mds/reader.py:121, in MDSReader.get_sample_data(self, idx)
    119 filename = os.path.join(self.dirname, self.split, self.raw_data.basename)
    120 offset = (1 + idx) * 4
--> 121 with open(filename, 'rb', 0) as fp:
    122     fp.seek(offset)
    123     pair = fp.read(8)

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/data-obsd/mosaic/dataset/shard.00000.mds'

In [4]: !ls /mnt/data-obsd/mosaic/dataset/shard.00000.mds*
/mnt/data-obsd/mosaic/dataset/shard.00000.mds.zstd
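
The mismatch shown above (the reader looks for `shard.00000.mds`, but only `shard.00000.mds.zstd` exists) can be detected before indexing into the dataset. Below is a minimal sketch, assuming compressed shards are named by appending the compression algorithm to the raw shard name, as in the listing above. The helper `find_compressed_only_shards` is hypothetical and not part of the streaming library; it uses only the Python standard library.

```python
from pathlib import Path
from typing import List


def find_compressed_only_shards(dataset_dir: str) -> List[str]:
    """Return names of compressed shards whose raw `.mds` file is missing."""
    root = Path(dataset_dir)
    missing = []
    # Assumption: a compressed shard is named `<raw name>.<algorithm>`,
    # e.g. `shard.00000.mds.zstd` for the raw file `shard.00000.mds`.
    for path in sorted(root.glob('shard.*.mds.*')):
        raw_path = path.with_suffix('')  # strip the compression suffix
        if not raw_path.exists():
            missing.append(path.name)
    return missing
```

If the returned list is non-empty for a directory such as `/mnt/data-obsd/mosaic/dataset`, reading a sample from one of those shards through LocalDataset would likely fail with the `FileNotFoundError` shown above.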

karan6181 commented 1 year ago

Hi @sagnak, StreamingDataset can be used both to stream data from the cloud and to load data from a local directory. For example, if your dataset resides locally, you can create it in either of these ways:

from streaming import StreamingDataset
dataset = StreamingDataset(local='/mnt/data-obsd/mosaic/dataset')  # Reads the data directly from the `local` directory
dataset = StreamingDataset(remote='/mnt/data-obsd/mosaic/dataset', local='/tmp/dataset')  # Alternative: copies the data from `remote` to `local`, then reads it from `local`

karan6181 commented 1 year ago

Hi @sagnak, does the solution above work for you? I am closing this issue for now. Please feel free to re-open it if you still see the problem. Thanks!