mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Handle zero-sized ndarray more gracefully #695

Closed huxuan closed 2 weeks ago

huxuan commented 3 weeks ago

Environment

To reproduce

Steps to reproduce the behavior:

  1. Create a Python script (e.g. main.py) with the following snippet:

    from streaming import MDSWriter
    import numpy as np
    
    with MDSWriter(out="data/", columns={"data": "ndarray"}) as writer:
        writer.write({"data": np.empty(())})
  2. Run it with python main.py. It fails with the following traceback:

    Traceback (most recent call last):
     File "/Users/huxuan/Code/test/1.py", line 5, in <module>
       writer.write({"data": np.empty(())})
     File "/Users/huxuan/Code/test/.venv/lib/python3.12/site-packages/streaming/base/format/base/writer.py", line 259, in write
       new_sample = self.encode_sample(sample)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/huxuan/Code/test/.venv/lib/python3.12/site-packages/streaming/base/format/mds/writer.py", line 106, in encode_sample
       datum = mds_encode(encoding, value)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/huxuan/Code/test/.venv/lib/python3.12/site-packages/streaming/base/format/mds/encodings.py", line 619, in mds_encode
       return coder.encode(obj)
              ^^^^^^^^^^^^^^^^^
     File "/Users/huxuan/Code/test/.venv/lib/python3.12/site-packages/streaming/base/format/mds/encodings.py", line 244, in encode
       shape_dtype = self._rightsize_shape_dtype(shape_arr)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/huxuan/Code/test/.venv/lib/python3.12/site-packages/streaming/base/format/mds/encodings.py", line 204, in _rightsize_shape_dtype
       if shape.min() <= 0:
          ^^^^^^^^^^^
     File "/Users/huxuan/Code/test/.venv/lib/python3.12/site-packages/numpy/core/_methods.py", line 45, in _amin
       return umr_minimum(a, axis, None, out, keepdims, initial, where)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ValueError: zero-size array to reduction operation minimum which has no identity
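
For context, the failure can be reproduced with NumPy alone. Note that `np.empty(())` is a 0-dimensional array: it holds one element but has an empty shape tuple, so the zero-size array in the error message is the array built from the shape, not the data itself. The snippet below is a minimal sketch; the way `shape_arr` is built here is an assumption about what the encoder does, inferred from the traceback rather than taken from the library source.

```python
import numpy as np

arr = np.empty(())  # 0-dimensional array: one element, no axes
print(arr.ndim, arr.shape, arr.size)

# Assumption: the encoder converts the shape tuple into an array,
# roughly like this, before calling .min() on it.
shape_arr = np.asarray(arr.shape, dtype=np.int64)
print(shape_arr.size)


def min_error_message(shape_arr: np.ndarray):
    """Return the ValueError message raised by shape_arr.min(), or None."""
    try:
        shape_arr.min()
    except ValueError as err:
        return str(err)
    return None


message = min_error_message(shape_arr)
print(message)
```

Since the shape array has no entries, `min()` has nothing to reduce over and NumPy raises the `ValueError` seen at the bottom of the traceback.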

Expected behavior

I think we should either handle this case silently or give a more friendly error message.

Additional context

snarayan21 commented 2 weeks ago

Hey @huxuan, thanks for raising this issue! If you wouldn't mind, could you open up a PR that adds a better error message (ValueError) in this ndarray encoding function for when the input np array has no elements? We always welcome open source contributions!
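
A rough sketch of what such a check might look like is given below. The function name `check_encodable` and the message wording are hypothetical and not part of the streaming API; where the check would actually live in the encoder is left open. The second branch assumes that arrays with a zero-length dimension should also be rejected, which the `shape.min() <= 0` test in the traceback suggests but does not confirm.

```python
import numpy as np


def check_encodable(obj: np.ndarray) -> None:
    """Hypothetical pre-encoding check with friendlier error messages."""
    if obj.ndim == 0:
        # 0-dimensional arrays have an empty shape tuple, which is what
        # makes the shape reduction in the traceback fail.
        raise ValueError(
            "Cannot encode a 0-dimensional ndarray (shape ()). "
            "Reshape it first, e.g. arr.reshape(1)."
        )
    if obj.size == 0:
        # Assumption: zero-length dimensions are unsupported as well.
        raise ValueError(
            f"Cannot encode an ndarray with no elements (shape {obj.shape})."
        )


check_encodable(np.zeros((2, 3)))  # passes silently

try:
    check_encodable(np.empty(()))
except ValueError as err:
    print(err)
```

Checking `ndim` and `size` separately keeps the two cases distinguishable in the message, since a 0-dimensional array does contain one element.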