mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

AttributeError when trying to convert Imagenet1k #707

Closed Hprairie closed 1 week ago

Hprairie commented 1 week ago

To reproduce

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/work/09753/hprairie/ls6/projects/MambaX/src/training/data/imagenet1k.py", line 172, in <module>
    main(parse_args())
  File "/work/09753/hprairie/ls6/projects/MambaX/src/training/data/imagenet1k.py", line 162, in main
    out.write(
  File "/work/09753/hprairie/ls6/miniconda3/envs/mambaX/lib/python3.11/site-packages/streaming/base/format/base/writer.py", line 259, in write
    new_sample = self.encode_sample(sample)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/09753/hprairie/ls6/miniconda3/envs/mambaX/lib/python3.11/site-packages/streaming/base/format/mds/writer.py", line 106, in encode_sample
    datum = mds_encode(encoding, value)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/09753/hprairie/ls6/miniconda3/envs/mambaX/lib/python3.11/site-packages/streaming/base/format/mds/encodings.py", line 619, in mds_encode
    return coder.encode(obj)
           ^^^^^^^^^^^^^^^^^
  File "/work/09753/hprairie/ls6/miniconda3/envs/mambaX/lib/python3.11/site-packages/streaming/base/format/mds/encodings.py", line 89, in encode
    self._validate(obj, int)
  File "/work/09753/hprairie/ls6/miniconda3/envs/mambaX/lib/python3.11/site-packages/streaming/base/format/mds/encodings.py", line 57, in _validate
    raise AttributeError(
AttributeError: data should be of type <class 'int'>, but instead, found as <class 'numpy.int64'>

I get the error above when trying to convert Imagenet1k. I am using the script provided in the repo to convert the Imagenet1k image folder to the streaming format.

Any quick workaround, or a place to download the MDS dataset directly, would be very much appreciated.

Hprairie commented 1 week ago

Ahh, nvm, simple issue: the label i is a numpy.int64, so we just need to wrap it with int(...) when encoding the data. This should probably be updated in vision/convert/imagenet.py.
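
For anyone hitting the same error, here is a minimal sketch of that workaround. The helper names to_builtin_int and make_sample and the column names 'x' and 'y' are illustrative assumptions, not taken from the conversion script. The only point is that the label is converted to a built-in int before the sample dict is passed to out.write(...).

```python
import operator


def to_builtin_int(value):
    """Return value as a built-in Python int.

    Accepts any integer-like scalar that implements __index__
    (numpy.int64 does) and raises TypeError for floats or strings,
    so a bad label fails here instead of inside the encoder.
    """
    return int(operator.index(value))


def make_sample(image_bytes, class_index):
    # Hypothetical sample layout: the keys 'x' and 'y' are assumptions
    # and must match whatever columns the writer was created with.
    return {'x': image_bytes, 'y': to_builtin_int(class_index)}
```

In the conversion loop this would look roughly like out.write(make_sample(data, i)), or simply int(i) inline if a helper feels like overkill.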

karan6181 commented 1 week ago

@Hprairie Thanks for bringing this to our attention. If you have the bandwidth, would you mind creating a pull request by following the contributing guidelines?

Hprairie commented 1 week ago

Yeah for sure!