huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
18.92k stars 2.62k forks

Save nparray as list #7049

Closed Sakurakdx closed 1 month ago

Sakurakdx commented 1 month ago

Describe the bug

When I use the map function to convert images into features, datasets saves the numpy array as a list. Some people use the set_format function to convert the column back, but doesn't this lose precision?

Steps to reproduce the bug

the map function

import os

from PIL import Image

def convert_image_to_features(inst, processor, image_dir):
    # Resolve the local image path from the stored URL
    image_file = inst["image_url"]
    file = image_file.split("/")[-1]
    image_path = os.path.join(image_dir, file)
    image = Image.open(image_path)
    image = image.convert("RGBA")

    # The processor returns a numpy array under "pixel_values"
    inst["pixel_values"] = processor(images=image, return_tensors="np")["pixel_values"]
    return inst

main function

from functools import partial

map_fun = partial(
    convert_image_to_features, processor=processor, image_dir=image_dir
)
ds = ds.map(map_fun, batched=False, num_proc=20)
print(type(ds[0]["pixel_values"]))

Expected behavior

<class 'list'>

Environment info

Sakurakdx commented 1 month ago

In addition, when I use set_format and then index the ds, the following error occurs. The code:

ds.set_format(type="np", colums="pixel_values")

error

(screenshot of the error traceback)
lhoestq commented 1 month ago

Some people use the set_format function to convert the column back, but doesn't this lose precision?

Under the hood the data is saved in Arrow format using the same precision as your numpy arrays. By default the Arrow data is read back as Python lists, but you can indeed read it back as numpy arrays with the same precision

lhoestq commented 1 month ago

(you can fix your second issue by fixing the typo colums -> columns)

Sakurakdx commented 1 month ago

(you can fix your second issue by fixing the typo colums -> columns)

You are right, I was careless. Thank you.

Sakurakdx commented 1 month ago

Some people use the set_format function to convert the column back, but doesn't this lose precision?

Under the hood the data is saved in Arrow format using the same precision as your numpy arrays. By default the Arrow data is read back as Python lists, but you can indeed read it back as numpy arrays with the same precision

Yes, after testing I found that there was no loss of precision. Thanks again for your answer.