huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

mismatch for datatypes when providing `Features` with `Array2D` and user specified `dtype` and using with_format("numpy") #7254

Open Akhil-CM opened 3 weeks ago

Akhil-CM commented 3 weeks ago

Describe the bug

If the user passes a Features value to datasets.Dataset whose members are Array2D with an explicit dtype, that dtype is not respected by with_format("numpy"), which should return an np.array with the dtype the user provided for Array2D. It seems that floats are always returned as float32 and ints as int64.

Steps to reproduce the bug

import numpy as np
import datasets
from datasets import Dataset, Features, Array2D

print(f"datasets version: {datasets.__version__}")

data_info = {
    "arr_float" : "float64",
    "arr_int" : "int32"
}

sample = {key : [np.zeros([4, 5], dtype=dtype)] for key, dtype in data_info.items()}

features = {key : Array2D(shape=(None, 5), dtype=dtype) for key, dtype in data_info.items()}
features = Features(features)

dataset = Dataset.from_dict(sample, features=features)

ds = dataset.with_format("numpy")
for key in features:
    print(f"{key} feature dtype: ", ds.features[key].dtype)
    print(f"{key} dtype:", ds[key].dtype)

Output:

datasets version: 3.0.2
arr_float feature dtype:  float64
arr_float dtype: float32
arr_int feature dtype:  int32
arr_int dtype: int64

Expected behavior

with_format("numpy") should return an np.array whose dtype matches the dtype the user provided for the corresponding member of the Features value.
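
Until that is the case, one possible workaround is to cast the returned arrays back to the declared dtypes by hand. The sketch below is only an illustration: cast_to_declared_dtypes is a hypothetical helper, not part of datasets, and the returned dict is a stand-in that mimics the dtypes shown in the output above, so the snippet needs only NumPy to run.

```python
import numpy as np


def cast_to_declared_dtypes(columns, declared_dtypes):
    """Hypothetical helper: cast each array to the dtype declared for its column."""
    out = {}
    for name, values in columns.items():
        target = declared_dtypes.get(name)
        if target is None:
            # Columns without a declared dtype are passed through unchanged.
            out[name] = np.asarray(values)
        else:
            out[name] = np.asarray(values, dtype=target)
    return out


# Stand-in for what with_format("numpy") returned in the report above
# (float32 / int64 instead of the declared float64 / int32).
returned = {
    "arr_float": np.zeros((1, 4, 5), dtype="float32"),
    "arr_int": np.zeros((1, 4, 5), dtype="int64"),
}
declared = {"arr_float": "float64", "arr_int": "int32"}

fixed = cast_to_declared_dtypes(returned, declared)
for name, arr in fixed.items():
    print(f"{name} dtype:", arr.dtype)
```

With the reproduction script, the two dicts could presumably be built as {key: ds[key] for key in features} and {key: features[key].dtype for key in features}. Note that this only restores the dtype label: if the values were already narrowed to float32 during formatting, casting back to float64 does not recover the lost precision.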

Environment info

Akhil-CM commented 3 weeks ago

It seems that https://github.com/huggingface/datasets/issues/5517 is exactly the same issue.

It was mentioned there that this would be fixed in version 3.x.