huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars 2.7k forks

Modify add_column() to optionally accept a FeatureType as param #7143

Closed: varadhbhatnagar closed this 2 months ago

varadhbhatnagar commented 2 months ago

Fix #7142.

Before (Add + Cast):

from datasets import load_dataset, Value
ds = load_dataset("rotten_tomatoes", split="test")
lst = [i for i in range(len(ds))]

ds = ds.add_column("new_col", lst)
# Assigns int64 to new_col by default
print(ds.features)

ds = ds.cast_column("new_col", Value(dtype="uint16", id=None))
print(ds.features)

Before (Numpy Workaround):

from datasets import load_dataset
import numpy as np
ds = load_dataset("rotten_tomatoes", split="test")
lst = [i for i in range(len(ds))]

ds = ds.add_column("new_col", np.array(lst, dtype=np.uint16))
print(ds.features)

After:

from datasets import load_dataset, Value
ds = load_dataset("rotten_tomatoes", split="test")
lst = [i for i in range(len(ds))]
val = Value(dtype="uint16", id=None)
ds = ds.add_column("new_col", lst, feature=val)
print(ds.features)

varadhbhatnagar commented 2 months ago

Requesting review, @lhoestq. I will also update the docs if this looks good.

lhoestq commented 2 months ago

Cool! Maybe you can rename the argument to feature, with type FeatureType? That way it would work the same way as .cast_column().

varadhbhatnagar commented 2 months ago

@lhoestq Since there is no way to get a pyarrow.Schema from a FeatureType, I had to go via Features. How does this look?

HuggingFaceDocBuilderDev commented 2 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

varadhbhatnagar commented 2 months ago

@lhoestq done!

varadhbhatnagar commented 2 months ago

@lhoestq anything pending on this?