huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Iterable dataset.filter should not override features #7208

Closed: alex-hh closed this issue 1 month ago

alex-hh commented 1 month ago

Describe the bug

Calling filter on an iterable dataset sets its features to None, so schema information such as column_names is lost.

Steps to reproduce the bug

import numpy as np
import time
from datasets import Dataset, Features, Array3D

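# Two variable-length float32 array columns with an explicit schema.
features = Features(**{"array0": Array3D((None, 10, 10), dtype="float32"), "array1": Array3D((None, 10, 10), dtype="float32")})
dataset = Dataset.from_dict({f"array{i}": [np.zeros((x, 10, 10), dtype=np.float32) for x in [2000, 1000] * 25] for i in range(2)}, features=features)
ds = dataset.to_iterable_dataset()
orig_column_names = ds.column_names
# A filter that keeps every row should leave the schema untouched.
ds = ds.filter(lambda x: True)
# Fails on 3.0.2: features is None after filter, so column_names no longer matches.
assert ds.column_names == orig_column_names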

Expected behavior

filter should preserve the dataset's features, since it only drops rows and never changes the columns or their types.
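
To make the expectation concrete, here is a minimal sketch in plain Python of a lazy dataset whose filter carries the schema over. LazyDataset and its fields are hypothetical names invented for this illustration; this is not the datasets API and not necessarily how the library fixes the issue.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

Example = Dict[str, Any]


@dataclass
class LazyDataset:
    """Hypothetical stand-in for a lazily evaluated dataset (not the datasets API)."""

    # A factory rather than a one-shot iterator, so the dataset can be re-iterated.
    source: Callable[[], Iterable[Example]]
    # Column name -> type description; None means the schema is unknown.
    features: Optional[Dict[str, str]] = None

    @property
    def column_names(self) -> Optional[List[str]]:
        # Without features there is no schema to report.
        return list(self.features) if self.features is not None else None

    def __iter__(self) -> Iterator[Example]:
        return iter(self.source())

    def filter(self, predicate: Callable[[Example], bool]) -> "LazyDataset":
        # Filtering only drops rows, so the parent's features are passed through.
        parent = self
        return LazyDataset(
            source=lambda: (ex for ex in parent if predicate(ex)),
            features=parent.features,
        )


rows = [{"array0": [0.0], "array1": [1.0]}, {"array0": [2.0], "array1": [3.0]}]
ds = LazyDataset(lambda: rows, features={"array0": "float32", "array1": "float32"})
filtered = ds.filter(lambda ex: ex["array0"][0] > 1.0)

# The schema survives the filter even though rows were dropped.
assert filtered.column_names == ds.column_names
print(filtered.column_names, list(filtered))
```

The point of the sketch is only that a row-level predicate cannot alter the schema, so there is no reason for the filtered object to forget it.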

Environment info

datasets version: 3.0.2

lhoestq commented 1 month ago

Closed by https://github.com/huggingface/datasets/pull/7209, thanks @alex-hh!