huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

List of dictionary features get standardized #6899

Open sohamparikh94 opened 2 weeks ago

sohamparikh94 commented 2 weeks ago

Describe the bug

Hi, I'm trying to create a Hugging Face dataset from a list using Dataset.from_list.

Each sample in the list is a dict with the same keys (which will be my features). The value of each feature is a list of dictionaries, and each of those dictionaries can have a different set of keys. However, the datasets library standardizes all dictionaries under a feature: every dictionary ends up with every key that appears in any dictionary under that feature, and the keys it did not originally have are set to None.
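As a rough illustration of the effect described above, the following pure-Python sketch mimics the observed result: it takes the union of keys across a list of dictionaries and fills the absent ones with None. The helper name `standardize` is hypothetical, and this is only a model of the output seen in this report, not how the datasets library is implemented.

```python
def standardize(dicts):
    """Give every dict the union of all keys, filling gaps with None.

    Hypothetical helper for illustration only; it models the observed
    output and is not taken from the `datasets` source code.
    """
    # Collect keys in first-seen order so the result is deterministic.
    all_keys = []
    for d in dicts:
        for key in d:
            if key not in all_keys:
                all_keys.append(key)
    # dict.get returns None for keys a given dict never had.
    return [{key: d.get(key) for key in all_keys} for d in dicts]


feature_1 = [{"key1": "value1"}, {"key2": "value2"}]
standardized = standardize(feature_1)
print(standardized)
```

Running this on the two dictionaries from the report gives each of them both `key1` and `key2`, which matches the shape of the output shown in the reproduction.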

How can I keep the same set of keys as in the original list for each dictionary under a feature?

Steps to reproduce the bug

from datasets import Dataset

# Build one sample whose "feature_1" value is a list of dicts with different keys
def generate_sample():
    return {
        "text": "Sample text",
        # Each dict deliberately has a different key
        "feature_1": [{"key1": "value1"}, {"key2": "value2"}],
    }

# Generate multiple samples
num_samples = 10
samples = [generate_sample() for _ in range(num_samples)]

# Create a Hugging Face Dataset
dataset = Dataset.from_list(samples)
print(dataset[0])

{'text': 'Sample text', 'feature_1': [{'key1': 'value1', 'key2': None}, {'key1': None, 'key2': 'value2'}]}

Expected behavior

{'text': 'Sample text', 'feature_1': [{'key1': 'value1'}, {'key2': 'value2'}]}
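One possible workaround, sketched under the assumption that the nested dictionaries are JSON-serializable: store each dictionary as a JSON string so that the column holds plain strings rather than dictionaries, then decode the strings when reading rows back. The helpers `encode_sample` and `decode_sample` are hypothetical names and not part of the datasets library; the sketch only shows the round trip and leaves the call to Dataset.from_list out.

```python
import json


def encode_sample(sample, column="feature_1"):
    """Return a copy of `sample` with each dict in `column` stored as a JSON string."""
    encoded = dict(sample)
    encoded[column] = [json.dumps(d) for d in sample[column]]
    return encoded


def decode_sample(row, column="feature_1"):
    """Inverse of encode_sample: parse the JSON strings back into dicts."""
    decoded = dict(row)
    decoded[column] = [json.loads(s) for s in row[column]]
    return decoded


sample = {"text": "Sample text", "feature_1": [{"key1": "value1"}, {"key2": "value2"}]}

# Assumption: a list of encoded samples would be what is passed to Dataset.from_list.
encoded = encode_sample(sample)

# Assumption: decode_sample would be applied to each row read back from the dataset.
restored = decode_sample(encoded)
print(restored)
```

The trade-off is that the column becomes a list of opaque strings, so the nested fields are no longer typed or directly accessible until they are decoded. Values that JSON cannot represent exactly (for example, non-string dictionary keys) would not survive the round trip unchanged.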

Environment info