huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.28k stars 2.7k forks

How to debug #7249

Open ShDdu opened 4 weeks ago

ShDdu commented 4 weeks ago

Describe the bug

I wanted to use my own script to handle the processing, so I followed the tutorial documentation and wrote the MyDatasetConfig and MyDataset builder classes (the latter containing the _info, _split_generators, and _generate_examples methods). A test with simple data produced the expected processing results, but when I moved on to more complex processing I found I was unable to debug (even the simple samples were inaccessible). No errors are reported, and the print statements in _info, _split_generators, and _generate_examples are executed, but execution never pauses at my breakpoints.
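For context (an assumption about what is happening, not something confirmed in this issue): `load_dataset` imports a *copy* of a local loading script from the `datasets` modules cache rather than the file sitting next to `main.py`, so IDE breakpoints set in the original file may never bind to the code that actually runs. A quick stdlib check of which file a module was really imported from looks like this (`loaded_path` is a hypothetical helper name):

```python
import sys

def loaded_path(module_name):
    """Return the file a module was actually imported from, or None
    if the module has not been imported (or has no file)."""
    mod = sys.modules.get(module_name)
    return getattr(mod, "__file__", None)
```

Printing `loaded_path("my_dataset")` after `load_dataset(...)` returns would show whether the interpreter executed the local file or a cached copy; breakpoints have to be set in whichever file is reported.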

Steps to reproduce the bug

my_dataset.py

import json

import datasets


class MyDatasetConfig(datasets.BuilderConfig):
    def __init__(self, **kwargs):
        super(MyDatasetConfig, self).__init__(**kwargs)


class MyDataset(datasets.GeneratorBasedBuilder):
    VERSION = datasets.Version("1.0.0")

    BUILDER_CONFIGS = [
        MyDatasetConfig(
            name="default",
            version=VERSION,
            description="myDATASET",
        ),
    ]

    def _info(self):
        print("info")  # breakpoints
        return datasets.DatasetInfo(
            description="myDATASET",
            features=datasets.Features(
                {
                    "id": datasets.Value("int32"),
                    "text": datasets.Value("string"),
                    "label": datasets.ClassLabel(names=["negative", "positive"]),
                }
            ),
            supervised_keys=("text", "label"),
        )

    def _split_generators(self, dl_manager):
        print("generate")  # breakpoints
        data_file = "data.json"
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": data_file}
            ),
        ]

    def _generate_examples(self, filepath):
        print("example")  # breakpoints
        with open(filepath, encoding="utf-8") as f:
            data = json.load(f)
            for idx, sample in enumerate(data):
                yield idx, {
                    "id": sample["id"],
                    "text": sample["text"],
                    "label": sample["label"],
                }
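One way to sidestep the loader entirely while developing is to run the generation logic as a plain function, where any IDE breakpoint binds normally. A minimal self-contained sketch mirroring the `_generate_examples` loop above (the `generate_examples` name is illustrative, not part of the reporter's code):

```python
import json

# Same loop as _generate_examples, but as a plain function: running this file
# directly in the IDE guarantees that breakpoints set here will fire.
def generate_examples(filepath):
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    for idx, sample in enumerate(data):
        yield idx, {
            "id": sample["id"],
            "text": sample["text"],
            "label": sample["label"],
        }
```

Once the logic works stand-alone, it can be pasted back into the builder class unchanged.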

main.py

import os

os.environ["TRANSFORMERS_NO_MULTIPROCESSING"] = "1"

from datasets import load_dataset

dataset = load_dataset("my_dataset.py", split="train", cache_dir=None)

print(dataset[:5])

Expected behavior

Execution should pause at the breakpoints while running under the debugger.

Environment info

PyCharm