huggingface / datasets


DatasetInfo issue when testing multiple configs: mixed task_templates #4752

Open BramVanroy opened 2 years ago

BramVanroy commented 2 years ago

Describe the bug

When running datasets-cli test, it seems that some config properties in a DatasetInfo get mangled, leading to errors, e.g., about ClassLabel.
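For reference, the command being run was along these lines (the dataset path is an assumption; the --save_infos and --all_configs flags are the ones discussed further down):

datasets-cli test datasets/hebban-reviews --save_infos --all_configs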

Steps to reproduce the bug

In summary, what I want to do is create three configs: an unfiltered one without any ClassLabel columns, a filtered_sentiment one with the sentiment as a ClassLabel, and a filtered_rating one with the rating as a ClassLabel.

This might be a bit tedious to reproduce, so I am sorry, but these are the steps:

import json
from pathlib import Path

import datasets

# HebbanReviewsConfig and the _HEBBAN_* description/version constants are
# defined earlier in the script and omitted here for brevity.


class HebbanReviews(datasets.GeneratorBasedBuilder):
    """The Hebban book reviews dataset."""

    BUILDER_CONFIGS = [
        HebbanReviewsConfig(
            name="unfiltered",
            description=_HEBBAN_REVIEWS_UNFILTERED_DESCRIPTION,
            version=datasets.Version(_HEBBAN_VERSION)
        ),
        HebbanReviewsConfig(
            name="filtered_sentiment",
            description=f"This config has the negative, neutral, and positive sentiment scores as ClassLabel in the 'review_sentiment' column.\n{_HEBBAN_REVIEWS_FILTERED_DESCRIPTION}",
            version=datasets.Version(_HEBBAN_VERSION)
        ),
        HebbanReviewsConfig(
            name="filtered_rating",
            description=f"This config has the 5-class ratings as ClassLabel in the 'review_rating0' column (which is a variant of 'review_rating' that starts counting from 0 instead of 1).\n{_HEBBAN_REVIEWS_FILTERED_DESCRIPTION}",
            version=datasets.Version(_HEBBAN_VERSION)
        )
    ]

    DEFAULT_CONFIG_NAME = "filtered_sentiment"

    _URLS = {
        "train": "train.jsonl.gz",
        "test": "test.jsonl.gz",
        "unfiltered": "unfiltered.jsonl.gz",
    }

    def _info(self):
        features = {
            "review_title": datasets.Value("string"),
            "review_text": datasets.Value("string"),
            "review_text_without_quotes": datasets.Value("string"),
            "review_n_quotes": datasets.Value("int32"),
            "review_n_tokens": datasets.Value("int32"),
            "review_rating": datasets.Value("int32"),
            "review_rating0": datasets.Value("int32"),
            "review_author_url": datasets.Value("string"),
            "review_author_type": datasets.Value("string"),
            "review_n_likes": datasets.Value("int32"),
            "review_n_comments": datasets.Value("int32"),
            "review_url": datasets.Value("string"),
            "review_published_date": datasets.Value("string"),
            "review_crawl_date": datasets.Value("string"),
            "lid": datasets.Value("string"),
            "lid_probability": datasets.Value("float32"),
            "review_sentiment": datasets.features.ClassLabel(names=["negative", "neutral", "positive"]),
            "review_sentiment_label": datasets.Value("string"),
            "book_id": datasets.Value("int32"),
        }

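        # Depending on the config, swap the plain columns for ClassLabel
        # features and attach the matching text-classification task template.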
        if self.config.name == "filtered_sentiment":
            task_templates = [datasets.TextClassification(text_column="review_text_without_quotes", label_column="review_sentiment")]
        elif self.config.name == "filtered_rating":
            # For CrossEntropy, our classes need to start at index 0 -- not 1
            features["review_rating0"] = datasets.features.ClassLabel(names=["1", "2", "3", "4", "5"])
            features["review_sentiment"] = datasets.Value("int32")
            task_templates = [datasets.TextClassification(text_column="review_text_without_quotes", label_column="review_rating0")]
        elif self.config.name == "unfiltered":  # no ClassLabels in unfiltered
            features["review_sentiment"] = datasets.Value("int32")
            task_templates = None
        else:
            raise ValueError(f"Unsupported config {self.config.name}. Expected one of 'filtered_sentiment' (default),"
                             f" 'filtered_rating', or 'unfiltered'")
        print("AT INFO", self.config.name, task_templates)
        return datasets.DatasetInfo(
            description=self.config.description,
            features=datasets.Features(features),
            homepage="https://huggingface.co/datasets/BramVanroy/hebban-reviews",
            citation=_HEBBAN_REVIEWS_CITATION,
            task_templates=task_templates,
            license="cc-by-4.0"
        )

    def _split_generators(self, dl_manager):
        if self.config.name.startswith("filtered"):
            files = dl_manager.download_and_extract({"train": self._URLS["train"],
                                                     "test": self._URLS["test"]})
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={
                        "data_file": files["train"]
                    },
                ),
                datasets.SplitGenerator(
                    name=datasets.Split.TEST,
                    gen_kwargs={
                        "data_file": files["test"]
                    },
                ),
            ]
        elif self.config.name == "unfiltered":
            files = dl_manager.download_and_extract({"train": self._URLS["unfiltered"]})
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN,
                    gen_kwargs={
                        "data_file": files["train"]
                    },
                ),
            ]
        else:
            raise ValueError(f"Unsupported config {self.config.name}. Expected one of 'filtered_sentiment' (default),"
                             f" 'filtered_rating', or 'unfiltered'")

    def _generate_examples(self, data_file):
        # Read the JSON-lines file; each line is one review dict.
        with Path(data_file).open(encoding="utf-8") as fhin:
            for line_idx, line in enumerate(fhin):
                row = json.loads(line)
                yield line_idx, row
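
To see the per-config values without running the full test, a quick check like the following should work (a sketch; it assumes the script lives at datasets/hebban-reviews and uses datasets.load_dataset_builder, which also resolves local scripts):

from datasets import load_dataset_builder

for config_name in ("unfiltered", "filtered_sentiment", "filtered_rating"):
    builder = load_dataset_builder("datasets/hebban-reviews", config_name)
    # Expected: None for "unfiltered", a single TextClassification template
    # for the two filtered configs.
    print(config_name, builder.info.task_templates)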

Expected results

Successful tests for the three different configs.

Actual results

I printed out the values that are passed to DatasetInfo for the config name and task_templates, as you can see in the print statement above. There, as expected, I get unfiltered None. I also modified datasets/info.py and added this line at L170:

print("INTERNALLY AT INFO.PY", self.config_name, self.task_templates)

To my surprise, here I get unfiltered [TextClassification(task='text-classification', text_column='review_text_without_quotes', label_column='review_sentiment')]. So, one way or another, unfiltered suddenly does have a task template here, even though that is not what is written in the data loading script, as the first print statement correctly shows.

I do not quite understand how, but it seems that the config name and task_templates get mixed up: unfiltered ends up with the filtered_sentiment task template, whose label_column review_sentiment is a plain int32 in the unfiltered features, which is exactly what the ClassLabel check in the trace below trips over.

This ultimately leads to the following error, but this trace may not be very useful in itself:

Traceback (most recent call last):
  File "C:\Users\bramv\.virtualenvs\hebban-U6poXNQd\Scripts\datasets-cli-script.py", line 33, in <module>
    sys.exit(load_entry_point('datasets', 'console_scripts', 'datasets-cli')())
  File "c:\dev\python\hebban\datasets\src\datasets\commands\datasets_cli.py", line 39, in main
    service.run()
  File "c:\dev\python\hebban\datasets\src\datasets\commands\test.py", line 144, in run
    builder.as_dataset()
  File "c:\dev\python\hebban\datasets\src\datasets\builder.py", line 899, in as_dataset
    datasets = map_nested(
  File "c:\dev\python\hebban\datasets\src\datasets\utils\py_utils.py", line 393, in map_nested
    mapped = [
  File "c:\dev\python\hebban\datasets\src\datasets\utils\py_utils.py", line 394, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "c:\dev\python\hebban\datasets\src\datasets\utils\py_utils.py", line 330, in _single_map_nested
    return function(data_struct)
  File "c:\dev\python\hebban\datasets\src\datasets\builder.py", line 930, in _build_single_dataset
    ds = self._as_dataset(
  File "c:\dev\python\hebban\datasets\src\datasets\builder.py", line 1006, in _as_dataset
    return Dataset(fingerprint=fingerprint, **dataset_kwargs)
  File "c:\dev\python\hebban\datasets\src\datasets\arrow_dataset.py", line 661, in __init__
    info = info.copy() if info is not None else DatasetInfo()
  File "c:\dev\python\hebban\datasets\src\datasets\info.py", line 286, in copy
    return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})
  File "<string>", line 20, in __init__
  File "c:\dev\python\hebban\datasets\src\datasets\info.py", line 176, in __post_init__
    self.task_templates = [
  File "c:\dev\python\hebban\datasets\src\datasets\info.py", line 177, in <listcomp>
    template.align_with_features(self.features) for template in (self.task_templates)
  File "c:\dev\python\hebban\datasets\src\datasets\tasks\text_classification.py", line 22, in align_with_features
    raise ValueError(f"Column {self.label_column} is not a ClassLabel.")
ValueError: Column review_sentiment is not a ClassLabel.

Environment info

BramVanroy commented 2 years ago

I've narrowed down the issue to dataset_module_factory, which already creates a dataset_infos.json file down in the .cache/modules/dataset_modules/.. folder. That JSON file already contains the wrong task_templates for unfiltered.
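
For anyone who wants to verify this, the cached infos can be inspected with something like the sketch below (the cache layout is an assumption based on the default Hugging Face cache location):

import json
from pathlib import Path

# Assumed default module cache; adjust if HF_MODULES_CACHE points elsewhere.
modules_cache = Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules"

for infos_file in modules_cache.glob("**/dataset_infos.json"):
    infos = json.loads(infos_file.read_text(encoding="utf-8"))
    for config_name, info in infos.items():
        print(infos_file, config_name, info.get("task_templates"))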

BramVanroy commented 2 years ago

Ugh. Found the issue: apparently datasets was reusing the already existing dataset_infos.json that is inside datasets/datasets/hebban-reviews! Is this desired behavior?

Perhaps, when --save_infos and --all_configs are given, an existing dataset_infos.json file should first be deleted before continuing with the test? Passing both flags implies that the user wants to create a fresh infos file for all configs anyway.
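
Concretely, the check could look something like this inside the test command (an illustrative sketch, not the actual datasets source; the function and argument names are made up):

import os

def maybe_remove_stale_infos(dataset_dir: str, save_infos: bool, all_configs: bool) -> None:
    """Delete a pre-existing dataset_infos.json so the test regenerates it for all configs."""
    if save_infos and all_configs:
        infos_path = os.path.join(dataset_dir, "dataset_infos.json")
        if os.path.isfile(infos_path):
            os.remove(infos_path)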

mariosasko commented 2 years ago

Hi! I think this is a reasonable solution. Would you be interested in submitting a PR?