Open BramVanroy opened 2 years ago
I've narrowed down the issue to the dataset_module_factory, which already creates a dataset_infos.json file down in the .cache/modules/dataset_modules/.. folder. That JSON file already contains the wrong task_templates for unfiltered.
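A quick way to confirm this kind of stale metadata is to read the JSON file directly. The sketch below assumes that dataset_infos.json is a JSON object keyed by config name and that each entry may carry a task_templates field. The helper name and the sample data are hypothetical and not part of datasets.

```python
import json
import tempfile
from pathlib import Path


def task_templates_per_config(infos_path):
    """Return {config_name: task_templates} read from a dataset_infos.json file.

    Assumes the file is a JSON object keyed by config name; configs without a
    "task_templates" entry map to None.
    """
    with open(infos_path, encoding="utf-8") as f:
        infos = json.load(f)
    return {name: info.get("task_templates") for name, info in infos.items()}


# Hypothetical sample mimicking a stale file: "unfiltered" carries a
# template that the loading script never declared for it.
sample = {
    "unfiltered": {
        "task_templates": [
            {
                "task": "text-classification",
                "text_column": "review_text_without_quotes",
                "label_column": "review_sentiment",
            }
        ]
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_infos.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    stale = task_templates_per_config(path)

for config_name, templates in stale.items():
    print(config_name, templates)
```

Pointing the helper at the cached file and at the one next to the loading script should show which of the two already holds the unexpected template.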
Ugh. Found the issue: apparently datasets was reusing the already existing dataset_infos.json that is inside datasets/datasets/hebban-reviews! Is this desired behavior?
Perhaps when --save_infos and --all_configs are given, an existing dataset_infos.json file should first be deleted before continuing with the test? That combination of flags implies that the user wants to create a new infos file for all configs anyway.
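As a rough illustration of that proposal, and not the actual datasets-cli code, the check could look like the sketch below. The function name remove_stale_infos and its parameters are hypothetical. It only assumes that the infos file is called dataset_infos.json and sits in the dataset script's directory.

```python
import tempfile
from pathlib import Path


def remove_stale_infos(dataset_dir, save_infos, all_configs):
    """Delete dataset_dir/dataset_infos.json when both flags are set.

    Returns True if a file was removed, False otherwise.
    """
    infos_file = Path(dataset_dir) / "dataset_infos.json"
    if save_infos and all_configs and infos_file.is_file():
        infos_file.unlink()
        return True
    return False


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    stale_file = Path(tmp) / "dataset_infos.json"
    stale_file.write_text("{}", encoding="utf-8")

    # Only one flag set: the existing file is left alone.
    kept = remove_stale_infos(tmp, save_infos=True, all_configs=False)
    still_there = stale_file.exists()

    # Both flags set: the existing file is removed before the test would run.
    removed = remove_stale_infos(tmp, save_infos=True, all_configs=True)
    gone = not stale_file.exists()
```

Removing the file only when both flags are present keeps the current behavior for single-config runs, where other configs' entries in the existing file may still be wanted.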
Hi! I think this is a reasonable solution. Would you be interested in submitting a PR?
Describe the bug
When running the datasets-cli test it would seem that some config properties in a DatasetInfo get mangled, leading to issues, e.g., about the ClassLabel.
Steps to reproduce the bug
In summary, what I want to do is create three configs:
review_sentiment as ClassLabel, TextClassification task with review_sentiment as label. Gets train/test split from respective json.gz files
review_rating0 as ClassLabel, TextClassification task with review_rating0 as label. Gets train/test split from respective json.gz files
This might be a bit tedious to reproduce, so I am sorry, but these are the steps:
datasets/ and install it
https://huggingface.co/datasets/BramVanroy/hebban-reviews into datasets/datasets so that you have a new folder datasets/datasets/hebban-reviews/.
datasets-cli test ./datasets/hebban-reviews/ --save_infos --all_configs from within the topmost datasets directory
Expected results
Succeeding tests for three different configs.
Actual results
I printed out the values that are given to DatasetInfo for config name and task_templates, as you can see. There, as expected, I get unfiltered None. I also modified datasets/info.py and added this line at L.170:
To my surprise, here I get unfiltered [TextClassification(task='text-classification', text_column='review_text_without_quotes', label_column='review_sentiment')]. So one way or another, here I suddenly see that unfiltered now does have a task_template, even though that is not what is written in the data loading script, as the first print statement correctly shows.
I do not quite understand how, but it seems that the config name and task_templates get mixed up.
This ultimately leads to the following error, but this trace may not be very useful in itself:
Environment info
datasets version: 2.4.1.dev0