sileod / tasksource

Datasets collection and preprocessings framework for NLP extreme multitask learning
Apache License 2.0

TypeError: Unhashable type: 'list' #5

Closed nickypro closed 9 months ago

nickypro commented 1 year ago

In tasksource version 0.0.39, loading mmlu with the following code fails:

#mmlu = MultipleChoice('question',labels='answer',choices_list='choices',splits=['validation','dev','test'],
#    dataset_name="tasksource/mmlu",
#    config_name=get_dataset_config_names("tasksource/mmlu")
#)
from tasksource.tasks import mmlu

dataset = mmlu.load()

I get the following error:

Traceback (most recent call last):
  File ".venv/lib/python3.10/site-packages/tasksource/preprocess.py", line 37, in load
    return self(datasets.load_dataset(self.dataset_name,self.config_name))
  File ".venv/lib/python3.10/site-packages/datasets/load.py", line 2106, in load_dataset
    builder_instance = load_dataset_builder(
  File ".venv/lib/python3.10/site-packages/datasets/load.py", line 1829, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File ".venv/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File ".venv/lib/python3.10/site-packages/datasets/builder.py", line 571, in _create_builder_config
    is_custom = (config_id not in self.builder_configs) and config_id != "default"
TypeError: unhashable type: 'list'

This seems to happen because datasets.load_dataset expects config_name to be a single string, whereas get_dataset_config_names("tasksource/mmlu") returns a list of type List[str] (e.g. ['abstract_algebra', 'anatomy', 'astronomy', ...]).
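
To illustrate the suspected cause, here is a minimal sketch that reproduces the same TypeError without tasksource or datasets installed. The builder_configs dict and the is_custom function are stand-ins I am assuming for illustration, modeled on the last line of the traceback; they are not the actual datasets internals.

```python
# Stand-in for self.builder_configs: assumed here to be a dict keyed by config name.
builder_configs = {"abstract_algebra": object(), "anatomy": object()}


def is_custom(config_id):
    # Mirrors the expression shown in the last frame of the traceback.
    return (config_id not in builder_configs) and config_id != "default"


# A single config name is hashable, so the membership test works.
single_result = is_custom("anatomy")

# A list of config names is not hashable, so the membership test raises.
try:
    is_custom(["abstract_algebra", "anatomy"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(single_result)
print(error_message)
```

Under that assumption, any list passed as config_name would fail at this membership check, regardless of its contents.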

The examples shown in README.md do not raise this error.

sileod commented 1 year ago

Hi, mmlu is modeled as multiple tasks, even though they all share the same format. tasksource.load_task('mmlu/econometrics') works. You can use tasksource.concatenate_dataset_dict to concatenate multiple harmonized tasks (e.g. multiple mmlu disciplines).
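
As a rough sketch of loading several disciplines one at a time, something like the following could work. The helper load_mmlu_disciplines is hypothetical (it is not part of tasksource), and the loader is passed in as an argument so the sketch does not assume anything about tasksource beyond the load_task('mmlu/econometrics') call shown above.

```python
from typing import Any, Callable, Dict, Iterable


def load_mmlu_disciplines(
    load_task: Callable[[str], Any],
    disciplines: Iterable[str],
) -> Dict[str, Any]:
    """Load each mmlu discipline as its own task (hypothetical helper).

    load_task is expected to accept an id of the form 'mmlu/<discipline>',
    as tasksource.load_task does for 'mmlu/econometrics'.
    """
    return {name: load_task(f"mmlu/{name}") for name in disciplines}


# Intended usage (requires tasksource and network access, so left as a comment):
# import tasksource
# tasks = load_mmlu_disciplines(tasksource.load_task, ["econometrics", "anatomy"])
```

The resulting per-discipline datasets could then be handed to tasksource.concatenate_dataset_dict; its exact signature is not shown in this thread, so check it before relying on this.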

So I'm not sure whether it's a bug or undefined behavior.