neulab / prompt2model

prompt2model - Generate Deployable Models from Natural Language Instructions
Apache License 2.0
1.94k stars 171 forks

ValueError: Column name input_col not in the dataset. Current columns in the dataset: [] #291

Open bf-yang opened 1 year ago

bf-yang commented 1 year ago

When I run python cli_demo.py, it reports errors:

Generating examples: 100%|██████████| 100/100 [00:00<00:00, 26273.52it/s]
The generated dataset is ready.
The model has not been trained.
Processing datasets.

Traceback (most recent call last):
  File "/home/bufang/prompt2model/cli_demo.py", line 435, in <module>
    main()
  File "/home/bufang/prompt2model/cli_demo.py", line 321, in main
    t5_modified_dataset_dicts = t5_processor.process_dataset_dict(
  File "/home/bufang/prompt2model/prompt2model/dataset_processor/base.py", line 100, in process_dataset_dict
    dataset_dict[dataset_split]
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2901, in map
    return self.remove_columns(remove_columns)
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 563, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 528, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/home/bufang/yes/envs/pt2model/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2064, in remove_columns
    raise ValueError(
ValueError: Column name input_col not in the dataset. Current columns in the dataset: []

How can I fix this error? Thanks

zhaochenyang20 commented 1 year ago

Thanks for your bug report!

ValueError: Column name input_col not in the dataset. Current columns in the dataset: []

This means your dataset is an empty dataset, right? 🤔
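For context, this is exactly what the datasets library does when map(..., remove_columns=...) hits a dataset with zero rows: in the version shown in the traceback, it short-circuits straight to remove_columns, which then fails because there are no columns left to remove. A minimal, purely illustrative repro (not from this thread):

from datasets import Dataset

# An empty dataset: zero rows and zero columns, like the failing split.
empty = Dataset.from_dict({})

# `map` on an empty dataset goes straight to `remove_columns`, raising:
#   ValueError: Column name input_col not in the dataset.
#   Current columns in the dataset: []
empty.map(lambda example: example, remove_columns=["input_col"])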

Could you check your two datasets, [dataset_dict, retrieved_dataset_dict], for potential errors? My guess is that the problem is in retrieved_dataset_dict, so could you send us your prompt, your chosen retrieved dataset, and your chosen columns?
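For example, a check along these lines inside cli_demo.py should show which of the two is empty (a minimal sketch; dataset_dict and retrieved_dataset_dict are the two DatasetDicts the demo passes to the processor):

# Print row counts and column names for every split of both DatasetDicts.
for name, ds_dict in [("generated", dataset_dict), ("retrieved", retrieved_dataset_dict)]:
    for split, ds in ds_dict.items():
        print(f"{name}/{split}: {ds.num_rows} rows, columns={ds.column_names}")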

@bf-yang

bf-yang commented 1 year ago

@zhaochenyang20 Sure. In fact, I just used the prompt example provided by this repo. The prompt is: """Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50." output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)." output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries." output="Europe" """

Then it shows:

1): yulongmannlp/dev_para - Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
2): yulongmannlp/dev_orig - (same SQuAD description as above)
3): yulongmannlp/adv_para - (same SQuAD description as above)
4): yulongmannlp/adv_ori - (same SQuAD description as above)
5): squad - (same SQuAD description as above)
6): lhoestq/squad - (same SQuAD description as above)
7): lhoestq/custom_squad - (same SQuAD description as above)
8): hapandya/sqnnr - (same SQuAD description as above)
9): Yulong-W/squadpararobustness - (same SQuAD description as above)
10): Yulong-W/squadpara - (same SQuAD description as above)
11): Yulong-W/squadorirobustness - (same SQuAD description as above)
12): Yulong-W/squadori - (same SQuAD description as above)
13): MajdTannous/Test3 - (same SQuAD description as above)
14): MajdTannous/Test2 - (same SQuAD description as above)
15): MajdTannous/Dataset1 - (same SQuAD description as above)
16): BerMaker/test - (same SQuAD description as above)
17): TenzinGayche/Demo-datasets - (same SQuAD description as above, labeled "DemoDatasets")
18): SajjadAyoubi/persian_qa - Persian Question Answering (PersianQA) Dataset is a reading comprehension dataset on Persian Wikipedia. The crowd-sourced dataset consists of more than 9,000 entries. Each entry can be either an impossible-to-answer question or a question with one or more answers spanning the passage (the context) from which the questioner proposed the question. Much like the SQuAD2.0 dataset, the impossible or unanswerable questions can be utilized to create a system which "knows that it doesn't know the answer".
19): web_questions - This dataset consists of 6,642 question/answer pairs. The questions are supposed to be answerable by Freebase, a large knowledge graph. The questions are mostly centered around a single named entity. The questions are popular ones asked on the web (at least in 2013).
20): race - RACE is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students. The dataset can serve as the training and test sets for machine comprehension.
21): EleutherAI/race - (same RACE description as above)
22): the-coorporation/the_squad_qg - A preprocessed version of the Stanford Question Answering Dataset (SQuAD) version 2.0, consisting of contexts and questions only. Duplicate contexts have been removed, and corresponding questions have been merged into an array per context. (Same SQuAD description as above.) SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
23): imagenet-1k - ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). ImageNet aims to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, ImageNet hopes to offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy. ImageNet 2012 is the most commonly used subset of ImageNet. This dataset spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images.
24): AlexFierro9/imagenet-1k_test - (same ImageNet description as above)
25): dbpedia_14 - The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in classes.txt. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and the testing dataset 70,000. There are 3 columns in the dataset (same for train and test splits), corresponding to class index (1 to 14), title, and content. The title and content are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). There are no new lines in title or content.

Then I select this dataset: [screenshot]

The chosen columns are: [screenshot]

The chosen model is: [screenshot]

However, it finally reports an error: [screenshot]

After that, it reports the error shown at the top of this issue.

zhaochenyang20 commented 1 year ago

Oops. Let me see!

zhaochenyang20 commented 1 year ago

I reran your process but got the correct retrieved dataset. 🤔

[screenshot]

Would you please run this script to save your retrieved_dataset separately?

from prompt2model.prompt_parser import OpenAIInstructionParser, TaskType

prompt = """Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50."
output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)."
output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries."
output="Europe"
"""

prompt_spec = OpenAIInstructionParser(task_type=TaskType.TEXT_GENERATION)
prompt_spec.parse_from_prompt(prompt)
print(f"Instruction: {prompt_spec.instruction}")
print(f"exmaples: {prompt_spec.examples}")

from prompt2model.dataset_retriever import DescriptionDatasetRetriever

retriever = DescriptionDatasetRetriever()
retrieved_dataset_dict = retriever.retrieve_dataset_dict(prompt_spec)
retrieved_dataset_dict.save_to_disk("retrieved_dataset")
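Once it finishes, you can load the saved dataset back and inspect its splits and columns (a quick sanity check using the standard datasets API):

from datasets import load_from_disk

retrieved_dataset_dict = load_from_disk("retrieved_dataset")
for split, ds in retrieved_dataset_dict.items():
    print(split, ds.num_rows, ds.column_names)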

zhaochenyang20 commented 1 year ago

BTW, we are developing our Colab demo. It may be easier to use.

zhaochenyang20 commented 1 year ago

https://colab.research.google.com/github/neulab/prompt2model/blob/add_colab_demo/colab_demo.ipynb

Try this.

neubig commented 1 year ago

Hi, I don't think this is fixed. I just ran cli_demo.py on the main branch and encountered the same error.

Traceback (most recent call last):
  File "/Users/gneubig/work/prompt2model/cli_demo.py", line 435, in <module>
    main()
  File "/Users/gneubig/work/prompt2model/cli_demo.py", line 321, in main
    t5_modified_dataset_dicts = t5_processor.process_dataset_dict(
  File "/Users/gneubig/work/prompt2model/prompt2model/dataset_processor/base.py", line 100, in process_dataset_dict
    dataset_dict[dataset_split]
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2995, in map
    return self.remove_columns(remove_columns)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/fingerprint.py", line 511, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/Users/gneubig/anaconda3/envs/promp2model2/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2155, in remove_columns
    raise ValueError(
ValueError: Column name input_col not in the dataset. Current columns in the dataset: []

zhaochenyang20 commented 1 year ago

Which dataset? generated or retrieved?

neubig commented 1 year ago

This line: https://github.com/neulab/prompt2model/blob/e11144e15a5ae080859e24ab2daa6e373b6a4ef1/cli_demo.py#L321

zhaochenyang20 commented 1 year ago

DATASET_DICTS = [dataset_dict, retrieved_dataset_dict]

I am pretty familiar with generated_dataset, so I guess the problem lies in retrieved_dataset_dict. 🤔

Have you checked it?