Description: I am encountering a KeyError when using the LabelingAgent from the autolabel library while trying to extract entities such as energy rates and dates from a CSV file containing PDF content. The error occurs during the agent.plan(ds_ent, max_items=70) call.
Steps to Reproduce:
Use the following code configuration:
config = {
    "task_name": "PersonLocationOrgMiscNER",
    "task_type": "named_entity_recognition",
    "dataset": {
        "label_column": "metadata",
        "text_column": "content",
        "delimiter": ","
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        "params": {}
    },
    "prompt": {
        "task_guidelines": "You are an AI assistant tasked with extracting energy rates for Tension Level 1 from a PDF document that contains energy pricing information...",
        "labels": [
            "month",
            "tension_1_ENEL",
            "tension_1_customer",
            "year"
        ],
        "few_shot_examples": [
            {
                "example": example.df['content'][0],
                "CategorizedLabels": "{'month': ['SEPTIEMBRE'], 'tension_1_ENEL': ['1,007.6450'], 'tension_1_customer': ['954.0973'], 'year':['2024'] }"
            },
            ...
        ],
        "few_shot_selection": "semantic_similarity",
        "few_shot_num": 3,
        "example_template": "Example: {example}\nOutput: {CategorizedLabels}",
    }
}
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
ds_ent = AutolabelDataset('tarifas.csv', config=config)
agent.plan(ds_ent, max_items=70)
Ensure the tarifas.csv file has the following structure:
file_path,name,content,metadata
0,https://www.enel.com.co/content/dam/enel-co/es...,ENEl0,Page 1: TARIFAS DE ENERGÍA ELÉCTRICA ($/kWh) _...,{'num_pages': 1}
1,https://www.enel.com.co/content/dam/enel-co/es...,ENEl1,Page 1: Fe de erratas\nEl domingo 16 de junio ...,{'num_pages': 1}
2,https://www.enel.com.co/content/dam/enel-co/es...,ENEl2,Page 1: SECTOR RESIDENCIAL NIVEL DE TENSIÓN 1 ...,{'num_pages': 1}
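Error Message:
KeyError Traceback (most recent call last)
Cell In[109], line 1
----> 1 agent.plan(ds_ent, max_items = 70)
      2 agent.run(ds_ent, max_items = 70)
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/autolabel/labeler.py:389, in LabelingAgent.plan(self, dataset, max_items, start_index)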
380 if (
381 self.config.explanation_column()
382 and len(seed_examples) > 0
383 and self.config.explanation_column() not in list(seed_examples[0].keys())
384 ):
385 raise ValueError(
386 f"Explanation column {self.config.explanation_column()} not found in dataset.\nMake sure that explanations were generated using labeler.generate_explanations(seed_file)."
387 )
--> 389 self.example_selector = ExampleSelectorFactory.initialize_selector(
390 self.config,
391 [safe_serialize_to_string(example) for example in seed_examples],
392 dataset.df.keys().tolist(),
393 cache=self.generation_cache is not None,
394 )
396 if self.config.label_selection():
397 if self.config.task_type() != TaskType.CLASSIFICATION:
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/autolabel/few_shot/__init__.py:115, in ExampleSelectorFactory.initialize_selector(config, examples, columns, cache)
109 if algorithm not in [
110 FewShotAlgorithm.FIXED,
111 FewShotAlgorithm.LABEL_DIVERSITY_RANDOM,
112 ]:
113 params["cache"] = cache
--> 115 return example_cls.from_examples(**params)
File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/langchain/prompts/example_selector/semantic_similarity.py:91, in SemanticSimilarityExampleSelector.from_examples(cls, examples, embeddings, vectorstore_cls, k, input_keys, **vectorstore_cls_kwargs)
73 """Create k-shot example selector using example list and embeddings.
74
75 Reshuffles examples dynamically based on query similarity.
(...)
87 The ExampleSelector instantiated, backed by a vector store.
88 """
89 if input_keys:
90 string_examples = [
---> 91 " ".join(sorted_values({k: eg[k] for k in input_keys}))
92 for eg in examples
93 ]
94 else:
95 string_examples = [" ".join(sorted_values(eg)) for eg in examples]
KeyError: 'file_path'
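The failing expression can be reproduced without autolabel or langchain. The sketch below is only an approximation under stated assumptions: sorted_values is a stand-in written for this example, input_keys is assumed to hold the four CSV column names, and the example dict uses placeholder strings in place of the real PDF text.

```python
# Stand-in for the sorted_values helper seen in the traceback (assumption):
# it returns the dict's values ordered by key.
def sorted_values(values):
    return [values[key] for key in sorted(values)]

# Assumed to match dataset.df.keys().tolist() for tarifas.csv.
input_keys = ["file_path", "name", "content", "metadata"]

# One few-shot example shaped like the config entry; the strings are placeholders.
examples = [
    {"example": "placeholder text", "CategorizedLabels": "{'month': ['SEPTIEMBRE']}"}
]

def build_string_examples(examples, input_keys):
    # Same expression as line 91 of semantic_similarity.py in the traceback.
    return [" ".join(sorted_values({k: eg[k] for k in input_keys})) for eg in examples]

try:
    build_string_examples(examples, input_keys)
    missing_key = None
except KeyError as err:
    # The first column that the example dict does not contain.
    missing_key = err.args[0]

print(f"KeyError: {missing_key!r}")
```

With these inputs the lookup fails on the first dataset column, which is consistent with the KeyError: 'file_path' reported above.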
Environment:
Python version: 3.12
Expected Behavior: The agent.plan() method should successfully plan the labeling of the dataset without throwing a KeyError.
Additional Information:
I've checked the CSV structure and confirmed that the columns are correctly labeled.
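The traceback suggests a mismatch between the dataset columns and the few-shot examples rather than a problem in the CSV itself: the column list from dataset.df.keys().tolist() appears to be passed on as input_keys, so the selector looks up every dataset column (file_path first) in each few-shot example, while the examples in my config only contain the keys example and CategorizedLabels.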
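A quick way to check this before calling agent.plan() is to compare the keys of each few-shot example against the dataset columns. This is a minimal sketch: find_missing_columns is a hypothetical helper, not part of autolabel, and the sample data below are placeholders rather than the real file contents.

```python
def find_missing_columns(seed_examples, columns):
    """Map each example index to the dataset columns that example lacks.

    Hypothetical helper for illustration; not an autolabel function.
    """
    report = {}
    for index, example in enumerate(seed_examples):
        missing = [col for col in columns if col not in example]
        if missing:
            report[index] = missing
    return report

# Assumed column names, taken from the CSV header shown above.
columns = ["file_path", "name", "content", "metadata"]

# Placeholder examples: the first mirrors the config, the second has every column.
seed_examples = [
    {"example": "placeholder text", "CategorizedLabels": "placeholder labels"},
    {"file_path": "a", "name": "b", "content": "c", "metadata": "d"},
]

report = find_missing_columns(seed_examples, columns)
for index, missing in report.items():
    print(f"few-shot example {index} is missing columns: {missing}")
```

An empty report would mean every example carries all dataset columns, so the dict lookup in the selector should not raise a KeyError for that reason.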