refuel-ai / autolabel

Label, clean and enrich text datasets with LLMs.
https://docs.refuel.ai/
MIT License

[Bug]: KeyError: 'file_path' when planning with LabelingAgent #906

Closed: sazboxai closed this issue 1 month ago

sazboxai commented 1 month ago

Description: Calling agent.plan(ds_ent, max_items=70) on a LabelingAgent from the autolabel library raises KeyError: 'file_path'. I am trying to extract entities such as energy rates and dates from a CSV file whose rows contain text extracted from PDFs.

Steps to Reproduce:

Use the following code configuration:

config = {
    "task_name": "PersonLocationOrgMiscNER",
    "task_type": "named_entity_recognition",
    "dataset": {
        "label_column": "metadata",
        "text_column": "content",
        "delimiter": ","
    },
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        "params": {}
    },
    "prompt": {
        "task_guidelines": "You are an AI assistant tasked with extracting energy rates for Tension Level 1 from a PDF document that contains energy pricing information...",
        "labels": [
            "month",
            "tension_1_ENEL",
            "tension_1_customer",
            "year"
        ],
        "few_shot_examples": [
            {
                "example": example.df['content'][0],
                "CategorizedLabels": "{'month': ['SEPTIEMBRE'], 'tension_1_ENEL': ['1,007.6450'], 'tension_1_customer': ['954.0973'], 'year':['2024'] }"
            },
            ...
        ],
        "few_shot_selection": "semantic_similarity",
        "few_shot_num": 3,
        "example_template": "Example: {example}\nOutput: {CategorizedLabels}",
    }
}

from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)

ds_ent = AutolabelDataset('tarifas.csv', config=config)

agent.plan(ds_ent, max_items=70)

Ensure the tarifas.csv file has the following structure:

file_path,name,content,metadata
0,https://www.enel.com.co/content/dam/enel-co/es...,ENEl0,Page 1: TARIFAS DE ENERGÍA ELÉCTRICA ($/kWh) _...,{'num_pages': 1}
1,https://www.enel.com.co/content/dam/enel-co/es...,ENEl1,Page 1: Fe de erratas\nEl domingo 16 de junio ...,{'num_pages': 1}
2,https://www.enel.com.co/content/dam/enel-co/es...,ENEl2,Page 1: SECTOR RESIDENCIAL NIVEL DE TENSIÓN 1 ...,{'num_pages': 1}
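
One thing worth checking in the listing above is that the header names four columns while each data row carries five comma-separated fields (a leading 0, 1, 2 that looks like a saved DataFrame index). The sketch below only counts fields in the listing exactly as pasted here, using the standard csv module; whether the real tarifas.csv on disk has the same shape is an assumption.

```python
import csv
import io

# Listing copied verbatim from this report; truncated values ("...") are kept as-is.
# A raw string keeps the literal backslash-n in the second row from becoming a line break.
listing = r"""file_path,name,content,metadata
0,https://www.enel.com.co/content/dam/enel-co/es...,ENEl0,Page 1: TARIFAS DE ENERGÍA ELÉCTRICA ($/kWh) _...,{'num_pages': 1}
1,https://www.enel.com.co/content/dam/enel-co/es...,ENEl1,Page 1: Fe de erratas\nEl domingo 16 de junio ...,{'num_pages': 1}
2,https://www.enel.com.co/content/dam/enel-co/es...,ENEl2,Page 1: SECTOR RESIDENCIAL NIVEL DE TENSIÓN 1 ...,{'num_pages': 1}
"""

reader = csv.reader(io.StringIO(listing))
header = next(reader)
rows = [row for row in reader if row]

print(len(header))              # 4
print([len(r) for r in rows])   # [5, 5, 5]

# True when every data row has exactly one more field than the header.
extra_leading_field = all(len(r) == len(header) + 1 for r in rows)
print(extra_leading_field)      # True
```

If the file on disk really has this shape, pandas.read_csv is documented to use the first field of each row as the index, so the column names would still be file_path, name, content and metadata. That is consistent with 'file_path' being the first key the failing lookup asks for.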

Error Message:


KeyError                                  Traceback (most recent call last)
Cell In[109], line 1
----> 1 agent.plan(ds_ent, max_items = 70)
      2 agent.run(ds_ent, max_items = 70)

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/autolabel/labeler.py:389, in LabelingAgent.plan(self, dataset, max_items, start_index)
    380 if (
    381     self.config.explanation_column()
    382     and len(seed_examples) > 0
    383     and self.config.explanation_column() not in list(seed_examples[0].keys())
    384 ):
    385     raise ValueError(
    386         f"Explanation column {self.config.explanation_column()} not found in dataset.\nMake sure that explanations were generated using labeler.generate_explanations(seed_file)."
    387     )
--> 389 self.example_selector = ExampleSelectorFactory.initialize_selector(
    390     self.config,
    391     [safe_serialize_to_string(example) for example in seed_examples],
    392     dataset.df.keys().tolist(),
    393     cache=self.generation_cache is not None,
    394 )
    396 if self.config.label_selection():
    397     if self.config.task_type() != TaskType.CLASSIFICATION:

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/autolabel/few_shot/__init__.py:115, in ExampleSelectorFactory.initialize_selector(config, examples, columns, cache)
    109 if algorithm not in [
    110     FewShotAlgorithm.FIXED,
    111     FewShotAlgorithm.LABEL_DIVERSITY_RANDOM,
    112 ]:
    113     params["cache"] = cache
--> 115 return example_cls.from_examples(**params)

File /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/langchain/prompts/example_selector/semantic_similarity.py:91, in SemanticSimilarityExampleSelector.from_examples(cls, examples, embeddings, vectorstore_cls, k, input_keys, **vectorstore_cls_kwargs)
     73 """Create k-shot example selector using example list and embeddings.
     74
     75 Reshuffles examples dynamically based on query similarity.
   (...)
     87     The ExampleSelector instantiated, backed by a vector store.
     88 """
     89 if input_keys:
     90     string_examples = [
---> 91         " ".join(sorted_values({k: eg[k] for k in input_keys}))
     92         for eg in examples
     93     ]
     94 else:
     95     string_examples = [" ".join(sorted_values(eg)) for eg in examples]

KeyError: 'file_path'

Environment:

Python version: 3.12

Expected Behavior:

The agent.plan() method should successfully plan the labeling of the dataset without throwing a KeyError.

Additional Information:

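I've checked the CSV structure and confirmed that the columns are correctly labeled. The KeyError is raised at eg[k] inside SemanticSimilarityExampleSelector.from_examples, where eg is one of the few-shot examples and k comes from input_keys. This suggests that 'file_path' is being looked up in an example that does not have that key, possibly because the keys of the few-shot examples do not match the column names the selector expects.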
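
A minimal sketch of what the traceback appears to show, without importing autolabel or langchain. It assumes that the columns passed to initialize_selector (dataset.df.keys().tolist()) end up as input_keys, and that the seed examples keep the keys used in the config ("example" and "CategorizedLabels"); both points are inferred from the traceback, not confirmed. The helper names join_input_values and missing_keys are hypothetical, and the example values are placeholders.

```python
# Column names taken from the CSV header in this report (assumed to become input_keys).
input_keys = ["file_path", "name", "content", "metadata"]

# A seed example shaped like the ones in the config: keyed by the template
# placeholders, not by dataset column names. Values are placeholders.
seed_example = {
    "example": "<content of one row>",
    "CategorizedLabels": "<labels for that row>",
}


def join_input_values(example, keys):
    """Imitate the dict comprehension on line 91 of the traceback (hypothetical helper)."""
    selected = {k: example[k] for k in keys}  # raises KeyError on the first missing key
    return " ".join(str(selected[k]) for k in sorted(selected))


def missing_keys(example, keys):
    """Hypothetical pre-flight check: which expected keys does the example lack?"""
    return [k for k in keys if k not in example]


try:
    join_input_values(seed_example, input_keys)
    failed_key = None
except KeyError as err:
    failed_key = err.args[0]

print(failed_key)                              # file_path
print(missing_keys(seed_example, input_keys))  # ['file_path', 'name', 'content', 'metadata']
```

Under these assumptions, the lookup fails on the first dataset column simply because the seed examples carry none of the dataset's column names. Keying the few-shot examples (and the placeholders in example_template) by the dataset's own columns may avoid the error, though that has not been verified against autolabel here.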