pdhoolia / se-agent

Software Engineering Agent
GNU General Public License v3.0

Support for evaluation of se-agent using SWE-bench #42

Open pdhoolia opened 3 days ago

pdhoolia commented 3 days ago

SWE-bench provides datasets to evaluate software engineering agents.

Each evaluation task instance (mapping to our concept of issue) provides:

  1. repo: mapping to our repo_full_name
  2. instance_id: mapping to our issue
  3. problem_statement: mapping to our issue body
  4. patch: mapping to the combination of our localization & change suggestions
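As a rough illustration of this mapping, a task instance can be converted into the inputs our agent expects. This is only a sketch: `EvaluationTask` and `to_evaluation_task` are hypothetical names, not part of se-agent. It also assumes the SWE-bench `base_commit` field supplies the commit that the rest of this issue refers to as commit_hash.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class EvaluationTask:
    """Hypothetical container for one SWE-bench task, in se-agent terms."""
    repo_full_name: str      # SWE-bench "repo"
    instance_id: str         # SWE-bench "instance_id" (our issue)
    problem_statement: str   # SWE-bench "problem_statement" (our issue body)
    commit_hash: str         # SWE-bench "base_commit"
    reference_patch: str     # SWE-bench "patch" (the reference solution)


def to_evaluation_task(instance: Mapping[str, Any]) -> EvaluationTask:
    """Map a SWE-bench task instance (a dict-like row) onto our concepts."""
    return EvaluationTask(
        repo_full_name=instance["repo"],
        instance_id=instance["instance_id"],
        problem_statement=instance["problem_statement"],
        commit_hash=instance["base_commit"],
        reference_patch=instance.get("patch", ""),
    )
```

The reference patch is kept only for comparison afterwards; it should never be shown to the agent.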

The following challenges need to be addressed to evaluate our agent using the SWE-bench dataset:

  1. Our agent needs to be added as a collaborator on the repo. However, SWE-bench is for offline evaluation. To address this, we should also be able to construct a Project from a pre-cloned repository.

    1. Project should not keep a Github object as a member, since that requires authenticating with GitHub at construction time.
    2. Let's add another function in Project that returns an authenticated Github object.
    3. Functions that interact with the GitHub repo (e.g., clone_repository, pull_latest_changes, post_issue_comment, and fetch_issue_comments) can call that new function to get an authenticated Github object.
    4. With this change, a Project object can be constructed offline for an externally cloned repository.
  2. The agent operates on the latest snapshot of the repository. For evaluation, however, the requirement is different: we need the agent to operate on a specific commit hash.

    1. Add a function to Project that allows resetting the repo state to a specified commit hash or, by default, to HEAD.
    2. The semantic understanding update method in Project starts by pulling the latest changes in the repo. This is good for the live agent but not for evaluation. To facilitate evaluation, let's add a new method to vector_store_utils: create_vector_store(source_dir: str, uri: str, embeddings: Embeddings, commit_hash: str=None) -> VectorStore. It should create a vector store at the specified uri by embedding the contents of every file in source_dir, using each file's relative path as its id and also adding it as metadata.filepath. It should also keep track of the uri(s) it created where a commit_hash was available; if a vector store was previously created for a commit hash, it should load and return the store at that uri instead of creating a fresh one.
  3. The core processor doesn't have a method to drive evaluation for a task instance.

    1. Let's add a new method to listener_core.py: evaluate(repo_full_name: str, instance_id: str, problem_statement: str, commit_hash: str). The method should:
      1. Create a project instance.
      2. Reset the project repo to commit_hash.
      3. Create a vector store from the source folder at commit_hash, using the new method introduced above in vector_store_utils.
      4. Create a semantic_vector_search localizer with this vector store.
      5. Localize.
      6. Suggest changes.
      7. Dump the change suggestions to a file.
      8. Reset the project repo back to HEAD.
    2. For storing vector stores and change suggestions for evaluation:
      1. Let's create a folder in the project metadata named evaluation.
      2. For each evaluation instance_id, create a folder inside it named <instance_id>.
      3. The vector store for an instance_id and a specific source type should be named <vector-type>_vector_store.db; e.g., for source code embeddings, code_vector_store.db.
      4. Change suggestions should be dumped to a file named change_suggestions.md in the <instance_id> folder.
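
The storage layout described above could be computed by a small helper along these lines. This is a sketch only: `evaluation_paths` and its `vector_type` parameter are hypothetical names introduced for illustration; the folder and file names are the ones proposed in this issue.

```python
import os
from typing import Dict


def evaluation_paths(metadata_folder: str, instance_id: str,
                     vector_type: str = "code") -> Dict[str, str]:
    """Return the per-instance evaluation folder and the files stored in it.

    Layout: <metadata_folder>/evaluation/<instance_id>/
              <vector_type>_vector_store.db
              change_suggestions.md
    """
    folder = os.path.join(metadata_folder, "evaluation", instance_id)
    return {
        "folder": folder,
        "vector_store": os.path.join(folder, f"{vector_type}_vector_store.db"),
        "change_suggestions": os.path.join(folder, "change_suggestions.md"),
    }
```

The helper only builds paths; creating the folder (for example with os.makedirs(..., exist_ok=True)) is left to the caller.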
se-agent commented 3 days ago

To address the challenges outlined for evaluating the se-agent using the SWE-bench dataset, the following changes and additions to the existing codebase should be implemented:

1. Modify Project class to support offline evaluation

Changes in se_agent/project.py:
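
The code for this change is not reproduced in this thread. As a rough sketch only, and not the actual patch, a Project that defers GitHub authentication might look like the following. The constructor argument order follows the call shown later in this thread; the folder layout, the `get_github` name, and the use of PyGithub's `Auth.Token` are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProjectInfo:
    repo_full_name: str  # e.g. "owner/repo"


class Project:
    """Sketch of a Project that can be constructed without network access."""

    def __init__(self, github_token: Optional[str], projects_store: str,
                 info: ProjectInfo):
        self.github_token = github_token
        self.info = info
        # Assumed on-disk layout; the real project layout may differ.
        base = os.path.join(projects_store, info.repo_full_name)
        self.repo_folder = os.path.join(base, "repo")
        self.metadata_folder = os.path.join(base, "metadata")

    def get_github(self):
        """Authenticate on demand instead of at construction time."""
        if not self.github_token:
            raise RuntimeError("A GitHub token is required for online operations")
        # Imported lazily so offline evaluation does not need PyGithub.
        from github import Auth, Github
        return Github(auth=Auth.Token(self.github_token))
```

Methods such as clone_repository or post_issue_comment would call get_github() when they run, so constructing a Project for a pre-cloned repository never touches the network.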

2. Add functionality to reset repository state and handle vector stores

Changes in se_agent/project.py:

Changes in se_agent/vector_store_utils.py:
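
The code for this change is missing from the thread as well. The sketch below illustrates only the behaviour requested in the issue: relative file paths as ids, the same path in metadata.filepath, and reuse of a store already built for a commit hash. It is not the real implementation. `JsonVectorStore` is a toy stand-in for an actual VectorStore, the `Embeddings` protocol assumes just an `embed_documents` method, and the registry is kept in memory, whereas a real version would need to persist it.

```python
import json
import os
from typing import Dict, List, Optional, Protocol


class Embeddings(Protocol):
    """Assumed minimal embeddings interface."""
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


class JsonVectorStore:
    """Toy store persisted as a single JSON file (stand-in, not a real DB)."""

    def __init__(self, uri: str, records: Dict[str, dict]):
        self.uri = uri
        self.records = records  # id (relative path) -> vector + metadata

    def save(self) -> None:
        os.makedirs(os.path.dirname(self.uri) or ".", exist_ok=True)
        with open(self.uri, "w", encoding="utf-8") as f:
            json.dump(self.records, f)

    @classmethod
    def load(cls, uri: str) -> "JsonVectorStore":
        with open(uri, encoding="utf-8") as f:
            return cls(uri, json.load(f))


# commit_hash -> uri of the store built for that commit (in memory only)
_STORES_BY_COMMIT: Dict[str, str] = {}


def create_vector_store(source_dir: str, uri: str, embeddings: Embeddings,
                        commit_hash: Optional[str] = None) -> JsonVectorStore:
    # Reuse a store previously created for this commit, if it still exists.
    cached_uri = _STORES_BY_COMMIT.get(commit_hash) if commit_hash else None
    if cached_uri and os.path.exists(cached_uri):
        return JsonVectorStore.load(cached_uri)

    paths: List[str] = []
    texts: List[str] = []
    for root, _dirs, files in os.walk(source_dir):
        for name in sorted(files):
            full_path = os.path.join(root, name)
            try:
                with open(full_path, encoding="utf-8") as f:
                    text = f.read()
            except UnicodeDecodeError:
                continue  # skip files that are not UTF-8 text
            texts.append(text)
            paths.append(os.path.relpath(full_path, source_dir))

    vectors = embeddings.embed_documents(texts) if texts else []
    records = {
        path: {"vector": vector, "metadata": {"filepath": path}}
        for path, vector in zip(paths, vectors)
    }
    store = JsonVectorStore(uri, records)
    store.save()
    if commit_hash:
        _STORES_BY_COMMIT[commit_hash] = uri
    return store
```

Keying the cache on the commit hash means repeated evaluations of tasks that share a base commit pay the embedding cost only once.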

3. Add evaluation method to listener_core

Changes in se_agent/listener_core.py:

These changes will allow the se-agent to be evaluated using the SWE-bench dataset in an offline manner, operating on specific commit snapshots, and storing both vector stores and change suggestions for each evaluation instance.

pdhoolia commented 3 days ago

Can you please document the evaluate method in listener_core.py using proper Python docstrings?

se-agent commented 3 days ago

Certainly! Here's the evaluate method with detailed Python doc comments:

def evaluate(repo_full_name: str, instance_id: str, problem_statement: str, commit_hash: str):
    """
    Evaluates the se-agent for a given task instance using the SWE-bench dataset.

    This method sets up a project from a pre-cloned repository, resets it to a specific
    commit, and then performs localization and change suggestion tasks. It stores the
    results in a designated evaluation directory.

    Args:
        repo_full_name (str): The full name of the repository (e.g., "owner/repo").
        instance_id (str): The unique identifier for the evaluation task instance.
        problem_statement (str): The problem statement or issue body to be evaluated.
        commit_hash (str): The commit hash to which the repository should be reset.

    Steps:
        1. Initialize a Project instance with the provided repository information.
        2. Reset the repository to the specified commit hash.
        3. Create an evaluation directory under the project's metadata folder.
        4. Create a vector store for the code at the specified commit.
        5. Use SemanticVectorSearchLocalizer to localize relevant files.
        6. Generate change suggestions based on the localization results.
        7. Dump the change suggestions to a markdown file in the evaluation directory.
        8. Reset the repository back to the HEAD of the main branch.

    Raises:
        Exception: Propagates any exceptions encountered during the evaluation process.
    """
    projects_store = os.getenv('PROJECTS_STORE')
    github_token = os.getenv('GITHUB_TOKEN')

    # Create ProjectInfo and Project instance
    project_info = ProjectInfo(repo_full_name=repo_full_name)
    project = Project(github_token, projects_store, project_info)

    # Reset repo to the specified commit hash
    project.reset_to_commit(commit_hash)

    try:
        # Prepare evaluation directory
        evaluation_folder = os.path.join(project.metadata_folder, 'evaluation', instance_id)
        os.makedirs(evaluation_folder, exist_ok=True)

        # Create a vector store for the specific commit
        vector_store_uri = os.path.join(evaluation_folder, 'code_vector_store.db')
        embeddings = fetch_llm_for_task(TaskName.EMBEDDING)
        vector_store = create_vector_store(project.module_src_folder, vector_store_uri, embeddings, commit_hash)

        # Create a SemanticVectorSearchLocalizer with this vector store
        localizer = SemanticVectorSearchLocalizer(vector_store)

        # Localize and suggest changes; the problem statement stands in for the issue title
        analysis_results = {
            'title': problem_statement,
            'description': '',
            'conversation': []
        }
        filepaths = localizer.localize(issue=analysis_results, top_n=TOP_N)
        change_suggestions = suggest_changes(project, analysis_results, filepaths)

        # Dump change suggestions to a file
        change_suggestions_path = os.path.join(evaluation_folder, 'change_suggestions.md')
        with open(change_suggestions_path, 'w', encoding='utf-8') as f:
            f.write(change_suggestions)
    finally:
        # Reset the repo back to HEAD, even if a step above raised;
        # the exception still propagates to the caller
        project.reset_to_commit()

This documentation provides an overview of the method's purpose, describes the parameters, outlines the steps performed, and indicates potential exceptions that might be raised during execution.