pdhoolia / se-agent

Software Engineering Agent
GNU General Public License v3.0

Expand Scope to Include Non-Code Files for Comprehensive Repository Understanding, localization, and change suggestion #15

Open pdhoolia opened 3 hours ago

pdhoolia commented 3 hours ago

ATM we only consider code files (e.g., .py) for semantic understanding, issue localization, and change suggestions. We should cover all the files in the repository.

se-agent commented 3 hours ago

To include non-code files in repository understanding, localization, and change suggestion, several components of the system need to change so that documentation, configuration files, deployment scripts, and notebooks are considered alongside code files. Suggested modifications for each relevant file follow:

1. file_analyzer.py

def prompt_generate_semantic_description(content, file_type):
    """ Prompt for generating semantic description of a file based on its type. """

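    # Keys are file extensions (without the leading dot), matching the
    # file_type that generate_semantic_description derives from the path.
    file_description_map = {
        'py': 'python file',
        'md': 'markdown document',
        'yaml': 'YAML configuration',
        'yml': 'YAML configuration',
        'json': 'JSON configuration',
        'ipynb': 'Jupyter notebook',
        # Add more file types as needed
    }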

    file_description = file_description_map.get(file_type, 'file')

    prompt = f"""
Understand the following {file_description} and generate a semantic description for it in markdown format.

{content}


Generated document should follow this structure:

Semantic Summary

A brief semantic summary of the entire file. This should not exceed 100 tokens.

Structures

List of relevant structures, sections, or components in the file with a brief semantic summary for each. Individual summaries should not exceed 50 tokens.


"""
    return prompt

def generate_semantic_description(filepath):
    """
    Generate a semantic description for a file using LLM.
    Args:
        filepath (str): The path to the file.
    Returns:
        str: The generated semantic description.
    """
    if os.path.getsize(filepath) == 0:
        return None

    # Determine file type (e.g., by extension)
    _, file_extension = os.path.splitext(filepath)
    file_type = file_extension.lstrip('.').lower()

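    # Read as UTF-8 and replace undecodable bytes so unusual files do not raise.
    with open(filepath, 'r', encoding='utf-8', errors='replace') as file: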
        content = file.read()

    prompt = prompt_generate_semantic_description(content, file_type)
    return extract_code_block_content(
        call_llm_for_task(
            task_name=TaskName.GENERATE_FILE_SUMMARY,
            messages=[
                {"role": "system", "content": "You are an expert on generating semantic descriptions for various file types."},
                {"role": "user", "content": prompt}
            ]
        ).content
    )
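
Extension-based detection misses files that have no extension, such as a `Dockerfile` or `Makefile`, which are exactly the deployment files this issue wants covered. The following is a minimal sketch of one way to resolve a description from either the file name or the extension. The function `describe_file_type` and both lookup tables are hypothetical names introduced for illustration, not part of the existing code.

```python
import os

# Hypothetical lookup tables; extend them to match the repository at hand.
EXTENSION_TYPES = {
    ".py": "python file",
    ".md": "markdown document",
    ".yaml": "YAML configuration",
    ".yml": "YAML configuration",
    ".json": "JSON configuration",
    ".ipynb": "Jupyter notebook",
    ".sh": "shell script",
}
FILENAME_TYPES = {
    "dockerfile": "Dockerfile",
    "makefile": "Makefile",
}


def describe_file_type(filepath):
    """Return a human-readable file description, falling back to 'file'."""
    name = os.path.basename(filepath).lower()
    # Well-known file names take precedence because they carry no extension.
    if name in FILENAME_TYPES:
        return FILENAME_TYPES[name]
    _, extension = os.path.splitext(name)
    return EXTENSION_TYPES.get(extension, "file")
```

The returned string could be interpolated into the prompt in place of `file_description`, so unknown files still get the generic wording.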

2. change_suggester.py

def prompt_generate_change_suggestions(issue_analysis, file_suggestions, files):
    """ Generates the prompt to localize issue to specific files and suggest changes. """

    messages = []

    system_message = {
        'role': 'system',
        'content': f"""You are an AI assistant that specializes in analyzing issues, understanding various types of files, and suggesting changes to address those issues.

Following files have been suggested as relevant to the issue and discussion:

[FILE-SUGGESTIONS-START]
{file_suggestions}
[FILE-SUGGESTIONS-END]

Here are the corresponding files:
{files}

Based on the issue details and the ensuing discussion, please suggest changes to these files (or any new code), along with your reasoning. Consider the context provided by all types of files."""
    }
    messages.append(system_message)

    conversation = issue_analysis.get('conversation', [])
    for message in conversation:
        role = 'user' if message['role'] == 'user' else 'assistant'
        messages.append({'role': role, 'content': message['content']})

    return messages
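
The prompt above receives `files` as an opaque value. With mixed file types it may help to delimit each file explicitly so the model can tell a YAML file from a Python module. The sketch below assumes the files are available as a `{path: content}` dictionary; the helper `format_files_for_prompt` and the `[FILE-START]`/`[FILE-END]` markers are hypothetical, chosen to mirror the `[FILE-SUGGESTIONS-START]` convention already used in the prompt.

```python
def format_files_for_prompt(file_contents):
    """Render a {path: content} dict as delimited sections for the prompt."""
    sections = []
    # Sort paths so the prompt is stable across runs.
    for path in sorted(file_contents):
        body = file_contents[path].rstrip()
        sections.append(f"[FILE-START: {path}]\n{body}\n[FILE-END: {path}]")
    return "\n\n".join(sections)
```

The result could then be passed as the `files` argument of `prompt_generate_change_suggestions`.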

3. issue_analyzer.py

No changes are needed unless file content should be included in the issue conversation context.

4. package_summary.py

def prompt_generate_package_summary(package_name, documentation):
    """ Generates the prompt for summarizing a package, considering both code and non-code files. """

    prompt = f"""
Understand the following hierarchical documentation for package {package_name}, with semantic description of sub-packages, files, classes, functions, and other structures contained.

```markdown
{documentation}

Now generate an abstractive package summary in markdown format with the following structure:

# <Package Name>

## Semantic Summary
A very crisp description of the full package semantics. This should not exceed 150 tokens.

## Contained structure names
Just a comma-separated listing of contained sub-package, file, class, function, structure, or document names. E.g.,
`package1`, `sub_package`, `file_name.py`, `ClassName`, `function_name`, `doc.md`, `config.yaml` ...

Note: Whole package summary should not exceed 512 tokens.
"""
    return prompt


### 5. `localizer.py`
- Update the localization logic to include non-code files.
- Adjust the file localization suggestion logic to accommodate non-code files.

```python
class FileLocalizationSuggestion(BaseModel):
    package: str
    file: str
    confidence: float
    reason: str

def prompt_localize_to_files(issue_analysis, package_details):
    """ Generates the prompt to localize issue to specific files, including non-code files. """

    messages = []

    system_message = {
        "role": "system",
        "content": f"""You are an AI assistant that specializes in localizing issues to related files based on semantic summaries of packages and files including non-code files.

You return files that are most relevant to the issue in the following JSON format:

```json
{{
  "file_localization_suggestions": [
    {{
      "package": "<Fully qualified package name>",
      "file": "<Name of the file>",
      "confidence": <a floating point number between 0 and 1 with two decimal points indicating the confidence in the suggestion>,
      "reason": "<An explanation of the relevance of this file for the issue (not to exceed 50 tokens)>"
    }}
  ]
}}

Following are the semantic summaries of the files (and their containing packages) that you can refer to:

{package_details}

DO NOT TRY TO SOLVE THE ISSUE. JUST LOCALIZE IT TO THE MOST RELEVANT FILES AND RETURN THE file_localization_suggestions JSON OBJECT.
""" + FILE_LOCALIZATION_SUGGESTIONS_FORMAT_INSTRUCTIONS
    }
    messages.append(system_message)

    conversation = issue_analysis.get('conversation', [])
    for message in conversation:
        role = 'user' if message['role'] == 'user' else 'assistant'
        messages.append({'role': role, 'content': message['content']})

    return messages


With these changes, the system can consider a broader range of files when analyzing issues, localizing them, and suggesting changes, which gives it a more complete view of the repository and of the modifications an issue requires.
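
Covering every file also means deciding which files can be read as text at all, since `generate_semantic_description` opens files in text mode and a repository may contain images or other binaries. The following is a minimal sketch of how the file walk might filter them, using a NUL-byte heuristic. The names `iter_text_files`, `is_probably_text`, and `SKIPPED_DIRS` are hypothetical, and the skipped directory list is an assumed default.

```python
import os

SKIPPED_DIRS = {".git", "__pycache__", "node_modules"}  # assumed defaults


def is_probably_text(filepath, sample_size=1024):
    """Heuristic: treat a file as text if its first bytes contain no NUL byte."""
    with open(filepath, "rb") as handle:
        return b"\0" not in handle.read(sample_size)


def iter_text_files(repo_root):
    """Yield repo-relative paths of non-empty text files under repo_root."""
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIPPED_DIRS]
        for filename in sorted(filenames):
            full_path = os.path.join(dirpath, filename)
            if os.path.getsize(full_path) > 0 and is_probably_text(full_path):
                yield os.path.relpath(full_path, repo_root)
```

Each yielded path could then be passed to `generate_semantic_description`. The heuristic is deliberately cheap: it can misclassify text stored in encodings that contain NUL bytes, such as UTF-16.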