possee-org / genai-numpy

MIT License

Ideas: Modular design for Docstrings and Example creators that produces PRs. #19

Open bmwoodruff opened 1 month ago

bmwoodruff commented 1 month ago

I'd love help brainstorming and organizing a class structure for creating PRs that inject either docstrings or examples into existing code. Here are some of the things that need to happen. I'll focus on example creation in the ideas below. A similar modular design would exist for docstrings.

  1. Locate functions that need examples (maybe have none, maybe have 1, maybe return a dataframe that we can browse to manually look for some). We've already got code to identify functions that are missing the examples section at https://github.com/numpy/numpy/issues/21351#issuecomment-1170574462. This function has to be run on the dev version of NumPy to be current. As for missing docstrings, there are plenty of private functions missing docstrings, which provide a great testing ground for this (but aren't wanted/needed in the code base?).
  2. Given a function that is missing examples, generate more (number to generate could be a parameter). Example functions used in few-shot prompting could be a parameter. Maybe letting the script randomly (or via some other option) select methods from the corresponding class file that already have examples would be better than manual. Maybe do both. Mostly we need at least one way to generate a new example. This is the AI part.
  3. Given an example and a function, inject the example into the source code.
  4. Build the docs and verify that there are no errors (minus the single error that occurs from the Warning messages related to serial writing).
  5. Run the example tester on the example code and verify all tests pass.
  6. It would be nice to have something create a file (like a Jupyter notebook or something) that we can manually open to view and test the changes ourselves.
  7. Create a branch, add the changed file to that branch, commit the changes to that branch (with an appropriate commit message including an AI generated tag).
  8. Push recent changes to GitHub. This one will require we configure authentication appropriately. This step could be left to manual until we find a secure way to do this on Nebari.
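For step 5, Python's standard-library `doctest` module could serve as the example tester: it parses the `>>>` lines of a candidate Examples section, runs them in a fresh namespace, and compares actual output against the expected output. A minimal sketch (the example text here is an illustrative stand-in, not a real NumPy candidate):

```python
import doctest

# A candidate Examples section in doctest format (illustrative).
example_text = """
>>> x = [3, 1, 2]
>>> sorted(x)
[1, 2, 3]
"""

# Parse the example, run it in a fresh namespace, and count failures.
parser = doctest.DocTestParser()
test = parser.get_doctest(example_text, globs={}, name="candidate",
                          filename="<generated>", lineno=0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(f"{results.attempted} attempted, {results.failed} failed")
```

A wrapper like this could gate step 7: only examples with zero failures get committed.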
bmwoodruff commented 1 month ago

If we have something generate examples, I was thinking it might be nice to have it generate LOTS of examples (not just 1 or 2). Then we could ask it to rank the examples, providing reasons for the ranking, and eventually let us select the ones to include. If we start with 10 or so examples and reduce to the best 3, then perhaps those get included in a PR, which allows the maintainers to reduce it to 1 or 2. This will require more feedback from the maintainers.

First, we need to produce something, so I think focusing on adding a single example to a single method, all done algorithmically, would be a great step.
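One cheap first filter in a generate-many-then-narrow pipeline is to keep only candidates that execute without raising. In the sketch below, `generate_candidates` is a hypothetical stand-in for the model call, and the snippets are plain-Python placeholders where real candidates would exercise the target NumPy function:

```python
def generate_candidates(func_name):
    """Hypothetical stand-in for the LLM call that proposes candidate
    examples for func_name. Real candidates would exercise the target
    NumPy function; these plain-Python snippets just make the flow runnable."""
    return [
        "result = sum([1, 2, 3])",
        "result = max(range(5))",
        "this is not valid python",  # a bad candidate the filter should drop
    ]

def runs_cleanly(snippet):
    """Keep only candidates that execute without raising."""
    try:
        exec(snippet, {})
        return True
    except Exception:
        return False

candidates = generate_candidates("numpy.add")
survivors = [c for c in candidates if runs_cleanly(c)]
print(f"{len(survivors)} of {len(candidates)} candidates run cleanly")
```

The survivors could then go to a second, model-based ranking pass before humans pick the final one or two.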

ethanchilds commented 1 month ago

I'll take a look at 1 and 5 to see what I can do there.

andrewtggreene commented 1 month ago

I'll work on 2

bmwoodruff commented 1 month ago

@andrewtggreene I'm thinking of a different way to generate examples. I don't think we have to generate both the example and its output. Instead, we generate a bunch of examples, feed them into numpy directly, and save the output. Then we combine each example with its captured output.

The point would then be to generate a variety of examples, and have AI help create examples that utilize different inputs. By not having it try to generate the solution (which would be pointless for some of the newer functions), we save time.

In addition, as I've played with this, I'm thinking that we need to provide not just the function, but also the full function name. There is no way that AI will know it has to access numpy.linalg.svdvals if I just provide it the function definition. I think it's fair to provide the full name.
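The generate-inputs-then-capture-outputs idea could be sketched as follows: the model proposes only the input statements, and we execute them ourselves, echoing each expression's repr the way the interactive interpreter would, then stitch everything into doctest format. (Plain-Python statements stand in for real NumPy calls here; the function name and structure are illustrative.)

```python
import io
from contextlib import redirect_stdout

def stitch_example(statements):
    """Execute each statement, capture what it prints or evaluates to,
    and assemble a doctest-style Examples block."""
    namespace = {}
    lines = []
    for stmt in statements:
        lines.append(f">>> {stmt}")
        buf = io.StringIO()
        with redirect_stdout(buf):
            try:
                # Try eval first so bare expressions echo their repr,
                # mimicking the interactive interpreter.
                value = eval(stmt, namespace)
                if value is not None:
                    print(repr(value))
            except SyntaxError:
                # Assignments and other statements are not expressions.
                exec(stmt, namespace)
        out = buf.getvalue()
        if out:
            lines.extend(out.rstrip("\n").split("\n"))
    return "\n".join(lines)

# The model proposes only the input statements; the outputs come from running them.
example = stitch_example(["x = [1, 2, 3]", "sum(x)"])
print(example)
```

Because the outputs are produced by actually running the code, they are correct by construction, even for functions the model has never seen.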

andrewtggreene commented 1 month ago

@bmwoodruff I agree. After reviewing the article on unit testing, I was thinking of running an initial prompt using few-shot prompting, then running the examples the AI creates and crafting a new prompt based on those results. But I think just running the initial examples and adding the result to the prompt would be more efficient.

I also ran into some issues with the AI needing to know where the files are located. I was able to craft a notebook that could run some of the prompts Llama3 was producing, but most had errors when run, either from not knowing how to access the function or from reusing the same variable throughout the examples. I think it might be helpful to add a catch for code that doesn't run properly and ask the AI to propose fixes.

I was able to get a fairly decent script running on the 70B model that produces some very workable responses. I wrapped everything in a function so we can create an object that produces the examples for us. I'll be adding the script to this repo tomorrow. I think it's a pretty decent place to start.
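The catch-and-ask-for-fixes idea could look like the retry loop below. `ask_model` is a hypothetical stand-in for the Llama3 call, hard-coded here just so the control flow is runnable:

```python
import traceback

def ask_model(prompt):
    """Hypothetical LLM call; always returns the same runnable snippet
    here so the sketch executes end to end."""
    return "result = sum([1, 2, 3])"

def generate_with_repair(prompt, max_attempts=3):
    """Run model-generated code; on failure, feed the traceback back
    to the model and retry up to max_attempts times."""
    code = ask_model(prompt)
    for attempt in range(max_attempts):
        try:
            namespace = {}
            exec(code, namespace)
            return code, namespace
        except Exception:
            err = traceback.format_exc()
            code = ask_model(
                f"{prompt}\nThis code failed:\n{code}\n{err}\nPlease fix it."
            )
    raise RuntimeError("model could not produce runnable code")

code, ns = generate_with_repair("Write an example for numpy.sum")
print(ns["result"])
```

In a real run the second `ask_model` call would include the traceback, which tends to fix the "wrong import path" and "reused variable" failures described above.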

otieno-juma commented 1 month ago

@bmwoodruff I think using a modular design for this class structure would be more systematic. Based on my research, I was able to obtain a structure we can use to guide us as we engineer the script. I also have a sample script, generated by ChatGPT-4, that I'm evaluating to get ideas on how it executes the entire process.

Here's a step-by-step approach:

Class Design:

  * Extractor: Extracts and identifies functions needing examples or docstrings.
  * Injector: Injects the content into the identified places in the code.
  * PRCreator: Creates a pull request with the changes.
  * Manager: Manages the overall process.

otieno-juma commented 1 month ago

Here is the script:

import re
import os
import inspect
from github import Github, GithubException  # requires PyGithub

class Extractor:
    def __init__(self, module):
        self.module = module

    def find_missing_examples(self):
        missing_examples = []
        for name, obj in inspect.getmembers(self.module):
            if inspect.isfunction(obj) or inspect.isclass(obj):
                docstring = inspect.getdoc(obj)
                if docstring:
                    example_present = re.search(r'Examples?\n-+\n', docstring)
                    if not example_present:
                        missing_examples.append((name, obj))
        return missing_examples

class Injector:
    def __init__(self, code_dir):
        self.code_dir = code_dir

    def inject_example(self, obj, example_text):
        source_file = inspect.getfile(obj)
        with open(source_file, 'r') as file:
            code = file.readlines()

        obj_name = obj.__name__
        start_line = 0
        for i, line in enumerate(code):
            if re.search(rf'def {obj_name}\(', line) or re.search(rf'class {obj_name}\(', line):
                start_line = i
                break

        # Match the indentation of the def/class line that was found.
        indent = ' ' * (len(code[start_line]) - len(code[start_line].lstrip()))
        example_section = f'\n{indent}Examples\n{indent}--------\n{example_text}\n'
        docstring_start = start_line
        while docstring_start < len(code) and '"""' not in code[docstring_start]:
            docstring_start += 1
        docstring_end = docstring_start + 1
        while docstring_end < len(code) and '"""' not in code[docstring_end]:
            docstring_end += 1

        # Insert just before the closing triple quote of the docstring.
        code.insert(docstring_end, example_section)

        with open(source_file, 'w') as file:
            file.writelines(code)

    def add_examples(self, missing_examples):
        example_text = """
        Example usage:
        >>> import numpy as np
        >>> x = np.array([1, 2, 3])
        >>> np.sum(x)
        6
        """
        for name, obj in missing_examples:
            self.inject_example(obj, example_text)

class PRCreator:
    def __init__(self, repo_name, branch_name, commit_message, pr_title, pr_body, token):
        self.repo_name = repo_name
        self.branch_name = branch_name
        self.commit_message = commit_message
        self.pr_title = pr_title
        self.pr_body = pr_body
        self.token = token

    def create_pr(self):
        g = Github(self.token)
        try:
            repo = g.get_repo(self.repo_name)
            repo.create_git_ref(ref=f"refs/heads/{self.branch_name}", sha=repo.get_branch("main").commit.sha)

            contents = repo.get_contents("")
            for content_file in contents:
                repo.create_file(content_file.path, self.commit_message, content_file.decoded_content, branch=self.branch_name)

            pr = repo.create_pull(title=self.pr_title, body=self.pr_body, head=self.branch_name, base="main")
            print(f"Pull request created: {pr.html_url}")
        except GithubException as e:
            print(f"Failed to create PR: {e}")

class Manager:
    def __init__(self, module, code_dir, repo_name, branch_name, commit_message, pr_title, pr_body, token):
        self.extractor = Extractor(module)
        self.injector = Injector(code_dir)
        self.pr_creator = PRCreator(repo_name, branch_name, commit_message, pr_title, pr_body, token)

    def run(self):
        missing_examples = self.extractor.find_missing_examples()
        if missing_examples:
            self.injector.add_examples(missing_examples)
            self.pr_creator.create_pr()
        else:
            print("No missing examples found.")

if __name__ == "__main__":
    import numpy as np

    MODULE = np
    CODE_DIR = "/path/to/numpy/code"  # Adjust to your local path
    REPO_NAME = "your-username/your-repo"
    BRANCH_NAME = "add-examples"
    COMMIT_MESSAGE = "Add examples to missing docstrings"
    PR_TITLE = "Add examples to missing docstrings"
    PR_BODY = "This PR adds examples to the missing docstrings in the code."
    TOKEN = "your-github-token"

    manager = Manager(MODULE, CODE_DIR, REPO_NAME, BRANCH_NAME, COMMIT_MESSAGE, PR_TITLE, PR_BODY, TOKEN)
    manager.run()
otieno-juma commented 1 month ago

@bmwoodruff I took some time to find a way to mask the token variable used to authenticate with the GitHub API, but so far I have not figured out a way to hide it. To use the PRCreator class, you need to provide a valid GitHub token when creating an instance of it. So if we decide against using a token variable, we will not have a working class that can automate the pull requests.

luxedo commented 2 weeks ago

@otieno-juma we usually store the token in environment variables. We can access those variables through the os.environ mapping, e.g.:

# my_script.py
import os
TOKEN = os.environ['GH_TOKEN']

Then:

# In the terminal running your script
export GH_TOKEN=<your token>
python my_script.py

You only have to run this export once per shell session; all subsequent commands in that session will see the variable.

To have this working in a CI/CD pipeline it's necessary to create a secret. Creating and managing secrets is very easy with GitHub Actions.
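A small defensive variant of the lookup above fails fast with a clear message instead of the bare KeyError that os.environ['GH_TOKEN'] raises when the variable is unset (the helper name is illustrative):

```python
import os

def get_token(name="GH_TOKEN"):
    """Read an API token from the environment, failing fast with a clear
    message instead of the bare KeyError that os.environ[name] would raise."""
    token = os.environ.get(name)
    if token is None:
        raise RuntimeError(f"{name} is not set; run `export {name}=<your token>` first.")
    return token
```

This keeps the token out of the source file and out of the repo, which addresses the masking concern above.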

otieno-juma commented 1 week ago

@luxedo is it possible to arrange a brief sync with you this coming week to discuss this?

luxedo commented 1 week ago

Yes, please PM me at LinkedIn or email me.