possee-org / genai-numpy


Narrative: Process used to create Gen-AI examples #124

Closed bmwoodruff closed 1 month ago

bmwoodruff commented 1 month ago

This issue aims to provide an organized, cohesive narrative of the process we used to generate the first set of PRs submitted to NumPy.

We were tasked by Travis Oliphant, as part of one of the first POSSEE pods, to use AI to help contribute to and improve NumPy. The idea was to help with (1) documentation, (2) unit testing, (3) issue resolution, and/or (4) PR resolution. We started with documentation, and in particular example generation, since it was a low-stakes place to start testing.

  1. We wanted to make sure we understood the codebase enough to help contribute. We have been working for several months to learn the PR process and familiarize ourselves with the codebase. I have been attending community, triage, and documentation meetings somewhat regularly.

  2. For prompt engineering, we tried few-shot and zero-shot prompting. We also tried using RAGNA to incorporate web-scraped examples, which led to a discussion about ethical IP issues. I used AI to help create this zero-shot prompt, after seeing another GitHub repo use zero-shot prompting to improve test coverage.

    • A lengthy series of discussions with GPT-4o helped me create a prompt that generates examples similar to the basic, generic examples that already appear in the codebase.
    • Originally, the AI wanted to add multiple sentences between each set of examples, which doesn't match the majority of the codebase. Some places have no explanation (see cond and matrix_rank), some have a short sentence (see vecdot and matmul), and the majority of places in the linalg module have a brief header ending with a colon. I decided to go with the colon approach.
    • I knew I would not be generating high-quality examples such as the Mandelbrot example in outer.
    • The prompt I settled on asks the AI to examine the existing examples, create a new one if needed, and repeat the process until no new examples are needed. The last step of the prompt is to explain why each new example was added (GPT-4o did a much better job of this than Llama3-70B).
    • The prompt was a collaborative effort with AI. When it didn't generate what I wanted, I asked it how I could improve the prompt. Then I would start a new session, input the new prompt, and ask for help revising it again. A rough sketch of the resulting prompt structure is shown below.
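A rough illustration of a prompt in this style (the exact wording lives in example-generator.py; build_prompt is a hypothetical helper, not the real code):

```python
# Illustrative only: a zero-shot prompt in the spirit described above.
# The actual prompt is in example-generator.py and differs in wording.
PROMPT_TEMPLATE = """\
Below is the current docstring for `{func_name}` from NumPy, including its
Examples section.

{docstring}

Examine the existing examples. If an important use case is not covered, add ONE
new example: a brief header ending with a colon, followed by a doctest-style
snippet. Repeat until no new examples are needed. Finally, explain in one
sentence why each new example was added.
"""


def build_prompt(func_name, docstring):
    """Fill the template for a single NumPy function (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(func_name=func_name, docstring=docstring)
```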
  3. I ran the prompt on the functions in the linalg module. It generated basic examples, sufficient (IMHO) to match the quality of many of the simpler examples in the codebase. The example-generator.py script has the full details.

    • Using Llama3-8B always resulted in too many repeated, useless examples. While generation was fast, it almost always hit the hard cap I set of 15 new examples (way too many), and the output was very low quality (basically garbage).
    • Using Llama3-70B created decent (basic quality) examples with few repeats (they happen, and can be removed manually with minimal effort).
    • I tried a lower temperature, which resulted in more repeats, and settled on a temperature of 0.85 (see the sketch below). I have not experimented much with other model parameters.
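For reference, those settings amount to a call roughly like the following, assuming an OpenAI-compatible endpoint in front of a locally served model; the endpoint URL, model name, and helper name are placeholders, not our actual configuration:

```python
# Sketch under assumptions: a local OpenAI-compatible server hosting Llama3-70B.
# The base_url and model name below are placeholders, not the real setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def request_examples(prompt):
    """Make one generation call with the settings that worked best for us."""
    response = client.chat.completions.create(
        model="llama3-70b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.85,  # lower temperatures produced more repeated examples
    )
    return response.choices[0].message.content
```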
  4. I ran example-generator.py on 829 functions across the entire codebase and stored the AI output in log files.

    • At 3-4 tokens per second, it took the script about 1 full day (24 hours) to run across the entire codebase.
    • This slow processing time killed any desire to try more advanced prompting techniques that ask the AI to refine its generated output.
    • With better hardware, the processing time could be brought from 24 hours to under 8 minutes. For example, using Groq and paying for API access would be over 200 times faster. My guess is that inference will only get faster, enabling more advanced prompting techniques and higher-quality examples, all while using less energy.
  5. The last step was to automate the process of getting the examples into the codebase. The key functions for doing this are in example_post_processing.py. A few notes:

    • Generated code generally runs.
    • The output shown for an example is often hallucinated. We strip the generated output, run the example, and insert the actual output.
    • The AI often fails with special characters such as . and _, and it cannot reliably copy text verbatim. The fuzzywuzzy package helps identify lines that should be identical.
    • Next time I write a prompt to run on the entire codebase, I will try alternate output formats to make it simpler to extract new examples for insertion. Plenty of revision is possible here. I learned a ton while writing the example_post_processing.py script that I wish I had known before I started creating prompts. A stripped-down sketch of this post-processing follows the list.
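A simplified sketch of that post-processing (the real logic is in example_post_processing.py; the helpers below are illustrations, not the actual functions):

```python
# Simplified illustration; the actual logic lives in example_post_processing.py.
import contextlib
import io

import numpy as np
from fuzzywuzzy import fuzz


def strip_fake_output(example_lines):
    """Keep only the '>>>' and '...' source lines; the output the model wrote
    after them is frequently hallucinated, so we discard it."""
    return [ln.lstrip() for ln in example_lines
            if ln.lstrip().startswith((">>>", "..."))]


def split_statements(source_lines):
    """Group each '>>>' line with its '...' continuations into one statement."""
    statements, current = [], []
    for line in source_lines:
        if line.startswith(">>>"):
            if current:
                statements.append("\n".join(current))
            current = [line[4:]]
        else:  # continuation line starting with '...'
            current.append(line[4:])
    if current:
        statements.append("\n".join(current))
    return statements


def run_and_capture(statements):
    """Execute the statements the way a REPL would and capture the real output,
    which then replaces whatever output the model invented."""
    namespace = {"np": np}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        for stmt in statements:
            exec(compile(stmt + "\n", "<example>", "single"), namespace)
    return buf.getvalue()


def near_duplicate(line, existing_lines, threshold=90):
    """Flag a generated line that is almost (but not exactly) a copy of a line
    already present in the docstring; the model cannot reliably copy verbatim."""
    return any(fuzz.ratio(line, old) >= threshold for old in existing_lines)
```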

When I finally got things working, I was ready to ask the bigger question: should we include all of these examples? The simple answer is "NO!" We should definitely not add all 1000+ generated examples to the codebase. What we have is a proof of concept that it could be done. Improvements can be made, and better prompts can generate better examples, but we do have a POC that injects examples into the codebase.

Why did we submit some examples for inclusion in the codebase? The interns will be done with their internship shortly, and after a discussion at the triage meeting we were encouraged to submit a few examples for functions that currently have none.

What would help create better examples? There is currently no standard for what a "good" example should be, and the examples in the codebase are all over the place in quality. Creating a standard for what we want examples to look like would make it much easier to get AI to generate the kinds of examples we want. We could use AI to help standardize the examples throughout the codebase, keeping the existing examples and adding basic or more advanced ones where needed. That's a project for another POSSEE pod at another time, but I truly think we could use AI to help develop a consistent look for examples.

rgommers commented 1 month ago

Thanks for this write-up @bmwoodruff. The process is pretty well thought out.