possee-org / genai-numpy

MIT License

Task: Automate processing of generated example logs #77

Closed bmwoodruff closed 1 month ago

bmwoodruff commented 3 months ago

Description:

With the generated example logs created (almost 1000), we need a way to automate processing them. This entails multiple things.

Note that some generated examples suggest a good idea for an example, but the example itself is garbage. For these, I'd like a special note requesting human intervention so the idea is preserved.

In other places, there are preferred settings for how a function should be called. For example,

Acceptance Criteria:

bmwoodruff commented 3 months ago

I'm thinking a report may not be needed. Just inject the code into the proper place in the codebase while working on a branch, and let VS Code's side-by-side Source Control view be the report.

bmwoodruff commented 3 months ago

I'm working on building an example extractor to take out the new examples, and only the new examples, from the generated files. I figured I'd report a bit on the intermediate progress.

That's enough of a report for now. I wanted to keep track of what I'm doing. I think I have enough written to automate direct example injection into the codebase for the 590 functions (followed by human review). We can do this one module at a time. I first want to polish up the scripts that will do the post-processing (the functions are currently in examples/extract_new_exmaples.py). Once things are polished up, I'll add docstrings, get rid of my silly debugging hacks, and then hopefully we can use them to inject a thousand or more examples into the codebase.
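The extractor itself lives in examples/extract_new_exmaples.py and isn't reproduced here, but the core idea of pulling doctest-style example blocks out of a generated log can be sketched like this (a minimal illustration, not the actual script):

```python
def extract_example_blocks(text: str) -> list[str]:
    """Split text into doctest-style example blocks.

    A block starts at a '>>> ' line and collects continuation and
    expected-output lines until the next blank line. (Sketch only;
    the real extractor handles more edge cases.)
    """
    blocks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">>> "):
            if current:
                blocks.append("\n".join(current))
            current = [stripped]
        elif current and stripped:
            current.append(stripped)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks

log = ">>> np.add(1, 2)\n3\n\n>>> np.hypot(3, 4)\n5.0\n"
print(extract_example_blocks(log))
```

Once the blocks are isolated, deciding which ones are genuinely new is a separate comparison step.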

bmwoodruff commented 3 months ago

I got excited and wanted to share. I'm going to use the fuzzywuzzy package to do text comparisons.
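fuzzywuzzy's fuzz.ratio(a, b) returns a 0-100 similarity score; its pure-Python fallback builds on the stdlib's difflib.SequenceMatcher, so the idea can be shown without the package installed (a stand-in on the same scale, not the library itself):

```python
import difflib

def ratio(a: str, b: str) -> int:
    """0-100 similarity score, mirroring the scale of fuzzywuzzy's fuzz.ratio."""
    return round(difflib.SequenceMatcher(None, a, b).ratio() * 100)

# Near-duplicate generated examples score high; unrelated text scores low,
# which is what lets us flag examples that already exist in a docstring.
print(ratio("np.mean(a, axis=0)", "np.mean(a, axis=0)"))
print(ratio("np.mean(a, axis=0)", "plt.show()"))
```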

bmwoodruff commented 3 months ago

Well, I made a lot of progress.

We have code now that does the following:

I think those are the key bits for an automated workflow. I'll work on polishing it up tomorrow.

bmwoodruff commented 3 months ago

I'm going to work on "Algorithmically locate the proper spot in numpy codebase to insert examples, and then insert examples" next.
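A sketch of the "locate the proper spot" step, assuming numpydoc-style docstrings (a section name underlined with dashes); the real logic will need to be fussier about indentation and surrounding sections:

```python
def locate_examples_section(docstring: str):
    """Find the numpydoc 'Examples' header in a docstring.

    Returns the line index of the header, or None if the docstring
    has no Examples section yet. (Illustrative sketch only.)
    """
    lines = docstring.splitlines()
    for i in range(len(lines) - 1):
        underline = lines[i + 1].strip()
        if lines[i].strip() == "Examples" and underline and set(underline) == {"-"}:
            return i
    return None

doc = """Add two arrays.

Examples
--------
>>> np.add(1, 2)
3
"""
print(locate_examples_section(doc))
print(locate_examples_section("No examples here."))
```

If the header is found, new examples can be appended at the end of that section; if not, a fresh Examples section goes at the end of the docstring.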

Thoughts:

I don't think anyone will want 1000+ examples sent in for review at once, but tackling an entire module (with the exception of ma and np) could be desirable. Then, if at some point the devs want to remove the AI-generated examples (maybe legal issues will hit us all), it can be done easily.

A consistent commit message would be nice. Here is a proposed option adapted from a discussion with @otieno-juma:

DOC: AI-Gen examples for ...

Examples created by Llama3-70B. Reviewed and modified as part of POSSEE.

Co-authored-by: Ben Woodruff <bmwoodruff@gmail.com>
[skip actions] [skip azp] [skip cirrus]

Not sure if adding my name to all of them is needed. My thought is that this would provide a git blame trail that includes me in addition to the interns, in case someone wants more information.
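For reference, a shell sketch that assembles the proposed message into a file that git commit -F can consume (the module name is a placeholder; note the blank line before the trailers, which git needs in order to parse Co-authored-by correctly):

```shell
# Assemble the proposed commit message; "numpy.char" is a placeholder module.
module="numpy.char"
msg_file=$(mktemp)
cat > "$msg_file" <<EOF
DOC: AI-Gen examples for ${module}

Examples created by Llama3-70B. Reviewed and modified as part of POSSEE.

Co-authored-by: Ben Woodruff <bmwoodruff@gmail.com>
[skip actions] [skip azp] [skip cirrus]
EOF
# git commit -F "$msg_file"   # run from the working branch
cat "$msg_file"
```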

bmwoodruff commented 3 months ago

PRs #98, #100, and #101 are all related to this task. #101 took forever, as I could not figure out why escape characters were being removed. I'll record what I learned here, as all solutions from AI were garbage (and hopefully it will find this content when trained in the future).

You can see the problem with disappearing escape sequences with the following minimal example.

import re

content = 'String with \n new lines and \\\\ some \\ backslashes.\n We need a few more \\\\ to help \\ see the problem.'
old_phrase = 'String with \n new lines and \\\\ some \\ backslashes.'
new_phrase = 'String with \n new lines and \\\\ some \\ backslashes.'
pattern = re.compile(re.escape(old_phrase), re.MULTILINE)
new_content = pattern.sub(new_phrase, content)
new_content

The output string is below, and you can clearly see how the replaced content now has lost half the escape characters.

'String with \n new lines and \\ some \\ backslashes.\n We need a few more \\\\ to help \\ see the problem.'

To fix this, just replace all \\ with \\\\ in new_phrase before using pattern.sub. The updated code is

import re

content = 'String with \n new lines and \\\\ some \\ backslashes.\n We need a few more \\\\ to help \\ see the problem.'
old_phrase = 'String with \n new lines and \\\\ some \\ backslashes.'
new_phrase = 'String with \n new lines and \\\\ some \\ backslashes.'
pattern = re.compile(re.escape(old_phrase), re.MULTILINE)
new_content = pattern.sub(new_phrase.replace('\\','\\\\'), content)
new_content

The output is now the correct string:

'String with \n new lines and \\\\ some \\ backslashes.\n We need a few more \\\\ to help \\ see the problem.'

Wasted almost a day on this ...
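Another way to sidestep the problem entirely: re.sub also accepts a callable as the replacement, and a callable's return value is inserted verbatim, with no backslash-template processing at all.

```python
import re

content = 'String with \\\\ some \\ backslashes.'
old_phrase = 'String with \\\\ some \\ backslashes.'
new_phrase = 'String with \\\\ some \\ backslashes.'

pattern = re.compile(re.escape(old_phrase))
# A function replacement bypasses re's \-escape handling of the
# template string, so every backslash in new_phrase survives intact.
new_content = pattern.sub(lambda match: new_phrase, content)
print(new_content == content)
```

This avoids having to remember the replace('\\','\\\\') dance whenever the replacement text might contain backslashes.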

bmwoodruff commented 3 months ago

What's left?

These I think are crucial before it's ready to use:

These are polishing updates to work on after we get something working:

bmwoodruff commented 3 months ago

PRs #102, #103, #104 are also connected to this task.

As I'm wrapping up automation, I have some new thoughts.

Right now we have examples, lots of them, that we could add into the codebase. Whether or not we add them was never the goal of this project; rather, our goal was a proof of concept showing how it can be done. We can do it now.

Rather than inject these examples into the codebase, why not wait? We might as well inject examples for the functions that are missing them, but we can postpone mass example inclusion for now. We have a POC. Now we can refine the prompt(s) and see if there is buy-in from the maintainers.

I do think we should go through and clean a few modules up, with branches fully ready to be included in the main namespace, just so we know how much time that will take and we can showcase an example of the whole process. My thoughts there are:

  1. Have AI create a branch, generate examples for (a module, 20 functions, all of numpy, one at a time?) some number of functions, add all changes to that branch, commit the changes, and push the branch to a fork.
  2. Have a human review, delete, modify, etc., till the branch builds and passes all tests (ready for a PR). Squash changes (not to main, but to this first commit). This way it's completely transparent how humans revised the AI gen components.

This means that each branch would have 2 commits. The AI gen commits will most likely not pass tests.
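The two-commit workflow above can be sketched in a throwaway repo (every path, branch name, and message here is an illustrative placeholder, not the actual automation):

```shell
# Sketch of the two-commit branch workflow in a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "possee@example.com"   # placeholder identity
git config user.name "POSSEE Demo"
echo "def add(a, b): ..." > numeric.py
git add numeric.py && git commit -qm "initial"

# Commit 1: raw AI-generated examples (most likely won't pass tests yet).
git checkout -qb ai-gen-examples
echo "# >>> add(1, 2)  # AI-generated example" >> numeric.py
git add numeric.py && git commit -qm "DOC: AI-Gen examples for numeric"

# Commit 2: squashed human revisions, ready for a PR.
echo "# >>> add(1, 2)  # reviewed and corrected" >> numeric.py
git add numeric.py && git commit -qm "DOC: human review of AI-Gen examples"

git log --oneline
```

Keeping the raw AI commit separate from the squashed human-review commit is what makes the revision history transparent.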

bmwoodruff commented 1 month ago

I think it's better not to push to a branch immediately; instead, leave the changes uncommitted so it's simple to see what's changed. I'll close this as done.