AI: Numpy Issue #8378 - Githubissues

bmwoodruff commented 5 months ago

I think we can tackle the numpy issue DOC: genfromtxt missing_values documentation not clear #8378 using AI. The issue itself has a pretty clear description of what's missing (documentation about how to deal with missing values) and points to the section of code that we need to document. I used GPT4 to get a rough idea of what can be done:

GPT4 link (not sure if this is viewable by everyone) https://chat.openai.com/share/8ddf5b1f-b71d-4c3b-8ecf-1eabbc91d594

GPT4 created documentation, notes, and examples, all related to this issue.

I'd like to

Test the solutions from GPT4. Explore more prompts to see if we can get more/better examples that illustrate edge cases.
Try other AI engines (we'll have an in-house one soon) and compare results.
Create a solution that we can discuss here.
Share what we've done with charris and get feedback.
Submit a PR to numpy.

I think what we do here can be scaled up and done with a lot more issues related to documentation. After we've done 3-4 similar to this, we could probably automate combing through every documentation issue and getting a rough idea of a possible fix for many.

As a bigger dream, every time someone wants to work on an issue tagged as documentation, it would be great if they could click a button to see one or more AI-generated possible solutions.

bmwoodruff commented 4 months ago

From Anthropic's Claude, I obtained the following:

The issue here is that the genfromtxt function treats each line in the input as a single value, rather than splitting it into separate values. As a result, it doesn't recognize the individual 255 values as missing values.

Here's a modified version of your code that should work as expected:

import numpy as np

def main():
    s = ["255 255", "255 1"]
    bs = [bl.encode() for bl in s]

    m = np.genfromtxt(bs, usemask=True, missing_values=b'255', dtype=np.uint8, delimiter=' ')
    print(m)
    print(m.mask.sum())

if __name__ == "__main__":
    main()

The key change is the addition of the delimiter=' ' parameter to genfromtxt. This tells NumPy to split each line on whitespace characters, treating each resulting value as a separate element.

With this modification, the output should be:

[-- --]
[-- 1]
3

This indicates that a masked array has been created, with the two 255 values masked and replaced with the -- placeholder. The mask.sum() shows that there are 3 masked values in the array.

If you want to explicitly fill the masked values with a specific value instead of the default placeholder, you can use the fill_value parameter of genfromtxt. For example:

m = np.genfromtxt(bs, usemask=True, missing_values=b'255', dtype=np.uint8, delimiter=' ', fill_value=-1)
print(m)
print(m.mask.sum())

This will output:

[-1 -1]
[-1  1]
3

With the masked values filled with -1.
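
Since the plan is to test what the AI produces, here is a minimal sketch for checking two of the claims in the response above. It assumes a recent NumPy and, for simplicity, reads from a StringIO object instead of the encoded byte list; the variable names are only illustrative. The genfromtxt documentation describes the default delimiter=None as splitting on whitespace, and lists filling_values (not fill_value) as the keyword for replacement values, so both claims are worth running before reuse.

```python
from io import StringIO

import numpy as np

text = "255 255\n255 1"

# Claim 1: without delimiter=' ', each line is read as a single value.
# With the default delimiter=None, genfromtxt splits on whitespace,
# so the result should already have two columns.
default_split = np.genfromtxt(StringIO(text))
print(default_split.shape)

# Claim 2: genfromtxt accepts a fill_value keyword.
# The documented keyword is filling_values; an unknown keyword
# raises TypeError when the function is called.
try:
    np.genfromtxt(StringIO(text), delimiter=" ", fill_value=-1)
    fill_value_accepted = True
except TypeError as exc:
    fill_value_accepted = False
    print(exc)

print(fill_value_accepted)
```

If the shape already has two columns and the second call raises TypeError, the explanation and the fill_value example need to be corrected before anything from this response goes into the docs.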

otieno-juma commented 4 months ago

@bmwoodruff Hi Ben, I'm interested in working on this issue.

bmwoodruff commented 4 months ago

@otieno-juma right now you could resolve this issue by manually copy/pasting some examples that GenAI creates (might be nice to see if we can generate multiple with appropriate prompting techniques), and then adding an appropriate description of what the code does when there are missing values. I see two things that could be done to improve the docs and resolve this issue: the examples, and a description of how the code handles missing values. I'm guessing both could be tackled with AI.
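
For reference, the kind of example that could go into the docs might look like the sketch below. The first call is adapted from the missing_values / filling_values example in the NumPy user guide ("Importing data with genfromtxt"). The second call, with usemask=True and an empty field, is an added illustration, and the variable names are only placeholders.

```python
from io import StringIO

import numpy as np

# Per-column missing markers, each replaced by its own filling value.
data = "N/A, 2, 3\n4, ,???"
result = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    dtype=int,
    names="a,b,c",
    missing_values={0: "N/A", "b": " ", 2: "???"},
    filling_values={0: 0, "b": 0, 2: -999},
)
print(result)

# An empty field is always treated as missing; with usemask=True
# the result is a masked array whose mask marks that entry.
masked = np.genfromtxt(StringIO("1,2,3\n4,,6"), delimiter=",", usemask=True)
print(masked)
print(masked.mask.sum())
```

Keys in the two dictionaries can be column indices or column names, which is why the sketch mixes 0, "b", and 2.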

What would be really cool, and probably 6+ weeks out, is a tool that could read the issue and comments, summarize what needs to happen, realize both an example and extra description would be useful, generate both, put them in the proper spot in the codebase, and generate a PR, all with the click of a button. That might even be 20+ weeks out (long past when we've finished the POSSEE pod).

If you're up for working on this one, then please see what you can get from Nebari. Share what you get on this page (document your findings). If what you get originally doesn't work, share the failures too. When a PR is ready, we can add a link to this page in the comments to show how AI helped in the process.

When you think you have a solution ready, remember to push your changes to a branch on your personal fork and then ask for a review from me in Zulip. Have fun!

possee-org / genai-numpy
