jlewi / foyle

AI For Software Operations
https://foyle.io
Apache License 2.0
81 stars 8 forks source link

User Report: Learning Isn't Happening #215

Open jlewi opened 2 weeks ago

jlewi commented 2 weeks ago

See this thread. https://discord.com/channels/1102639988832735374/1278048402567069728

User queried the logs and didn't see any entries for BuildLog.

jlewi commented 2 weeks ago

Are the logs Processing Any Log File? https://github.com/jlewi/foyle/blob/4288e91ac805b46103d94230b32dd1bc2f957095/app/pkg/analyze/analyzer.go#L200

That is logged in debug mode so we'd have to bump the log level.

@sourishkrout can you try the following

  1. Run Foyle with debug logs enabled

    foyle serve --log=debug
  2. Execute a couple cells / accept a suggestion

  3. Search the logs for the above message

     jq -c 'select(.msg == "Processing log file")' "/Users/jlewi/Library/Application Support/runme/logs/logs.2024-08-19T19:45:23.json"

Explanation

The Analyzer is the component that processes the logs. It produces traces which are then what we learn from. The message in discord showing no "Build block log" indicated no logs for blocks were being produced. So now we are trying to verify that the Analyzer is processing logs so we can narrow down the problem.

sourishkrout commented 2 weeks ago

Negative. I don't see any Processing log file logs:

https://gist.github.com/sourishkrout/bc7cdccaa828056b3f5c622679ac49bb

By the way, ☝️ would make a good troubleshooting notebook inside the foyle repo.

sourishkrout commented 2 weeks ago

@jlewi I am not sure what this logic does, but it appears my logs always wind up in this condition (see below):

https://github.com/jlewi/foyle/blob/4288e91ac805b46103d94230b32dd1bc2f957095/app/pkg/analyze/analyzer.go#L207-L209

image
sourishkrout commented 2 weeks ago

Oh, looks like that's just tracking "savepoints" to avoid double processing.

sourishkrout commented 2 weeks ago

3. Search the logs for the above message

     jq -c 'select(.msg == "Processing log file")' "/Users/jlewi/Library/Application Support/runme/logs/logs.2024-08-19T19:45:23.json"

Q: Processing log file should appear in Foyle's log and not in the Runme AI's logs, no?

sourishkrout commented 2 weeks ago

Here we go. Logs! Still not learning, though.

https://gist.github.com/sourishkrout/3cdecee8b123a436d35775cfbf4b8dac

jlewi commented 2 weeks ago

@sourishkrout. Your 100% correct it should be the Foyle Logs not the RunMe logs. I should have caught that.

By the way, ☝️ would make a good troubleshooting notebook inside the foyle repo.

That's correct. I should have done that in the first place rather than copying and pasting buggy queries.

Here you go.

https://github.com/jlewi/foyle/blob/ff414043c8996f30a7885ccaf85ac4b26b5916ed/developer_guides/troubleshoot_learning.md

Its a bit rough but try running through it and then sharing the output.

sourishkrout commented 2 weeks ago

Here are my results @jlewi:

https://gist.github.com/sourishkrout/0f77867bb35df94d1b0eb4490c8fc6ff

jlewi commented 2 weeks ago

Thanks @sourishkrout

It looks like you set LASTLOG to the RunMe logs here So all log statements below that are checking the wrong log file.

Does the formula in the notebook not work for you?

export LASTLOG=~/.foyle/logs/raw/$(ls -t ~/.foyle/logs/raw | head -n 1 )

Your earlier trace indicates that a BlockLog was properly computed for your cell so that's good.

In this particular case I wouldn't expect learning to occur. If you look at the BlockLog you can see that contents of the generatedBlock and the executedBlock are the same. The only difference is some whitespace.

You can see here that when processing examples (BlockLogs) we strip whitespace and then compare the contents. If they are the same then we don't learn from that example.

So here's the things to try

  1. Enable debug logs by running with "--level=debug"
  2. Accept a suggested edit cell
  3. Make a non whitespace change to the cell
  4. Use the notebook to get the block trace for that cell
    • Verify the executed and generated contents are different in more than just whitespace
  5. Check the logs for that cell ; there should be a message saying it learned from that example
sourishkrout commented 5 days ago

Okay, so now again with Foyle & Runme on latest main:

https://gist.github.com/sourishkrout/f075e1a7eb7b23e84854383323f666ec#file-troubleshoot_learning-01j7kh2c6177p74pfkcd6w3ns8-md

Changed the troubleshooting notebook slightly also to include a full trace of my most recent log. Is there anything you can spot in the gist, @jlewi?

jlewi commented 5 days ago

@sourishkrout so thanks to your report I was able to identify two bugs that I fixed in #233

  1. The first is a bug in the documentation; the command to check the blocklogs has changed because we now use the ConnectProtocol

    • The updated command is here
  2. The second is there was a bug in adding execution events to the blocklog

So please retry using the latest on main which has the fix. The updated command for checking the blocklog is here https://foyle.io/docs/learning/troubleshoot_learning/#did-block-logs-get-created

Your logs indicate the BlockLog is being generated

{"severity":"info","time":1726158688.428434,"caller":"analyze/analyzer.go:658","function":"github.com/jlewi/foyle/app/pkg/analyze.buildBlockLog","message":"Building block log","blockId":"01J7KH6B5YJ111KEXAAHD97XY8","block":{"id":"01J7KH6B5YJ111KEXAAHD97XY8","gen_trace_id":"da7a66391bad2f64f6f66e056eed6772","exit_code":0,"eval_mode":false,"resource_version":"c7db69f7-869b-4439-92e2-09ba254bc364","suggestion_status":"SuggestionStatusUnknown"}}

So I suspect the blocklog is there so once we start using the updated CURL command we should be able to fetch it and see what's going on.

Note that the location of the troubleshooting guide has changed since I moved it into the docs website.

sourishkrout commented 5 days ago
  • The updated command is here

Where? Did you forget to include the link?

Here's an updated Runme session gist: https://gist.github.com/sourishkrout/babd0e9f32ad8079869d4534f8b3bb4b#file-troubleshoot_learning-01j7mayp580arr60njbes5fbq6-md

jlewi commented 5 days ago

Oh sorry. I did forget the link; https://foyle.io/docs/learning/troubleshoot_learning/#did-block-logs-get-created

jlewi commented 5 days ago

Oh great looks like you found the correct link sorry about that.

It looks like things are working now.

So it looks like #233 fixed the issue with example.binpb files not being computed.

Are you observing something that leads you to think learning isn't working? Are you just not observing the quality you'd expect?

sourishkrout commented 5 days ago

Sorry the example showing up in the output isn't from my machine. The cell is echoing the static content from the original doc since it's plain text. Should have skipped executing it.

The issue I'm observing is still the same where the previously accepted, corrected, and successfully executed does not seem to get retained as part of learning.

jlewi commented 4 days ago

Ahh my mistake.

So it looks like cell 01J7KQPBYCT9VM2KFBY48JC7J0 is your cell?

  1. it looks like the block log is available and indicates different commands
    • Generated command was jq
    • Executed was a CURl
  2. There is no log message corresponding to this line indicating a new examples was detected

So there are a couple possibilities

  1. The block is never enqueued
  2. The block is enqueued but it is never reconciled
  3. Reconcile is invoked but execution is aborted before reaching this line

Blocks get enqueued here by invoking the blockNotifier. That will not happen if the blockNotifier is nil.

Block notifier gets set here

Analyzer.Run is called here.

I can't see any reason why Learner.Enqueue wouldn't be getting invoked.

Next Steps

At this point the best way to probably debug this would be to set a couple breakpoints and see what code paths are getting executed. My suggestion would be to start as follows

  1. Set a breakpoint here (https://github.com/jlewi/foyle/blob/8c57b00bc2b104e387b51e8032919d6ea90d865e/app/pkg/analyze/analyzer.go#L642

    • This will tell us whether the block is being enqueued
  2. Set a breakpoint in learner.Reconcile https://github.com/jlewi/foyle/blob/8c57b00bc2b104e387b51e8032919d6ea90d865e/app/pkg/learn/learner.go#L124

Let me know if you want to pair.

jlewi commented 3 days ago

@sourishkrout I found and fixed another bug #241 due to the logging changes. Fingers crossed that fixes things for you.

sourishkrout commented 9 hours ago

🤞 testing this now.

sourishkrout commented 8 hours ago

No dice. However, the additional cells in the troubleshooting guide might be helpful, @jlewi.

https://gist.github.com/sourishkrout/cceb035482da5da3ec5e4ce294da9207#file-troubleshoot_learning-01j80nc4fqpm7c5ffxyfpx5r5s-md

I'll try the breakpoints and see what happens. I could pair tmrw afternoon if that's easier.

sourishkrout commented 6 hours ago

It looks like I never get past this line, even though one of the two occasions was definitely a generated block:

https://github.com/jlewi/foyle/blob/main/app/pkg/learn/learner.go#L158

jlewi commented 6 hours ago

I'm sorry this is so frustrating. I really appreciate your patience.

It looks like I never get past this line, even though one of the two occasions was definitely a generated block:

Yes.

I think I know what is going on. "learner_enqueued_total" is only 1.
In #241 I changed when we notify the learner because we should only need to notify the learner after processing the executed log entry because that should necessarily be after the generated log entry.

The learner gets notified here. My assumption was that by the time the execution event gets processed the BlockLog has already been processed

But that's not guaranteed because we read the log lines in chunks and only update the blocklog after reading the entire chunk https://github.com/jlewi/foyle/blob/6035823bd821512769415825b4cccba2970ab092/app/pkg/analyze/analyzer.go#L307

So why was it working for me and not you?

So here are some examples that were produced for me

-rw-r--r--    1 jlewi  staff  30077 Sep 16 18:24 01J7YMFGD47MEG6QN46NN0GXKQ.example.binpb
-rw-r--r--    1 jlewi  staff  33214 Sep 16 18:25 01J7YQ57WRP73B0ND35NZY3MVG.example.binpb
-rw-r--r--    1 jlewi  staff   7953 Sep 17 14:34 01J80YESW7NYV5N6DX0168HNAQ.example.binpb

Two things are true in my case

I think those two things contribute to the bug being masked. In particular, the bug means the first time the learner gets notified due to the first execution event the generated block won't be part of the log. The subsequent cell executions lead to additional learner notifications being fired. Debug logs increase the number of logs in between the two execution events. As a result, were more likely to hit the chunk limit in between execution events and therefore combine the log entries and update the BlockLog with the generated block.

jlewi commented 5 hours ago

@sourishkrout Latest fix is #244

sourishkrout commented 4 hours ago

@sourishkrout Latest fix is #244

I can definitely see more activity in the logs now. However, my ~/.foyle/training remains empty, and learning still does not take effect. I made sure that in the example below, I would edit the cell a few times to achieve the desired outcome.

If you're available tomorrow afternoon, we could pair up to work through it. Looking over my shoulder might help, too.

https://gist.github.com/sourishkrout/f42a1b63578f4c95701b8ca332360caf#file-troubleshoot_learning-01j814mfadm7dw0m5b1jdgny3w-md

jlewi commented 3 hours ago

So we're making progress. I think I found the next bug.

Your counters show that there were two learning examples.

learner_blocks_processed{status="learn"} 2
learner_blocks_processed{status="nochange"} 1
learner_blocks_processed{status="notgenerated"} 1
learner_blocks_processed{status="unexecuted"} 10

So we know the code reached here https://github.com/jlewi/foyle/blob/d67a104e1263c243ad6271d5329d154626751576/app/pkg/learn/learner.go#L176

If something went wrong in the subsequent processing we should have seen an error.

I think the bug is that exampleFiles is empty then the for loop never iterates and we don't try to create any exampleFiles https://github.com/jlewi/foyle/blob/d67a104e1263c243ad6271d5329d154626751576/app/pkg/learn/learner.go#L177

getExampleFiles will return an empty list if getTrainingDirs is empty https://github.com/jlewi/foyle/blob/d67a104e1263c243ad6271d5329d154626751576/app/pkg/learn/learner.go#L252C29-L252C44

getTrainingDirs returns empty if Learner in your configuration is empty https://github.com/jlewi/foyle/blob/d67a104e1263c243ad6271d5329d154626751576/app/pkg/config/config.go#L247

This is legacy code. Prior to #211 you had to set the location of the RunMe logs and learner could be nil. Post #211 you shouldn't need to set learner.

My config looks like

learner:
    logDirs: []

Fix incoming.