Deprecated: This testing procedure is now part of the PR description
Here, I describe the live testing procedure for this feature using GitHub Actions and the ai-revision workflow.
This is the typical use case for the Manubot AI Editor.
These steps create the conda environment, install Manubot, and override the Manubot AI Editor package using this branch's version.
gh pr checkout 37
conda env create -f environment.yml -n manubot-ai-editor-testing
conda activate manubot-ai-editor-testing
pip install --upgrade manubot[ai-rev]
pip install -e .
Export your OpenAI API key:
export OPENAI_API_KEY="<api_key>"
Run all the unit tests:
pytest --runcost tests/
All unit tests should pass.
I've tested this new version of the Manubot AI Editor using the original three manuscripts we used in the preprint. I forked the original manuscript repositories here:
For each of these repositories, I've created different branches that reflect the different test cases present in the folder tests/config_loader_fixtures:
- pr37_test-both_prompts_config
- pr37_test-conflicting_promptsfiles_matchings
- pr37_test-only_revision_prompts
- pr37_test-single_generic_prompt (proofreading prompt)
- pr37_test-prompts_are_used (checks that prompts are used in each section; similar to test_prompt_config.py/test_prompts_apply_gpt3, which adds a sentinel value to the paragraph)

For convenience, here are the git commands to clone each manuscript repo:
git clone git@github.com:pivlab/phenoplier-manuscript-test.git
git clone git@github.com:pivlab/ccc-manuscript-test.git
git clone git@github.com:pivlab/manubot-gpt-manuscript-test.git
Also make sure that there is a repository secret named OPENAI_API_KEY with your API key.
Before hitting the OpenAI API, I ran a local revision model (DebuggingManuscriptRevisionModel) to ensure the right parameters, such as the prompt, are being used for each paragraph.
First, clone one of the manuscripts above, such as PhenoPLIER, and check out one of the test-case branches.
cd <MANUSCRIPT_REPO_DIR>
git checkout <TESTING_BRANCH>
Run using a local model for debugging:
manubot ai-revision \
--content-directory content/ \
--model-type DebuggingManuscriptRevisionModel
Then, I checked that {section_name}, {title}, and {keywords} are correctly replaced.

Make sure each manuscript's ai-revision workflow is installing the manubot-ai-editor version to be tested. For this, open .github/workflows/ai-revision.yaml and make sure the last line here is present:
- name: Install Manubot AI revision dependencies
  run: |
    # install using the same URL used for manubot in build/environment.yml
    manubot_line=$(grep "github.com/manubot/manubot" build/environment.yml)
    manubot_url=$(echo "$manubot_line" | awk -F"- " '{print $2}')
    pip install ${manubot_url}#egg=manubot[ai-rev]
    # install manubot-ai-editor for this PR
    pip install -U git+https://github.com/falquaddoomi/manubot-ai-editor@issue-31-customprompts-yaml
For the live tests using the OpenAI models:
- Open the ai-revision workflow.
- Click the Run workflow combobox, and enter this information in each field:
  - main
  - <BRANCH_TO_REVISE> (such as pr37_test-both_prompts_config)
  - ai-revision-<BRANCH_TO_REVISE>
- Click Run workflow and wait for it to finish.
- Click on the ai-revision workflow run and then on AI Revise:
  - In the step Install Manubot AI revision dependencies, make sure manubot-ai-editor-0.5.0 was installed.

Hi @miltondp, I added the following end-to-end tests:
- test_prompts_in_final_result: ensures that the custom prompt is present in the final markdown by using DebuggingManuscriptRevisionModel() to insert the prompt text into the manuscript
- test_prompts_apply_gpt3: runs GPT-3.5-Turbo via OpenAI's API; the prompts contain a request to insert a unique sentinel value for each file, and the test looks for the sentinel

A few notes about the GPT3 test: it uses the cost test decorator, so it's skipped by default, but it can be run by adding --runcost when you invoke pytest.

It should be in good shape for you to run your end-to-end test now, or whenever you're ready.
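For reference, here is a minimal sketch of how such an opt-in cost marker can be wired up in a pytest conftest.py; the --runcost option and cost marker names come from this discussion, but the actual wiring in the repository may differ:

# conftest.py -- sketch of an opt-in "cost" marker (assumed wiring)
import pytest

def pytest_addoption(parser):
    # --runcost enables tests that call the paid OpenAI API
    parser.addoption("--runcost", action="store_true", default=False,
                     help="run tests that incur API costs")

def pytest_configure(config):
    config.addinivalue_line("markers", "cost: test incurs OpenAI API costs")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runcost"):
        return  # run everything, including cost-marked tests
    skip_cost = pytest.mark.skip(reason="needs --runcost to run")
    for item in items:
        if "cost" in item.keywords:
            item.add_marker(skip_cost)

A test like test_prompts_apply_gpt3 would then carry a @pytest.mark.cost decorator and only run when pytest is invoked with --runcost.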
By the way, I'd noticed that the default setting for GPT3CompletionModel()'s model_engine param is text-davinci-003, which is apparently deprecated. Following the guidelines on the Deprecations page, I tried using gpt-3.5-turbo-instruct at first, but this apparently doesn't include chat functionality. I ended up using gpt-3.5-turbo and that seemed to work well. I wonder if it's worth changing the default to that model.
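For illustration, switching engines would look something like the sketch below; only the model_engine parameter name is confirmed by this discussion, and the other constructor arguments are assumptions:

from manubot_ai_editor.models import GPT3CompletionModel

# sketch: override the deprecated default engine; constructor arguments other
# than model_engine are assumptions, not the exact signature
model = GPT3CompletionModel(
    title="An example manuscript title",
    keywords=["manubot", "ai", "editor"],
    model_engine="gpt-3.5-turbo",  # instead of the deprecated text-davinci-003
)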
By the way, I'd noticed that the default setting for GPT3CompletionModel()'s model_engine param is text-davinci-003, which is apparently deprecated. Following the guidelines on the Deprecations page, I tried using gpt-3.5-turbo-instruct at first, but this apparently doesn't include chat functionality. I ended up using gpt-3.5-turbo and that seemed to work well. I wonder if it's worth changing the default to that model.
Good catch! This was updated in manubot/rootstock but it's good to have it updated here as well.
I'm starting the live test and updating my comment here with the procedure I'm using.
Current status (4/29/2024): waiting for a fix for the DebuggingManuscriptRevisionModel issue.
I followed the procedure described in this comment.
pr37_test-prompts_are_used (uses the sentinel values that @falquaddoomi wrote for the unit tests).
- (@falquaddoomi) Although the file ai_revision-config.yaml specifies front-matter, the file is not revised. I wonder if it's the content type (mostly markdown/templating) that the tool ignores, or if the - in the regex should be escaped.

pr37_test-both_prompts_config
- {title} is incorrectly replaced by debugging-manuscript
- {keywords} is incorrectly replaced by an empty string.
- {section_name} is correctly replaced by introduction
- (@falquaddoomi) Although the file ai_revision-config.yaml specifies front-matter, the file is not revised. I wonder if it's the content type (mostly markdown/templating) that the tool ignores, or if the - in the regex should be escaped.
I'm pretty sure this is because it's explicitly ignored in https://github.com/manubot/manubot-ai-editor/blob/main/libs/manubot_ai_editor/editor.py#L449. I tried not to make any possibly destructive edits to things that weren't directly involved with the custom prompts system, but I'm happy to remove that code that skips front-matter if it's no longer needed. I can also look for other exclusions of specific files if you want the custom prompts system to be the source of truth.
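For context, the kind of hard-coded exclusion I mean looks roughly like this illustrative sketch (not the actual code at that line; names are made up):

# sketch of a filename-based skip of the kind discussed above; removing it would
# leave the ignore list in ai_revision-config.yaml as the single source of truth
SKIPPED_FILES = {"front-matter"}

def should_skip(filename_section: str) -> bool:
    return filename_section in SKIPPED_FILES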
I'm pretty sure this is because it's explicitly ignored in https://github.com/manubot/manubot-ai-editor/blob/main/libs/manubot_ai_editor/editor.py#L449. I tried not to make any possibly destructive edits to things that weren't directly involved with the custom prompts system, but I'm happy to remove that code that skips front-matter if it's no longer needed. I can also look for other exclusions of specific files if you want the custom prompts system to be the source of truth.
Ah, good catch! Yes, let's remove that code and keep the configuration in one place. Thank you!
Hey @miltondp, I made a few changes that should check for the correct title and keyword replacements in the prompts. Feel free to run your tests again, and thank you.
FYI, the code that does the replacements is here, in case there's something missing that should be replaced: https://github.com/manubot/manubot-ai-editor/blob/f9f89d17d7cbd4be813ba3e2b7069b3e9420d08c/libs/manubot_ai_editor/models.py#L314-L322
It's pretty much identical to the other parts, and probably should be refactored to just use that replacement dictionary now that I'm looking at it again. In fact, I'm going to go ahead and change that, since there's no reason to compute the replacement dict twice.
EDIT: the commit after this comment factors it into a shared dict.
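Conceptually, the shared-dictionary approach looks like this illustrative sketch (the variable names are not the ones used in models.py):

# build the placeholder -> value mapping once, then apply it to every prompt
manuscript_title = "An example manuscript"
manuscript_keywords = ["genomics", "machine learning"]
section_name = "introduction"

prompt = "Proofread this {section_name} paragraph of '{title}' (keywords: {keywords})."
replacements = {
    "{title}": manuscript_title,
    "{keywords}": ", ".join(manuscript_keywords),
    "{section_name}": section_name,
}
for placeholder, value in replacements.items():
    prompt = prompt.replace(placeholder, value)
print(prompt)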
Current status (5/6/2024):
I followed the procedure described in this comment.
PhenoPLIER:
pr37_test-prompts_are_used (uses the sentinel values that @falquaddoomi wrote for the unit tests).
Local tests: 🟩 PASSED
GitHub Actions tests: 🟩 PASSED
FA: Note that this test may not always pass since it involves telling the LLM to include a specific word in the text, and since LLMs are stochastic it may just not do what we ask. Anyway, this time it worked! All the words we asked for were added to the text. Also note that 04.00.results.md and 04.05.00.results_framework.md are not updated since they're listed in the ignore section in ai_revision-config.yaml.
pr37_test-both_prompts_config

Local tests: 🟥 FAILED
- {title} is incorrectly replaced by "Debugging Title"
- {keywords} is incorrectly replaced by "debugging, keywords".
- {section_name} is correctly replaced by introduction in the introduction and discussion in the discussion

FA: This issue occurs because manubot doesn't pass the title/keywords from the manuscript unless the model is exactly GPT3CompletionModel, and since this test uses DebuggingManuscriptRevisionModel it doesn't match and thus never gets passed the correct title/keywords for the manuscript. This is an issue with manubot, not manubot-ai-editor, that's fixed in PR https://github.com/manubot/manubot/pull/370. While that PR has been merged, a new release hasn't been made yet, so running pip install --upgrade manubot[ai-rev] per the testing directions will still get you the old, broken version.
GitHub Actions tests: 🟨 PASSED?
FA: The action runs to completion, which is why I marked it 'PASSED?', but I can't verify from the output that it's using the title and keywords from the manuscript manifest. I assume it is getting the right arguments, since it's using GPT3CompletionModel, and manubot is hardcoded to pass the title and keywords to that model, but I can't be sure.
CCC:
Manubot AI Editor:
Thank you for the testing results, @falquaddoomi. You followed this testing procedure, right? (Your link to the comment does not work).
Some comments for the tests on the pr37_test-both_prompts_config branch:
- Local tests: does it work if you upgrade Manubot by changing this line? If it does, we can upgrade to the new Manubot version in all manuscripts.
- GitHub Actions tests: I see you created a new branch for the PhenoPLIER manuscript (ending with -v2). Sorry if I'm missing something: is there any reason to use it? Can we use the original branch (pr37_test-both_prompts_config)? When we run the action again, the PR gets overwritten.

Thank you for the testing results, @falquaddoomi. You followed this testing procedure, right? (Your link to the comment does not work).
Sorry about that; somehow the PR number got dropped from the link, but I updated it and verified that the link is now working. Anyway, yes, I did follow your testing procedure from the now corrected link.
Some comments for the tests on the pr37_test-both_prompts_config branch:
- Local tests: does it work if you upgrade Manubot by changing this line? If it does, we can upgrade to the new Manubot version in all manuscripts.
No, that line you linked to references a commit that's older than the fix I introduced at commit https://github.com/manubot/manubot/commit/06fcecb4f9eaf9c62f804e0213e9a905c82725e4. You'd have to update that line to the following:
- git+https://github.com/manubot/manubot@06fcecb4f9eaf9c62f804e0213e9a905c82725e4
I've tested the phenoplier-manuscript-test repo on the branch pr37_test-both_prompts_config and verified that, with the commit you'd mentioned, it produces the default "Debugging Title" and "debugging, keywords" values in the text. With the later commit I mentioned, it produces the correct title and keywords as taken from the manuscript manifest.
- GitHub Actions tests: I see you created a new branch for the PhenoPLIER manuscript (ending with -v2). Sorry if I'm missing something: is there any reason to use it? Can we use the original branch (pr37_test-both_prompts_config)? When we run the action again, the PR gets overwritten.
Well, I didn't want to overwrite your existing PRs so that you could compare the two. Sure, though, I'll just go ahead and run it without the suffix, no problem.
I've tested the phenoplier-manuscript-test repo on the branch pr37_test-both_prompts_config and verified that, with the commit you'd mentioned, it produces the default "Debugging Title" and "debugging, keywords" values in the text. With the later commit I mentioned, it produces the correct title and keywords as taken from the manuscript manifest.
Perfect! Thank you for checking that. Do you want to create a PR on the three manuscript repos so we update the Manubot version on them? Once we merge this change to use the fixed Manubot version, we can update the branches (with the tests) on each manuscript repository so all these local tests would pass.
Well, I didn't want to overwrite your existing PRs so that you could compare the two. Sure, though, I'll just go ahead and run it without the suffix, no problem.
Ah, of course, I see. No problem with overriding these PRs, we can keep overriding them until the tests pass.
Do you want to finish the other GitHub Actions tests on the PhenoPLIER manuscript test repository? Maybe we can first fix the manubot version (comment above), and then create one branch for each of the unit tests that you wrote in the folder tests/config_loader_fixtures. I tried to document this in the testing procedure comment, where I put a list of branch names based on your unit tests, and also on other ideas where we use a simple proofreading prompt. I think we should first get a polished set of evaluation criteria for one manuscript, and when we are clear on that, we can move on to the rest of the manuscripts for testing. We should also write the testing procedure and the testing results in the wiki or somewhere more convenient than the comments here. What do you think?
Perfect! Thank you for checking that. Do you want to create a PR on the three manuscript repos so we update the Manubot version on them? Once we merge this change to use the fixed Manubot version, we can update the branches (with the tests) on each manuscript repository so all these local tests would pass.
Sure thing. I created the PRs (below), but then I realized that you didn't specify which branch/repo you wanted them created against; I assumed it was the *-test repos and against the main branch, so that you could then rebase the test branches onto main. If that's not correct, let me know and I can change them to target a different repo/branch.
Ah, of course, I see. No problem with overriding these PRs, we can keep overriding them until the tests pass.
Actually, I see your point about not creating new branches per test run; otherwise the repo will be littered with these branches, and you can always see the history in the single branch per test. I'm going to go ahead and delete the -v2 branches, and we'll just reuse the existing ones as you suggested.
Do you want to finish the other GitHub Actions tests on the PhenoPLIER manuscript test repository? Maybe we can first fix the manubot version (comment above), and then create one branch for each of the unit tests that you wrote in the folder tests/config_loader_fixtures. I tried to document this in the testing procedure comment, where I put a list of branch names based on your unit tests, and also on other ideas where we use a simple proofreading prompt.
Sure, I can run the remaining GitHub Actions. I'll wait for those PRs above to get merged, rebase each of the testing branches onto main when that's done, and then run the actions.
On a side note, I noticed that the phenoplier manuscript test repo only has two of the branches you mentioned, pr37_test-prompts_are_used and pr37_test-both_prompts_config, and the other two manuscript repos don't have any of them. Perhaps I'm misunderstanding that list; I assumed it was a list of branches that existed, not ones you were considering creating. If it's the latter, I can go ahead and create them from the unit tests you referenced. I'm unclear on what the test would look like for a "simple proofreading prompt", so if you could supply that I'd appreciate it.
I think we should first get a polished set of evaluation criteria for one manuscript, and when we are clear on that, we can move to the rest of the manuscripts for testing.
I think that sounds reasonable, although if it's ok with you I'd prefer that you create these evaluation criteria, since I assume you have a better idea of what you intended with the spec and the larger context in which manubot-ai-editor is used. (For example, I had no idea that manubot was checking explicitly for GPT3CompletionModel and not passing the title and keyword arguments if it wasn't exactly that.) If it's not feasible for you to do it, just let me know and I can write them up, no worries.
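To make that parenthetical concrete, the behavior I'm describing amounts to something like the following sketch (illustrative only, not manubot's actual code; names are assumptions):

# manubot only forwards the manuscript metadata when the selected model type is
# exactly GPT3CompletionModel; any other model (e.g., DebuggingManuscriptRevisionModel)
# gets no title/keywords, so the placeholders fall back to debugging defaults
def build_model_kwargs(model_type: str, title: str, keywords: list) -> dict:
    if model_type == "GPT3CompletionModel":
        return {"title": title, "keywords": keywords}
    return {}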
We should also write the testing procedure and the testing results in the wiki or something more convenient than the comments here. What do you think?
I totally agree on moving the testing procedure from a comment in the middle of our discussion to either the wiki, or perhaps to the description of this PR so we don't have to search for it. Regarding the testing results, my personal opinion is that they're part of the discussion relating to this PR so I wouldn't mind continuing to post them as comments, but I'm open to moving them somewhere more convenient for you if this is getting unwieldy.
Sure thing. I created the PRs (below), but then I realized that you didn't specify which branch/repo you wanted them created against; I assumed it was the *-test repos and against the main branch, so that you could then rebase the test branches onto main. If that's not correct, let me know and I can change them to target a different repo/branch.
Perfect. I approved all the PRs on these manuscripts, so you can merge them and then rebase the test branches.
Ah, of course, I see. No problem with overriding these PRs, we can keep overriding them until the tests pass.
Actually, I see your point about not creating new branches per test run; otherwise the repo will be littered with these branches, and you can always see the history in the single branch per test.
Exactly.
I'm going to go ahead and delete the -v2 branches, and we'll just reuse the existing ones as you suggested.
Perfect.
Do you want to finish the other GitHub Actions tests on the PhenoPLIER manuscript test repository? Maybe we can first fix the manubot version (comment above), and then create one branch for each of the unit tests that you wrote in the folder tests/config_loader_fixtures. I tried to document this in the testing procedure comment, where I put a list of branch names based on your unit tests, and also on other ideas where we use a simple proofreading prompt.

Sure, I can run the remaining GitHub Actions. I'll wait for those PRs above to get merged, rebase each of the testing branches onto main when that's done, and then run the actions.
👍🏻
On a side note, I noticed that the phenoplier manuscript test repo only has two of the branches you mentioned, pr37_test-prompts_are_used and pr37_test-both_prompts_config, and the other two manuscript repos don't have any of them. Perhaps I'm misunderstanding that list; I assumed it was a list of branches that existed, not ones you were considering creating. If it's the latter, I can go ahead and create them from the unit tests you referenced.
Yes, that's what I meant: if you can create those branches based on your unit tests, that would be great. Sorry for the confusion. Feel free to assess whether we want to add all the unit tests or just a subset. Since testing on GitHub Actions is more time-consuming for us, it would make sense to include only the tests relevant to this context (GitHub Actions), since the rest are already covered by the unit tests.
I'm unclear on what the test would look like for a "simple proofreading prompt", so if you could supply that I'd appreciate it.
Yes, absolutely. I'll add these after you merge your PRs (with the manubot version update) into the manuscripts.
I think we should first get a polished set of evaluation criteria for one manuscript, and when we are clear on that, we can move to the rest of the manuscripts for testing.
I think that sounds reasonable, although if it's ok with you I'd prefer that you create these evaluation criteria, since I assume you have a better idea of what you intended with the spec and the larger context in which manubot-ai-editor is used. (For example, I had no idea that manubot was checking explicitly for GPT3CompletionModel and not passing the title and keyword arguments if it wasn't exactly that.) If it's not feasible for you to do it, just let me know and I can write them up, no worries.
Yes, I can work on the criteria for testing and then we can review it together. Once we feel we are covering all the functionality that we want to cover, we can move on to the other manuscripts.
We should also write the testing procedure and the testing results in the wiki or something more convenient than the comments here. What do you think?
I totally agree on moving the testing procedure from a comment in the middle of our discussion to either the wiki, or perhaps to the description of this PR so we don't have to search for it. Regarding the testing results, my personal opinion is that they're part of the discussion relating to this PR so I wouldn't mind continuing to post them as comments, but I'm open to moving them somewhere more convenient for you if this is getting unwieldy.
Perfect, I like that. Do you want to move the testing procedure to the PR description? Looks like everyone can edit the PR description, so that's great. I would add, at the top, a "history of changes" (including dates) that gets updated when major changes are included. Regarding the testing results, I also like the idea of having them posted as comments, but I would probably keep a single one, also with a history of changes (including dates).
This PR addresses #31. Specifically, it adds support for parsing two new configuration files, ai_revision-config.yaml and ai_revision-prompts.yaml, which specify how filenames are mapped to prompts and what the text of those prompts is.

In the following, I refer to determining the prompt for a given piece of text as "prompt resolution". The current code supports the following mechanisms for prompt resolution:
- the AI_EDITOR_CUSTOM_PROMPT environment variable; if specified, this text is used as-is as the prompt.
- prompt resolution based on the inferred section name, implemented in models.GPT3CompletionModel.get_prompt().

This PR adds a YAML-based filename-to-prompt resolution mechanism, detailed in issue #31. The mechanism is implemented in the class prompt_config.ManuscriptPromptConfig.

Usage:
Once instantiated, an instance of the ManuscriptPromptConfig class can be used to resolve a prompt based on a filename via the ManuscriptPromptConfig.get_prompt_for_filename() method. The method takes filename as an argument, which it then uses to consult its configuration for a matching prompt. The method returns a tuple like (prompt : str|None, match : re.Match|None), where:
- prompt is the prompt text matching the filename, or None if the file was explicitly ignored in the config or if no match was found, and
- match is the resulting re.Match instance from when the filename was applied to a regex. In cases where a regex wasn't applied (e.g., if we're returning the default prompt), this value is None.

The optional use_default argument (True by default) allows the method to return a default prompt specified in the config files if no match was found. If it's False and no match was found, (None, None) is returned.
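To make the return values concrete, here is a minimal usage sketch; the constructor call is an assumption (I'm guessing it takes the directory containing the ai_revision-*.yaml files), while get_prompt_for_filename() and use_default behave as described above:

from manubot_ai_editor.prompt_config import ManuscriptPromptConfig

# sketch: the constructor arguments are assumptions, not the exact signature
config = ManuscriptPromptConfig("content/")

# resolve a prompt for a manuscript file
prompt, match = config.get_prompt_for_filename("02.introduction.md")
if prompt is None:
    # the file was explicitly ignored, or nothing matched
    print("no prompt resolved")
else:
    print(prompt)
    if match is not None:
        print("matched via regex:", match.re.pattern)

# disable the default-prompt fallback: (None, None) is returned if nothing matches
prompt, match = config.get_prompt_for_filename("99.unknown.md", use_default=False)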
Integration into the existing code:

An instance of this class is created in ManuscriptEditor's constructor as self.prompt_config. This instance is used in ManuscriptEditor.revise_manuscript() to resolve a prompt for each filename in the manuscript.

The resolved prompt is passed down all the way to each model's get_prompt() method. In GPT3CompletionModel.get_prompt(), after checking for a custom prompt, the resolved prompt is then used. If the resolved prompt is falsey, prompt resolution occurs as it did before (i.e., the section name is inferred and used to find a prompt and, failing that, a default prompt is used).

History of changes
Testing Procedure
Here, I describe the live testing procedure for this feature using GitHub Actions and the ai-revision workflow. This is the typical use case for the Manubot AI Editor.

Setting up the Manubot AI Editor
These steps create the conda environment, install Manubot, and override the Manubot AI Editor package using this branch's version.
Export your OpenAI API key:
Unit tests
Run all the unit tests:
All unit tests should pass.
Manuscripts used for testing
I've tested this new version of the Manubot AI Editor using the original three manuscripts we used in the preprint. I forked the original manuscript repositories here:
For each of these repositories, I've created different branches that reflect the different test cases present in the folder tests/config_loader_fixtures:
- pr37_test-both_prompts_config
- pr37_test-conflicting_promptsfiles_matchings
- pr37_test-only_revision_prompts
- pr37_test-single_generic_prompt (proofreading prompt)
- pr37_test-prompts_are_used (checks that prompts are used in each section; similar to test_prompt_config.py/test_prompts_apply_gpt3, which adds a sentinel value to the paragraph)

For convenience, here are the git commands to clone each manuscript repo:
Also make sure that there is a repository secret named OPENAI_API_KEY with your API key.

Local tests: DebuggingManuscriptRevisionModel
Before hitting the OpenAI API, I ran a local revision model (DebuggingManuscriptRevisionModel) to ensure the right parameters, such as the prompt, are being used for each paragraph.

First, clone one of the manuscripts above, such as PhenoPLIER, and check out one of the test-case branches.
Run using a local model for debugging:
Then, I checked that {section_name}, {title}, and {keywords} are correctly replaced.

GitHub Actions tests (uses GPT3CompletionModel)
Setting up workflow
Make sure each manuscript's ai-revision workflow is installing the manubot-ai-editor version to be tested. For this, open .github/workflows/ai-revision.yaml and make sure the last line here is present:

Kicking off workflow
For the live tests using the OpenAI models:
- Open the ai-revision workflow.
- Click the Run workflow combobox, and enter this information in each field:
  - main
  - <BRANCH_TO_REVISE> (such as pr37_test-both_prompts_config)
  - ai-revision-<BRANCH_TO_REVISE>
- Click Run workflow and wait for it to finish.

Assertions

- Click on the ai-revision workflow run and then on AI Revise:
  - In the step Install Manubot AI revision dependencies, make sure manubot-ai-editor-0.5.0 was installed.