Add support for custom prompts and files metadata via YAML

General

The status of this issue is work-in-progress (will be discussed in our next progress update meeting).
If you have any comments on this new functionality, feel free to comment on this issue.
Lines starting with comment: below represent internal comments for discussion with the software engineering team.

Problem

Currently, the Manubot AI Editor offers a fixed set of section-specific prompts for advanced manuscript revision. This set of prompts is automatically generated using the manuscript title, its keywords, and the section the text belongs to. However, these prompts are fixed and carry specific instructions to improve the text by following guidelines that might not be the ones a user expects. For example, the GitHub user @dhimmel tried to use our tool on one manuscript but reported aggressive rewriting, whereas he only needed basic copyediting (typos, grammar issues, etc.) and "shortening of select sections, possibly with custom prompts."

Proposed solution

Add two files that allow users to 1) write custom prompts (this file is easily sharable with other users) and 2) define how prompts are applied to manuscript files (this file is specific to the repository and not intended to be shared). Both files are placed in the root folder of the manuscript repository.

`ai_revision-prompts.yaml`

This file is a YAML file.
This file has the custom prompts.
The prompts defined here can access different pieces of information/metadata about the manuscript.
This file is easily sharable with the community, so it doesn't have any manuscript/repository-specific information.

The file has the following structure:

# Potential future feature: variables and templating can be defined here (YAML anchors, etc).

# if we use "prompts_files" as the top-level key, the prompt names are interpreted as regexes for file matching
# if we use "prompts" as the top-level key, they are meant to be referenced from the config file
prompts_files:
  prompt_name: |
      Prompt content that can access the {manuscript.title} or the {manuscript.keywords}
  another_prompt_name: |
      Another prompt definition that does not access any manuscript's metadata.
  \.md$: |
    This would be a default prompt.

Notes:

Variables and templating are a work-in-progress feature and are not included in this iteration. They might come for free using YAML's anchors, but we are not going to test that now.
Prompt names also act as regexes that can match file names. This is intended to make prompts more shareable without additional configuration. This feature is assessed per prompt and enabled only if a prompt goes unused in ai_revision-config.yaml (or if that file does not exist). If the feature is enabled for a prompt, the prompt is automatically applied to files whose names match the prompt_name regex. For example, a prompt named abstract will apply to all files containing abstract in their names.
Each paragraph in the manuscript is always revised by only one prompt (or not revised at all if no default prompt is provided).
Referencing {manuscript.title} returns a string with the manuscript's title.
Referencing {manuscript.keywords} returns a string with keywords separated by , (comma + space), such as keyword1, keyword2, keyword3.
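
The placeholder behaviour described in the notes above can be sketched with Python's built-in `str.format`, which supports attribute lookups such as `{manuscript.title}`. This is only a minimal sketch under assumptions, not the tool's implementation: `render_prompt` and the `SimpleNamespace` stand-in for the manuscript metadata are hypothetical, and the title and keywords are made up.

```python
from types import SimpleNamespace


def render_prompt(template, title, keywords):
    """Fill {manuscript.title} and {manuscript.keywords} in a prompt template.

    Hypothetical helper: keywords are joined with ", " (comma + space).
    """
    manuscript = SimpleNamespace(title=title, keywords=", ".join(keywords))
    return template.format(manuscript=manuscript)


template = (
    "Revise the following paragraph of a paper titled '{manuscript.title}' "
    "with keywords '{manuscript.keywords}'."
)
rendered = render_prompt(
    template, "An example title", ["keyword1", "keyword2", "keyword3"]
)
print(rendered)
```

Note that `str.format` only resolves attribute and index lookups, so a placeholder containing a method call, such as `{file.section.capitalize()}`, would need a different templating mechanism or a precomputed value.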

`ai_revision-config.yaml`

In this issue, this file will specify how prompts (defined in ai_revision-prompts.yaml) are applied to files.
In the future, this file is intended to contain other configuration entries for the AI Revision workflow.

The file has the following structure:

files:
  matchings:
    # in-order list for matching. for each file, find the first entry that matches file(s) and
    #  apply prompt(s).
    - files:
        # always interpreted as regex
        - abstract
        - 04\..*-supplement\.md
      prompt: prompt_name

  # default prompt for files not matched in list above. can also be omitted for no
  #  fallback (file is ignored). also, regex matching above can accommodate
  #  "quasi-defaults" for higher-level-granularity distinctions (maybe like .md files
  #  and .txt files?), i.e. patterns that match many but not all files.
  default_prompt: some_fallback_prompt

  # file(s) to ignore (not revise). overrides `default_prompt` and `matchings`.
  ignore:
    - data
    - quote-that-shouldnt-be-revised
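
The resolution order implied by the comments in this structure (ignore first, then the in-order matchings, then the default) might look roughly like the following. This is a sketch under assumptions: `resolve_prompt` is a hypothetical name, the config is shown as an already-parsed dictionary, and patterns are assumed to be applied with `re.search` so that they can match anywhere in the file name.

```python
import re

# Mirrors the structure above as a parsed dictionary (values are illustrative).
config = {
    "files": {
        "matchings": [
            {"files": ["abstract", r"04\..*-supplement\.md"], "prompt": "prompt_name"},
        ],
        "default_prompt": "some_fallback_prompt",
        "ignore": ["data", "quote-that-shouldnt-be-revised"],
    }
}


def resolve_prompt(filename, config):
    """Return the prompt name for a file, or None if it should not be revised."""
    files_cfg = config.get("files", {})
    # `ignore` overrides both `matchings` and `default_prompt`.
    if any(re.search(p, filename) for p in files_cfg.get("ignore", [])):
        return None
    # In-order list: the first entry with a matching pattern wins.
    for entry in files_cfg.get("matchings", []):
        if any(re.search(p, filename) for p in entry["files"]):
            return entry["prompt"]
    # Fall back to the default prompt; None if it was omitted.
    return files_cfg.get("default_prompt")


for name in ["01.abstract.md", "02.introduction.md", "metadata.yaml"]:
    print(name, "->", resolve_prompt(name, config))
```

One consequence worth noticing in this sketch: because patterns are unanchored, the `data` ignore entry also catches `metadata.yaml`.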

Full examples

Only `ai_revision-prompts.yaml` is defined

Example based on the PhenoPLIER manuscript repository.
File names here differ from those in the original manuscript to accommodate this case (no ai_revision-config.yaml file).

Files under content/ folder (file names modified from the original manuscript):

final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results.framework.md
04.05.01.results.crispr.md
04.15.results.drug_disease_prediction.md
04.20.00.results.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml

`ai_revision-prompts.yaml`

prompts_files:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  introduction|discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      most references to figures and tables are kept,
      the details are enough to clearly explain the outcomes,
      sentences are concise and to the point,
      the text minimizes the use of jargon,
      the text grammar is correct,
      spelling errors are fixed,
      and the text has a clear sentence structure

  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
       most of the citations to other academic papers are kept,
       most of the technical details are kept,
       most references to equations (such as "Equation (@id)") are kept,
       all equations definitions (such as '*equation_definition') are included with newlines before and after,
       the most important symbols in equations are defined,
       the text grammar is correct,
       spelling errors are fixed,
       and the text has a clear sentence structure

  references: null

  \.md$: |
    Proofread the following paragraph

Notes:

Note we use prompts_files as the top-level key name.
The same prompt is used for files that contain the introduction or discussion sections.
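
To make the file-to-prompt mapping in this example concrete, here is a minimal sketch of how prompt names could be matched against file names. It assumes that entries are tried in the order they appear in the file, that the first match wins, and that matching uses `re.search`; none of this is confirmed behaviour, and `match_prompt` is a hypothetical helper. Prompt bodies are shortened because only the names matter for matching.

```python
import re

# Prompt names double as regexes; None stands for a YAML `null` (do not revise).
prompts_files = {
    "abstract": "Revise the following paragraph from the Abstract ...",
    "introduction|discussion": "Revise the following paragraph from the ...",
    "results": "Revise the following paragraph from the Results section ...",
    "methods": "Revise the paragraph(s) below from the Methods section ...",
    "references": None,
    r"\.md$": "Proofread the following paragraph",
}


def match_prompt(filename, prompts):
    """Return (prompt_name, prompt_text) for the first name that matches.

    Returns (None, None) when no prompt name matches the file name.
    """
    for name, text in prompts.items():
        if re.search(name, filename):
            return name, text
    return None, None


for filename in [
    "01.abstract.md",
    "04.05.01.results.crispr.md",
    "10.references.md",
    "15.acknowledgements.md",
    "metadata.yaml",
]:
    name, text = match_prompt(filename, prompts_files)
    action = "not revised" if text is None else f"revised with {name!r}"
    print(f"{filename}: {action}")
```

Under these assumptions, `references: null` has to appear before the `\.md$` entry; otherwise `10.references.md` would fall through to the proofreading prompt.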

`ai_revision-config.yaml`

This file does not exist in this example.

Both `ai_revision-prompts.yaml` and `ai_revision-config.yaml` are defined

This example follows exactly the same file names in the PhenoPLIER manuscript repository.
The matching between prompts and files should be exactly the same as in the previous example, although here, we manually specify all matchings using the ai_revision-config.yaml file.

Files under content/ folder:

final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results_framework.md
04.05.01.crispr.md
04.15.drug_disease_prediction.md
04.20.00.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml

`ai_revision-prompts.yaml`

prompts:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  introduction_discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      most references to figures and tables are kept,
      the details are enough to clearly explain the outcomes,
      sentences are concise and to the point,
      the text minimizes the use of jargon,
      the text grammar is correct,
      spelling errors are fixed,
      and the text has a clear sentence structure

  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
       most of the citations to other academic papers are kept,
       most of the technical details are kept,
       most references to equations (such as "Equation (@id)") are kept,
       all equations definitions (such as '*equation_definition') are included with newlines before and after,
       the most important symbols in equations are defined,
       the text grammar is correct,
       spelling errors are fixed,
       and the text has a clear sentence structure

  default: |
    Proofread the following paragraph

Notes:

Note we use prompts as the top-level key name since prompts will be referenced from the config file below.

`ai_revision-config.yaml`

files:
  matchings:
    - files:
        - abstract
      prompt: abstract
    - files:
        - introduction
      prompt: introduction_discussion
    - files:
        - 04\..+\.md
      prompt: results
    - files:
        - discussion
      prompt: introduction_discussion
    - files:
        - methods
      prompt: methods

  default_prompt: default

  ignore:
    - front\-matter
    - acknowledgements
    - supplementary_material
    - references

Notes:

This example is too verbose, and it shows clearly that having prompt names that can also be used as regexes for file matching in ai_revision-prompts.yaml (suggested by @vincerubinetti) is really convenient.
This example could be converted easily to a mix between "prompts matching file names" and "files that need specific prompts matching" (like for the Results section where not all files have the "results" in their names).

Only a single, generic prompt is defined

This example follows exactly the same file names in Daniel's article on connectivity search.
Daniel only wanted to proofread the manuscript, not use section-specific prompts.

Files under content/ folder:

images/
media/
00.front-matter.md
01.abstract.md
05.main-text.md
90.back-matter.md
manual-references-2023-04-06.json
manual-references.yaml
metadata.yaml
response-to-reviewers.md

`ai_revision-prompts.yaml`

prompts:
  \.md$: |
    Proofread the following paragraph

`ai_revision-config.yaml`

files:
  ignore:
    - front\-matter
    - back\-matter
    - response\-to\-reviewers

Notes:

This example could be written using only the ai_revision-prompts.yaml file with prompts_files as the top-level key instead of prompts and adding one "empty prompt" for each of the ignore list entries (front\-matter: null, etc).
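
The conversion described in this note could be sketched as follows. It is an illustration only: `merge_ignore_into_prompts` is a hypothetical helper, and it assumes first-match-wins ordering, which is why the ignore patterns are placed ahead of the broader prompt patterns.

```python
def merge_ignore_into_prompts(prompts, ignore):
    """Build a `prompts_files`-style mapping from prompts plus an ignore list.

    Ignore patterns are inserted first with a None (YAML null) prompt so that,
    assuming first-match-wins ordering, they take precedence over broader
    patterns such as the catch-all Markdown one.
    """
    merged = {pattern: None for pattern in ignore}
    merged.update(prompts)
    return {"prompts_files": merged}


prompts = {r"\.md$": "Proofread the following paragraph"}
ignore = [r"front\-matter", r"back\-matter", r"response\-to\-reviewers"]
result = merge_ignore_into_prompts(prompts, ignore)

for pattern, prompt in result["prompts_files"].items():
    print(f"{pattern}: {prompt}")
```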

Testing

New/updated unit tests that focus on the parsing of the new files and the correct revision of manuscript files.
- our unit tests currently have mock models that "revise" a paragraph by returning the same paragraph, randomly swapping characters, etc, that could be used.
Fork existing Manubot-based manuscript to perform global testing (triggering the ai_revision workflow from the GitHub interface as a user would do). We could also ask for feedback from the manuscript's authors.
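
As a rough illustration of how mock models could support these unit tests, the sketch below uses two hypothetical classes (`IdentityModel` and `RecordingModel`) and a hypothetical `revise_file` helper. These are not the mock models that already exist in the code base; they only show the idea of checking which prompt reached which paragraph without calling a real model.

```python
class IdentityModel:
    """Hypothetical mock model: returns each paragraph unchanged."""

    def revise(self, paragraph, prompt):
        return paragraph


class RecordingModel(IdentityModel):
    """Hypothetical mock that records which prompt each paragraph received."""

    def __init__(self):
        self.calls = []

    def revise(self, paragraph, prompt):
        self.calls.append((prompt, paragraph))
        return paragraph


def revise_file(paragraphs, prompt, model):
    """Revise paragraphs with one prompt; a None prompt leaves them untouched."""
    if prompt is None:
        return list(paragraphs)
    return [model.revise(p, prompt) for p in paragraphs]


model = RecordingModel()
revised = revise_file(["First paragraph.", "Second paragraph."], "Proofread", model)
skipped = revise_file(["Reference list."], None, model)
print(model.calls)
```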

The top level prompts: and sections: seem unnecessary since they should be clear from the name of the file? I guess if you want to do the advanced templating/variables, it's necessary for the prompts file.

Should section in ai_revision-prompts.yaml be sections and accept a list? Seems like it could be useful. Also, if it's only ever just one section, you might as well just remove section, and have the upper level key e.g. prompt_name_1 simply specify prompt type. At that point prompt_name_1 would only serve to be a descriptive name of the prompt, which is valuable, except that that name isn't being used anywhere else and a yaml comment could suffice. It kind of looks like that's what you're doing in the second half of your issue, so I'm not sure which is the actual proposed solution.

section: false # this has the same effect as omitting section

I think requiring this to be the string "default", instead of omitting or setting to false, would be better for clarity. Again it looks like you're doing that in the second half of your issue.

section_name_0 might be more clear as section_type_0, and likewise, section(s) could be explicitly section_types. This might make it more clear there's sort of an intermediate "mapping" stage done via an abstract "type of section", and that these things do not (always) map directly to filenames. Not trying to be pedantic, but making sure to use distinct terminology here might help reduce confusion.

disable_revision: true. I get why you have this, because you want files not listed to be sent to the default prompt. But it feels a little weird to say "disable revision for these files, but still here's the prompt name/section type it would be under if it was enabled (because it's required for the schema)". Maybe section_name_2 could be a special ignore/bypass/disable keyword for such things.

The example above intentionally mixes the use of variables in YAML and the entire prompt without variables. The latter is clearer, but the former allows the reuse of instruction blocks.

Can both ways be supported, and leave this up to the user? I'd certainly put the simpler, explicit way up front, then in a separate "advanced" section of the documentation, give examples of using variables.

Also is this the same as YAML anchors? It looks like similar syntax with the & and *, but not quite the same as what I've seen before. If we could leverage established YAML syntax and features for this that would be great. But it looks like you're doing some more advanced templating that I'm pretty sure isn't in YAML.

Maybe metadata.yaml could be something more specific and descriptive, like sections (if the top level key is going to be just that), or assignment or map?

If the section name only contains a string (like discussion), then it is interpreted as the file name under content/.

Is there supposed to be a companion sentence to this regarding lists or regexes? Because it seems like section can only ever be a string?

files can be either a list or a string. If it is a list, then it contains strings with file names under content/. If it is a string, it's interpreted as a regular expression.

Maybe this could just always be a regex? I could see wanting a list of regexes. This would make it more of a pain to explicitly match filenames fully like 01.something.md because you'd have to escape the periods, but also how often is that needed. That is, you could just write "abstract" to match 01.abstract.md since there's probably not gonna be a conflict like 01.abstract-reprise.md.

Files that are not selected in this file are not revised by the tool.

Then what is the point of disable_revision: true? I'm thinking that maybe the two halves of your issue are actually separate proposals/solutions, but I wasn't completely clear on that from the language.

This would also make section: introduction in the first file kind of pointless too... I guess you intended that key to be for when users share configs, the prompts have certain filenames that they get applied to by default?

Based on the above, here's my proposal. This does not tackle the templating/variable problem.

prompts.yaml:

# variables and templating stuff

prompts:
  # if this prompt goes unused in sections.yaml (or there's no sections.yaml at all), automatically apply this to filenames matching the prompt_name
  prompt_name: |
      Prompt content with {templating} and #such.

Not sure about the auto-applying "magic". Makes prompts more shareable without additional configuration. But might lead to some people being confused at unexpected revision? Then again, you have to explicitly opt-in to using this AI-revision in the first place, and you'd be manually pasting the prompt in there, so that's kind of giving clear intention/consent. Alternative: just say this should be a descriptive name and/or have a YAML comment to just suggest to other users which types of sections it should be applied to.

sections.yaml:

sections:
  # in-order list for matching. for each file, find the first entry that matches file(s) and apply prompt(s).
  - files:
      # always interpreted as regex
      - abstract
      - 04\..*-supplement\.md
    prompts:
      - prompt_name

# default prompt(s) for files not matched in list above. can also be omitted for no fallback. also, regex matching above can accommodate "quasi-defaults" for higher-level-granularity distinctions (maybe like .md files and .txt files?), i.e. patterns that match many but not all files.
default:
  - some_fallback_prompt

# file(s) to ignore (not revise). overrides `default` and `sections`.
ignore:
  - data
  - quote-that-shouldnt-be-revised

Lists of strings in the structures above can also just be single strings.

Great stuff! Adding a few thoughts below. Some of these may be outside the scope of this issue directly but I felt might spur conversation towards developments here.

Overall feedback:

Consider using a YAML schema of some kind to help validate the structure of these configuration files. Doing this might allow developers and users to more quickly detect or troubleshoot errors with the formatting. Depending on how the structure evolves, jsonschema might be able to be used (recognizing certain limitations over YAML). YAML Schema may also be a possibility here.
Related to the above, but possibly a separate issue, consider adding or using yamllint to enable early static detection of issues in general formatting of YAML files.
Would there ever be a need to account for multiple prompts on one section to build greater context (and increased feedback)? If so, does the proposed format meet this capability? I liked @vincerubinetti restructuring to prompts: becoming a list. This might bring up more questions about how to finalize results provided from an LLM (would it be helpful to see the "discussion" as a cognitive layer of improvement suggestions to the user?).
How might figures and other media factor into this work? Would it make sense for them to be included as explicit elements of focus within the prompts or sections? Currently I'm wondering whether figures could need descriptions and alignment to strengthen the other written content in manuscripts. Perhaps as a future focus when/if multi-modal interpretation capabilities mature in AI assistant offerings, it could make sense to consider prompts like "Provide feedback on whether the figures are understandable and help prove the content found within this paragraph.". It also may make sense to include this from an accessibility standpoint (checking on things like color contrast, how/whether text description encompasses the figure, etc.). It could be that this is an added capability of this project via OCR tooling like tesseract with text extracted from figures becoming a way an LLM could "understand" and contribute thoughts about the media.

Feedback on `ai_revision-prompts.yaml`:

Would it make sense to decouple variable definitions at the top of this file to another file? Doing this might make the variable definitions themselves more flexible/extendable based on someone's desires. It also might allow for runtime flexibility, depending on the design implications (for example, if operationally we always would add certain extra variables in addition to what the user specified). I'm imagining something like Ansible's use of adding extra variables through direct implementation or within files (mostly within another file).
While I was able to find some documentation covering the use of &variable_definition and *variable_usage I wasn't certain of the exact specification here. Depending on your preference and what the audience thinks, consider making use of Jinja2 for templating aspects.
Possibly outside the scope of this issue's focus: consider adding an accessibility-focused prompt within defaults from this project. This might include things like checks for alt-text, language and region tagging, ensuring any mathematical expressions are properly formatted, etc.

Thank you, @vincerubinetti! Those are really great comments. I really liked your proposal, which is much simpler than mine (I will look at it more thoroughly shortly). Below I reply to some of your comments, but I agree with everything you said.

The top level prompts: and sections: seem unnecessary since they should be clear from the name of the file? I guess if you want to do the advanced templating/variables, it's necessary for the prompts file.

I think we should decide on a structure that won't need to be changed later, so we should try to anticipate potential future uses without adding unnecessary functionality now. If those top-level elements are necessary to use a YAML feature that is potentially useful in the future, we should keep them. Otherwise, I'm ok with removing them.

Can both ways be supported, and leave this up to the user? I'd certainly put the simpler, explicit way up front, then in a separate "advanced" section of the documentation, give examples of using variables.

Absolutely!

Also is this the same as YAML anchors? It looks like similar syntax with the & and *, but not quite the same as what I've seen before. If we could leverage established YAML syntax and features for this that would be great. But it looks like you're doing some more advanced templating that I'm pretty sure isn't in YAML.

Ah, yeah, "anchors" is the name! I used it before but didn't remember. Yes, absolutely; the idea is to use established YAML syntax, so let's try to avoid any non-standard YAML syntax.

Maybe this could just always be a regex? I could see wanting a list of regexes. This would make it more of a pain to explicitly match filenames fully like 01.something.md because you'd have to escape the periods, but also how often is that needed. That is, you could just write "abstract" to match 01.abstract.md since there's probably not gonna be a conflict like 01.abstract-reprise.md.

I agree, let's use regex for file names.

For maximum clarity for the user, I think we should avoid the per-prompt magic I had mentioned. That is, files only get matched based on prompt names in prompts.yaml if the user has no config.yaml at all. How we communicate this to the user should be: if you have no config file, the plugin basically creates one for you "under the hood" based on your prompt names.
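
A minimal sketch of that "under the hood" config, assuming the `files`/`matchings` layout from the proposal and a hypothetical helper name `implicit_config`: each prompt name is used both as the file regex and as the prompt reference.

```python
def implicit_config(prompt_names):
    """Build the config that would be assumed when no config file exists.

    Each prompt name becomes both the file regex and the prompt reference,
    in the same order as the prompts are listed.
    """
    return {
        "files": {
            "matchings": [
                {"files": [name], "prompt": name} for name in prompt_names
            ]
        }
    }


generated = implicit_config(["abstract", "introduction|discussion", r"\.md$"])
for entry in generated["files"]["matchings"]:
    print(entry)
```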

Edit: based on our discussion in the meeting, in prompts.yaml we're explicitly going to have separate prompt_files and prompt_ids fields, the former to be used when you don't have a config file, and the latter to be used to explicitly link between prompts.yaml and config.yaml.

@dhimmel @agitter Please take a look and weigh in at your convenience.

~Sorry to muddy the waters more, but I have one more thought after our meeting just now. With the changes we discussed, it's possible we could get away with only having the prompts file...~

# for each file, find first match, apply prompt, then stop
prompts:
  # matching a specific file
  some-regex.+\.md:
    some prompt content lorem ipsum

  # matching a specific file, then providing no prompt to effectively ignore it. must come before the default below so it gets matched first.
  references\.md:
    null

  # example of a "wide net" default. catches a lot of broad things but not everything.
  ".*\.md":
    proofread this paragraph

  # "ultimate" default, catches absolutely everything else 
  ".*":
    proofread this paragraph.

~The upsides:~

~Less config, cleaner.~
~A little simpler and clearer what's going on. Match filename to first matching key, then apply the value as a prompt. Not really any special logic going on, e.g. does config file exist.~

~The downsides:~

~Don't get to specify multiple file regexes in a list for a single prompt, like the current proposal. Theoretically you could combine a list of regexes into a single one for use like this, but it could end up being a monstrosity (bad ergonomics).~
~If someone ever does request that we be able to apply multiple prompts to multiple files (i.e. in my original proposal), this wouldn't support that.~
~Wouldn't support paragraph-level revision, or really any other use case where you'd need to identify which prompt you want from outside the prompts.yaml file, because how would you identify it... ? No, need a simply short name.~

Downsides are too strong, please ignore.

Edit: based on our discussion in the meeting, in prompts.yaml we're explicitly going to have separate prompt_files and prompt_ids fields, the former to be used when you don't have a config file, and the latter to be used to explicitly link between prompts.yaml and config.yaml.

Ok! @falquaddoomi feel free to update the specification with this if you want.

I think I'll take a backseat on the schema design and focus my efforts on any questions related to integrating with Manubot if they arise.

Consider using a YAML schema of some kind to help validate the structure of these configuration files.

This is a good idea (something that has been on my mind for Manubot's metadata.yaml as well). Pydantic might be a good way to define the schema and validate the prompt data. Pydantic can export to jsonschema.
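
To give a feel for what such validation would catch, here is a dependency-free sketch that uses only the standard library as a stand-in for Pydantic or jsonschema. The function name `validate_config` and the error messages are hypothetical, and the expected layout is the `files`/`matchings`/`default_prompt`/`ignore` structure from the proposal.

```python
import re


def validate_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    errors = []
    files_cfg = config.get("files")
    if not isinstance(files_cfg, dict):
        return ["'files' must be a mapping"]

    def check_patterns(patterns, where):
        # Every pattern must be a string that compiles as a regex.
        if not isinstance(patterns, list):
            errors.append(f"{where} must be a list of regex strings")
            return
        for pattern in patterns:
            if not isinstance(pattern, str):
                errors.append(f"{where}: {pattern!r} is not a string")
                continue
            try:
                re.compile(pattern)
            except re.error as exc:
                errors.append(f"{where}: invalid regex {pattern!r} ({exc})")

    matchings = files_cfg.get("matchings", [])
    if not isinstance(matchings, list):
        errors.append("files.matchings must be a list")
        matchings = []
    for i, entry in enumerate(matchings):
        where = f"files.matchings[{i}]"
        if not isinstance(entry, dict):
            errors.append(f"{where} must be a mapping")
            continue
        check_patterns(entry.get("files"), f"{where}.files")
        if not isinstance(entry.get("prompt"), str):
            errors.append(f"{where}.prompt must be a string")

    if "ignore" in files_cfg:
        check_patterns(files_cfg["ignore"], "files.ignore")
    if not isinstance(files_cfg.get("default_prompt", ""), str):
        errors.append("files.default_prompt must be a string")
    return errors


bad = {"files": {"matchings": [{"files": ["abstract("], "prompt": 3}]}}
for problem in validate_config(bad):
    print(problem)
```

Compiling each pattern up front also addresses the concern about users who are new to regular expressions, since a malformed pattern is reported before any revision runs.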

I found the initial proposal from last week fairly complex when I looked through the actual examples. The modified version from @vincerubinetti that listed prompts and then mapped prompts to files seemed simpler, but I'm still taking this all in.

One general goal of Manubot has been to make it accessible to a broader audience that may be comfortable editing content through GitHub but not have many computational skills beyond that. That audience will not necessarily be able to use regular expressions to map prompts to files even if that is the best way to control the mapping.

Is there a reason for the mixed punctuation in the file names like ai_revision-prompts.yaml?

That audience will not necessarily be able to use regular expressions to map prompts to files

Is there a reason for the mixed punctuation in the file names like ai_revision-prompts.yaml?

This is a fair point, we want to keep it simple. However, if I'm understanding you right, there doesn't really have to be mixed use-case in prompts. Simply putting a plain string e.g. abstract is still a valid regex that would match e.g. 01.abstract.md. It only becomes a pain when users want to exact-match things with dots in them, e.g. they'd have to do 01\.abstract\.md.

Right, having to escape the dots in exact filenames could surprise a less experienced user. You may still decide regex is the best way to go, in which case I suggest examples or comments that help explain the behavior.
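
A short example of the kind that could accompany the documentation, using Python's standard `re` module to show the behaviour being discussed (the file names are illustrative):

```python
import re

# A plain string is already a valid regex and matches anywhere in the name.
assert re.search("abstract", "01.abstract.md")

# Unescaped dots match any character, so this also matches an unintended name.
assert re.search("01.abstract.md", "01-abstract-md")

# re.escape produces a pattern that matches the file name literally.
exact = re.escape("01.abstract.md")
assert re.fullmatch(exact, "01.abstract.md")
assert not re.fullmatch(exact, "01-abstract-md")
print(exact)
```

In practice the unescaped form rarely causes trouble, since a conflicting file name would have to differ only in the characters where the dots are.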

Just an FYI on some related new features and choices I made for https://copyaid.it inspired by this project.

1) I've started using TOML instead of YAML for configuration files and OpenAI API request settings. It's not a huge deal, but I am finding the simplicity of TOML preferable and I think new users will too. There are gotchas with YAML where you have to double quote because some weird sequence is some YAML power feature.

2) The default prompts installed with CopyAId are for the new GPT-4 Turbo model and the source is here: https://gitlab.com/castedo/copyaid/-/tree/main/copyaid/data (the *-example.toml files). I've decided to go ahead and ditch GPT-3.5. The November gpt-3.5-turbo upgrade just seemed overall worse than the June gpt-3.5-turbo. The GPT-4 Turbo is less annoying and is more promising.

3) I'm doing "blast" testing of different prompts and models here https://gitlab.com/castedo/copyblast where I run various prompts in request settings files against various source files to see how different request settings perform.

Hey @castedo, thanks for sharing this. I'll take a look at your prompts and tests. We were researching ways of testing prompts as well and found promptfoo interesting.

Thx for the link to promptfoo, I had not seen that.

I think testing via something like promptfoo might work well for proofreading. I'm still experimenting with how I write with copyaid.it, but it feels like there is one particular kind of workflow, which is proofreading, especially pre-git-commit proofreading. Proofreading tends to have a predictable, clearly correct expected response.

Another key to proofreading is it saves response time and money to have OpenAI respond with just "OK" when there are no corrections to be made.

Maybe some testing around proofreading is something we should share. I'll keep you posted next time I'm upgrading my proofreading tests (maybe upgrading to promptfoo).
