manubot / manubot-ai-editor

BSD 3-Clause "New" or "Revised" License

Add support for custom prompts and files metadata via YAML #31

Closed miltondp closed 2 months ago

miltondp commented 1 year ago

General

Problem

Currently, the Manubot AI Editor offers a fixed set of section-specific prompts for advanced manuscript revision. This set of prompts is generated automatically from the manuscript title, its keywords, and the section the text belongs to. However, the prompts are fixed and carry specific instructions to improve the text following guidelines that might not be the ones a user expects. For example, the GitHub user @dhimmel tried our tool on one manuscript and reported aggressive rewriting, whereas he only needed basic copyediting (typos, grammar issues, etc.) and "shortening of select sections, possibly with custom prompts."

Proposed solution

Add two files that allow users to 1) write custom prompts (this file is easily sharable with other users) and 2) define how prompts are applied to manuscript files (this file is specific to the repository and not intended to be shared). Both files are placed in the root folder of the manuscript repository.

ai_revision-prompts.yaml

The file has the following structure:

# Potential future feature: variables and templating can be defined here (YAML anchors, etc).

# if we use "prompts_files" as the top-level key, the prompt names are interpreted as regex for file matching
# if we use "prompts" as the top-level key, they are meant to be referenced from the config file
prompts_files:
  prompt_name: |
      Prompt content that can access the {manuscript.title} or the {manuscript.keywords}
  another_prompt_name: |
      Another prompt definition that does not access any manuscript's metadata.
  \.md$: |
    This would be a default prompt.
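
The issue does not say how placeholders such as {manuscript.title} get filled in. As a minimal sketch, assuming Python's built-in str.format is used, attribute-style placeholders could be rendered as below. The render_prompt helper and the metadata values are hypothetical and only serve as illustration.

```python
from types import SimpleNamespace


def render_prompt(template: str, manuscript: SimpleNamespace) -> str:
    """Fill {manuscript.*} placeholders using str.format attribute lookups."""
    return template.format(manuscript=manuscript)


# Hypothetical metadata; in practice it would come from the manuscript itself.
manuscript = SimpleNamespace(
    title="An example manuscript",
    keywords="manubot, ai, editing",
)

template = (
    "Prompt content that can access the {manuscript.title} "
    "or the {manuscript.keywords}"
)
rendered = render_prompt(template, manuscript)
print(rendered)

# A prompt without placeholders passes through unchanged.
plain = render_prompt("This would be a default prompt.", manuscript)
print(plain)
```

Note that str.format only resolves attribute and index lookups, not method calls, so a placeholder like {file.section.capitalize()} would need extra handling (for example, passing in a precomputed value).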

Notes:

ai_revision-config.yaml

The file has the following structure:

files:
  matchings:
    # in-order list for matching. for each file, find the first entry that matches file(s) and
    #  apply prompt(s).
    - files:
        # always interpreted as regex
        - abstract
        - 04\..*-supplement\.md
      prompt: prompt_name

  # default prompt for files not matched in list above. can also be omitted for no
  #  fallback (file is ignored). also, regex matching above can accommodate
  #  "quasi-defaults" for higher-level-granularity distinctions (maybe like .md files
  #  and .txt files?), i.e. patterns that match many but not all files.
  default_prompt: some_fallback_prompt

  # file(s) to ignore (not revise). overrides `default_prompt` and `matchings`.
  ignore:
    - data
    - quote-that-shouldnt-be-revised
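
The lookup order described in the comments above (ignore first, then the first matching entry, then the default) can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: resolve_prompt is a hypothetical name, the config is shown as an already-parsed dict, and patterns are assumed to be applied with an unanchored re.search.

```python
import re
from typing import Optional

# Hypothetical in-memory form of ai_revision-config.yaml after YAML parsing.
config = {
    "files": {
        "matchings": [
            {"files": ["abstract", r"04\..*-supplement\.md"], "prompt": "prompt_name"},
        ],
        "default_prompt": "some_fallback_prompt",
        "ignore": ["data", "quote-that-shouldnt-be-revised"],
    }
}


def resolve_prompt(filename: str, config: dict) -> Optional[str]:
    """Return the prompt name for a file, or None if the file is not revised."""
    files_cfg = config.get("files", {})

    # 1) `ignore` overrides `matchings` and `default_prompt`.
    if any(re.search(p, filename) for p in files_cfg.get("ignore", [])):
        return None

    # 2) The first entry in `matchings` with a matching pattern wins.
    for entry in files_cfg.get("matchings", []):
        if any(re.search(p, filename) for p in entry["files"]):
            return entry["prompt"]

    # 3) Fall back to `default_prompt`; None when it is omitted.
    return files_cfg.get("default_prompt")


for name in ["01.abstract.md", "04.10-supplement.md", "05.discussion.md", "data.md"]:
    print(name, "->", resolve_prompt(name, config))
```

Because the ignore patterns are unanchored here, a pattern like data would also exclude metadata.yaml; anchoring would be a separate design decision.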

Full examples

Only ai_revision-prompts.yaml is defined

Files under content/ folder (file names modified from the original manuscript):

final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results.framework.md
04.05.01.results.crispr.md
04.15.results.drug_disease_prediction.md
04.20.00.results.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml

ai_revision-prompts.yaml

prompts_files:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  introduction|discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      most references to figures and tables are kept,
      the details are enough to clearly explain the outcomes,
      sentences are concise and to the point,
      the text minimizes the use of jargon,
      the text grammar is correct,
      spelling errors are fixed,
      and the text has a clear sentence structure

  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
       most of the citations to other academic papers are kept,
       most of the technical details are kept,
       most references to equations (such as "Equation (@id)") are kept,
       all equations definitions (such as '*equation_definition') are included with newlines before and after,
       the most important symbols in equations are defined,
       the text grammar is correct,
       spelling errors are fixed,
       and the text has a clear sentence structure

  references: null

  \.md$: |
    Proofread the following paragraph

Notes:

ai_revision-config.yaml

This file does not exist in this example.
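
With only the prompts file present, each top-level key acts as a regex and the first matching key decides the prompt. A minimal sketch of that lookup against the file listing above follows. It assumes keys are tried in file order with an unanchored re.search; pick_prompt_key and is_revised are hypothetical names, and the prompt texts are abbreviated.

```python
import re
from typing import Optional

# Keys and order mirror the prompts_files example; prompt texts are abbreviated.
prompts_files = {
    "abstract": "Revise ... Abstract ...",
    "introduction|discussion": "Revise ... Introduction or Discussion ...",
    "results": "Revise ... Results ...",
    "methods": "Revise ... Methods ...",
    "references": None,  # YAML null: matching files are skipped
    r"\.md$": "Proofread the following paragraph",
}


def pick_prompt_key(filename: str, prompts_files: dict) -> Optional[str]:
    """Return the first key (regex) that matches the file name, or None."""
    for pattern in prompts_files:  # dicts preserve insertion order
        if re.search(pattern, filename):
            return pattern
    return None


def is_revised(filename: str, prompts_files: dict) -> bool:
    """A file is revised only if a key matches and its prompt is not null."""
    key = pick_prompt_key(filename, prompts_files)
    return key is not None and prompts_files[key] is not None


for name in [
    "01.abstract.md",
    "04.05.01.results.crispr.md",
    "05.discussion.md",
    "10.references.md",
    "15.acknowledgements.md",
    "metadata.yaml",
]:
    print(name, "->", pick_prompt_key(name, prompts_files), is_revised(name, prompts_files))
```

Under these assumptions, manual-references.json also matches the references key and is skipped, while metadata.yaml matches nothing and is left alone.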

Both ai_revision-prompts.yaml and ai_revision-config.yaml are defined

Files under content/ folder:

final_figures/
images/
00.front-matter.md
01.abstract.md
02.introduction.md
04.00.results.md
04.05.00.results_framework.md
04.05.01.crispr.md
04.15.drug_disease_prediction.md
04.20.00.traits_clustering.md
05.discussion.md
07.00.methods.md
10.references.md
15.acknowledgements.md
50.00.supplementary_material.md
manual-references.json
metadata.yaml

ai_revision-prompts.yaml

prompts:
  abstract: |
    Revise the following paragraph from the Abstract of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  introduction_discussion: |
    Revise the following paragraph from the {file.section.capitalize()} of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      the research problem/question is clear,
      the solution proposed is clear,
      the text grammar is correct,
      spelling errors are fixed,
      and the text is in active voice and has a clear sentence structure

  results: |
    Revise the following paragraph from the Results section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
      most references to figures and tables are kept,
      the details are enough to clearly explain the outcomes,
      sentences are concise and to the point,
      the text minimizes the use of jargon,
      the text grammar is correct,
      spelling errors are fixed,
      and the text has a clear sentence structure

  methods: |
    Revise the paragraph(s) below from the Methods section of an academic paper (with the title '{manuscript.title}' and keywords '{manuscript.keywords}') so
       most of the citations to other academic papers are kept,
       most of the technical details are kept,
       most references to equations (such as "Equation (@id)") are kept,
       all equations definitions (such as '*equation_definition') are included with newlines before and after,
       the most important symbols in equations are defined,
       the text grammar is correct,
       spelling errors are fixed,
       and the text has a clear sentence structure

  default: |
    Proofread the following paragraph

Notes:

ai_revision-config.yaml

files:
  matchings:
    - files:
        - abstract
      prompt: abstract
    - files:
        - introduction
      prompt: introduction_discussion
    - files:
        - 04\..+\.md
      prompt: results
    - files:
        - discussion
      prompt: introduction_discussion
    - files:
        - methods
      prompt: methods

  default_prompt: default

  ignore:
    - front\-matter
    - acknowledgements
    - supplementary_material
    - references

Notes:

Only a single, generic prompt is defined

Files under content/ folder:

images/
media/
00.front-matter.md
01.abstract.md
05.main-text.md
90.back-matter.md
manual-references-2023-04-06.json
manual-references.yaml
metadata.yaml
response-to-reviewers.md

ai_revision-prompts.yaml

prompts:
  \.md$: |
    Proofread the following paragraph

ai_revision-config.yaml

files:
  ignore:
    - front\-matter
    - back\-matter
    - response\-to\-reviewers

Notes:

Testing

vincerubinetti commented 1 year ago

The top level prompts: and sections: seem unnecessary since they should be clear from the name of the file? I guess if you want to do the advanced templating/variables, it's necessary for the prompts file.


Should section in ai_revision-prompts.yaml be sections and accept a list? Seems like it could be useful. Also, if it's only ever just one section, you might as well just remove section, and have the upper level key e.g. prompt_name_1 simply specify prompt type. At that point prompt_name_1 would only serve to be a descriptive name of the prompt, which is valuable, except that that name isn't being used anywhere else and a yaml comment could suffice. It kind of looks like that's what you're doing in the second half of your issue, so I'm not sure which is the actual proposed solution.


section: false # this has the same effect as omitting section

I think requiring this to be the string "default", instead of omitting or setting to false, would be better for clarity. Again it looks like you're doing that in the second half of your issue.


section_name_0 might be more clear as section_type_0, and likewise, section(s) could be explicitly section_types. This might make it more clear there's sort of an intermediate "mapping" stage done via an abstract "type of section", and that these things do not (always) map directly to filenames. Not trying to be pedantic, but making sure to use distinct terminology here might help reduce confusion.


disable_revision: true. I get why you have this, because you want files not listed to be sent to the default prompt. But it feels a little weird to say "disable revision for these files, but still here's the prompt name/section type it would be under if it was enabled (because it's required for the schema)". Maybe section_name_2 could be a special ignore/bypass/disable keyword for such things.


The example above intentionally mixes the use of variables in YAML and the entire prompt without variables. The latter is clearer, but the former allows the reuse of instruction blocks.

Can both ways be supported, and leave this up to the user? I'd certainly put the simpler, explicit way up front, then in a separate "advanced" section of the documentation, give examples of using variables.

Also is this the same as YAML anchors? It looks like similar syntax with the & and *, but not quite the same as what I've seen before. If we could leverage established YAML syntax and features for this that would be great. But it looks like you're doing some more advanced templating that I'm pretty sure isn't in YAML.


Maybe metadata.yaml could be something more specific and descriptive, like sections (if the top level key is going to be just that), or assignment or map?


If the section name only contains a string (like discussion), then it is interpreted as the file name under content/.

Is there supposed to be a companion sentence to this regarding lists or regexes? Because it seems like section can only ever be a string?


files can be either a list or a string. If it is a list, then it contains strings with file names under content/. If it is a string, it's interpreted as a regular expression.

Maybe this could just always be a regex? I could see wanting a list of regexes. This would make it more of a pain to explicitly match filenames fully like 01.something.md because you'd have to escape the periods, but also how often is that needed. That is, you could just write "abstract" to match 01.abstract.md since there's probably not gonna be a conflict like 01.abstract-reprise.md.


Files that are not selected in this file are not revised by the tool.

Then what is the point of disable_revision: true? I'm thinking that maybe the two halves of your issue are actually separate proposals/solutions, but I wasn't completely clear on that from the language.

This would also make section: introduction in the first file kind of pointless too... I guess you intended that key to be for when users share configs, the prompts have certain filenames that they get applied to by default?

vincerubinetti commented 1 year ago

Based on the above, here's my proposal. This does not tackle the templating/variable problem.

prompts.yaml:

# variables and templating stuff

prompts:
  # if this prompt goes unused in sections.yaml (or there's no sections.yaml at all), automatically apply this to filenames matching the prompt_name
  prompt_name: |
      Prompt content with {templating} and #such.

Not sure about the auto-applying "magic". Makes prompts more shareable without additional configuration. But might lead to some people being confused at unexpected revision? Then again, you have to explicitly opt-in to using this AI-revision in the first place, and you'd be manually pasting the prompt in there, so that's kind of giving clear intention/consent. Alternative: just say this should be a descriptive name and/or have a YAML comment to just suggest to other users which types of sections it should be applied to.

sections.yaml:

sections:
  # in-order list for matching. for each file, find the first entry that matches file(s) and apply prompt(s).
  - files:
      # always interpreted as regex
      - abstract
      - 04\..*-supplement\.md
    prompts:
      - prompt_name

# default prompt(s) for files not matched in list above. can also be omitted for no fallback. also, regex matching above can accommodate "quasi-defaults" for higher-level-granularity distinctions (maybe like .md files and .txt files?), i.e. patterns that match many but not all files.
default:
  - some_fallback_prompt

# file(s) to ignore (not revise). overrides `default` and `sections`.
ignore:
  - data
  - quote-that-shouldnt-be-revised

Lists of strings in the structures above can also just be single strings.

d33bs commented 1 year ago

Great stuff! Adding a few thoughts below. Some of these may be outside the scope of this issue directly but I felt might spur conversation towards developments here.

Overall feedback:

Feedback on ai_revision-prompts.yaml:

miltondp commented 1 year ago

Thank you, @vincerubinetti! Those are really great comments. I really liked your proposal, which is much simpler than mine (I will look at it more thoroughly shortly). Below I reply to some of your comments, but I agree with everything you said.

The top level prompts: and sections: seem unnecessary since they should be clear from the name of the file? I guess if you want to do the advanced templating/variables, it's necessary for the prompts file.

I think we should decide on a structure that won't need to be changed later, so we should try to anticipate potential future uses without adding unnecessary functionality now. If those top-level elements are necessary to use a YAML feature that is potentially useful in the future, we should keep them. Otherwise, I'm ok with removing them.

Can both ways be supported, and leave this up to the user? I'd certainly put the simpler, explicit way up front, then in a separate "advanced" section of the documentation, give examples of using variables.

Absolutely!

Also is this the same as YAML anchors? It looks like similar syntax with the & and *, but not quite the same as what I've seen before. If we could leverage established YAML syntax and features for this that would be great. But it looks like you're doing some more advanced templating that I'm pretty sure isn't in YAML.

Ah, yeah, "anchors" is the name! I used it before but didn't remember. Yes, absolutely; the idea is to use established YAML syntax, so let's try to avoid any non-standard YAML syntax.

Maybe this could just always be a regex? I could see wanting a list of regexes. This would make it more of a pain to explicitly match filenames fully like 01.something.md because you'd have to escape the periods, but also how often is that needed. That is, you could just write "abstract" to match 01.abstract.md since there's probably not gonna be a conflict like 01.abstract-reprise.md.

I agree, let's use regex for file names.

vincerubinetti commented 1 year ago

For maximum clarity for the user, I think we should avoid the per-prompt magic I had mentioned. That is, files only get matched based on prompt names in prompts.yaml if the user has no config.yaml at all. How we communicate this to the user should be: if you have no config file, the plugin basically creates one for you "under the hood" based on your prompt names.

Edit: based on our discussion in the meeting, in prompts.yaml we're explicitly going to have separate prompt_files and prompt_ids fields, the former to be used when you don't have a config file, and the latter to be used to explicitly link between prompts.yaml and config.yaml.
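
The "creates one for you under the hood" idea can be illustrated with a short sketch: each prompt name is reused both as the file regex and as the prompt identifier, in the order given. The implicit_config function is hypothetical, and the config layout follows the files/matchings structure from the issue description.

```python
def implicit_config(prompts_files: dict) -> dict:
    """Derive a config whose matchings reuse each prompt name as its file regex."""
    matchings = [
        {"files": [name], "prompt": name}  # the name doubles as pattern and prompt id
        for name in prompts_files  # insertion order is kept, so first match still wins
    ]
    return {"files": {"matchings": matchings}}


# Abbreviated prompts; None stands for a YAML null (file is skipped).
prompts_files = {
    "abstract": "Revise the abstract ...",
    "references": None,
    r"\.md$": "Proofread the following paragraph",
}

config = implicit_config(prompts_files)
for entry in config["files"]["matchings"]:
    print(entry)
```

Null prompts are kept in the ordered list rather than moved to an ignore list, since ignore overrides everything and would change which entry wins for files that match more than one name.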

vincerubinetti commented 1 year ago

@dhimmel @agitter Please take a look and weigh in at your convenience.

vincerubinetti commented 1 year ago

~Sorry to muddy the waters more, but I have one more thought after our meeting just now. With the changes we discussed, it's possible we could get away with only having the prompts file...~

# for each file, find first match, apply prompt, then stop
prompts:
  # matching a specific file
  some-regex.+\.md:
    some prompt content lorem ipsum

  # matching a specific file, then providing no prompt to effectively ignore it. must come before the default below so it gets matched first.
  references\.md:
    null

  # example of a "wide net" default. catches a lot of broad things but not everything.
  # single-quoted because \. is not a valid escape inside a double-quoted YAML string.
  '.*\.md':
    proofread this paragraph

  # "ultimate" default, catches absolutely everything else 
  ".*":
    proofread this paragraph.

~The upsides:~

~The downsides:~

Downsides are too strong, please ignore.

miltondp commented 1 year ago

Edit: based on our discussion in the meeting, in prompts.yaml we're explicitly going to have separate prompt_files and prompt_ids fields, the former to be used when you don't have a config file, and the latter to be used to explicitly link between prompts.yaml and config.yaml.

Ok! @falquaddoomi feel free to update the specification with this if you want.

dhimmel commented 1 year ago

I think I'll take a backseat on the schema design and focus my efforts on any questions related to integrating with Manubot if they arise.

Consider using a YAML schema of some kind to help validate the structure of these configuration files.

This is a good idea (something that has been on my mind for Manubot's metadata.yaml as well). Pydantic might be a good way to define the schema and validate the prompt data. Pydantic can export to jsonschema.
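
As a rough illustration of what such validation might check, here is a dependency-free sketch in plain Python. It is a stand-in for a Pydantic model or JSON Schema, not an example of either, and validate_config is a hypothetical name. It checks that every referenced prompt exists and that every file pattern compiles as a regex.

```python
import re


def validate_config(config: dict, prompts: dict) -> list:
    """Return a list of problems; an empty list means the two files look consistent."""
    problems = []
    files_cfg = config.get("files", {})

    # Gather every regex so each one can be compiled as a syntax check.
    patterns = list(files_cfg.get("ignore", []))
    for entry in files_cfg.get("matchings", []):
        patterns.extend(entry.get("files", []))
        if entry.get("prompt") not in prompts:
            problems.append(f"unknown prompt: {entry.get('prompt')!r}")

    default = files_cfg.get("default_prompt")
    if default is not None and default not in prompts:
        problems.append(f"unknown default_prompt: {default!r}")

    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"invalid regex {pattern!r}: {exc}")

    return problems


# Hypothetical inputs with two deliberate mistakes: a missing prompt and a bad regex.
prompts = {"abstract": "...", "default": "..."}
config = {
    "files": {
        "matchings": [
            {"files": ["abstract"], "prompt": "abstract"},
            {"files": [r"04\..+\.md"], "prompt": "results"},
        ],
        "default_prompt": "default",
        "ignore": [r"front\-matter", "("],
    }
}

problems = validate_config(config, prompts)
for problem in problems:
    print(problem)
```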

agitter commented 1 year ago

I found the initial proposal from last week fairly complex when I looked through the actual examples. The modified version from @vincerubinetti that listed prompts and then mapped prompts to files seemed simpler, but I'm still taking this all in.

One general goal of Manubot has been to make it accessible to a broader audience that may be comfortable editing content through GitHub but not have many computational skills beyond that. That audience will not necessarily be able to use regular expressions to map prompts to files even if that is the best way to control the mapping.

Is there a reason for the mixed punctuation in the file names like ai_revision-prompts.yaml?

vincerubinetti commented 1 year ago

That audience will not necessarily be able to use regular expressions to map prompts to files

Is there a reason for the mixed punctuation in the file names like ai_revision-prompts.yaml?

This is a fair point; we want to keep it simple. However, if I'm understanding you right, there doesn't really have to be a mixed use case in prompts. A plain string such as abstract is still a valid regex that would match e.g. 01.abstract.md. It only becomes a pain when users want to exact-match names with dots in them, e.g. they'd have to write 01\.abstract\.md.

agitter commented 1 year ago

Right, having to escape the dots in exact filenames could surprise a less experienced user. You may still decide regex is the best way to go, in which case I suggest examples or comments that help explain the behavior.
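
The kind of example that could accompany the documentation might look like the sketch below. It uses Python's re module and assumes patterns are applied with an unanchored search, which the thread implies but does not confirm.

```python
import re

filename = "01.abstract.md"

# A plain word is already a valid regex and matches anywhere in the name.
plain_match = bool(re.search("abstract", filename))
print(plain_match)

# The same plain word also matches other names that contain it.
also_matches = bool(re.search("abstract", "01.abstract-reprise.md"))
print(also_matches)

# Unescaped dots match any character, so this pattern is looser than it looks.
loose_match = bool(re.search("01.abstract.md", "01-abstract-md"))
print(loose_match)

# Escaping the dots (by hand or with re.escape) restricts the match to the exact name.
exact = re.escape(filename)
print(exact)
exact_hit = bool(re.fullmatch(exact, filename))
exact_miss = bool(re.fullmatch(exact, "01-abstract-md"))
print(exact_hit, exact_miss)
```

For users who never need exact names, the first case is the only one they have to know about; the escaping rules matter only once a pattern contains dots or other special characters.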

castedo commented 11 months ago

Just an FYI on some related new features and choices I made for https://copyaid.it inspired by this project.

1) I've started using TOML instead of YAML for configuration files and OpenAI API request settings. It's not a huge deal, but I am finding the simplicity of TOML preferable and I think new users will too. There are gotchas with YAML where you have to double quote because some weird sequence is some YAML power feature.

2) The default prompts installed with CopyAId are for the new GPT-4 Turbo model and the source is here: https://gitlab.com/castedo/copyaid/-/tree/main/copyaid/data (the *-example.toml files). I've decided to go ahead and ditch GPT-3.5. The November gpt-3.5-turbo upgrade just seemed overall worse than the June gpt-3.5-turbo. The GPT-4 Turbo is less annoying and is more promising.

3) I'm doing "blast" testing of different prompts and models here https://gitlab.com/castedo/copyblast where I run various prompts in request settings files against various source files to see how different request settings perform.

miltondp commented 11 months ago

Hey @castedo, thanks for sharing this. I'll take a look at your prompts and tests. We were researching ways of testing prompts as well and found promptfoo interesting.

castedo commented 11 months ago

Thx for the link to promptfoo, I had not seen that.

I think testing via something like promptfoo might work well for proofreading. I'm still experimenting with how I write with copyaid.it, but it feels like there is one particular kind of workflow, which is proofreading, especially pre-git-commit proofreading. Proofreading tends to have a predictable, clear, correct response expected.

Another key to proofreading is it saves response time and money to have OpenAI respond with just "OK" when there are no corrections to be made.
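
On the client side, that convention could be handled with something as small as the sketch below. It assumes the model is instructed to reply with exactly "OK" when nothing needs fixing; apply_proofreading is a hypothetical name and is not taken from CopyAId.

```python
def apply_proofreading(original: str, response: str) -> str:
    """Return the revised text, or the original when the model replies with just "OK"."""
    if response.strip() == "OK":
        return original  # nothing to correct, keep the source text untouched
    return response


unchanged = apply_proofreading("This sentence is fine.", "OK\n")
corrected = apply_proofreading("Teh cat sat.", "The cat sat.")
print(unchanged)
print(corrected)
```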

Maybe some testing around proofreading is something we should share. I'll keep you posted next time I'm upgrading my proofreading tests (maybe upgrading to promptfoo).