monarch-initiative / ontogpt

LLM-based ontological extraction tools, including SPIRES
https://monarch-initiative.github.io/ontogpt/
BSD 3-Clause "New" or "Revised" License

How to customize or configure prompts for languages other than English? #440

Status: Closed (fishfree closed 2 weeks ago)

fishfree commented 3 weeks ago

I'd like to extract ontologies from non-English texts. Many thanks!

caufieldjh commented 3 weeks ago

Hi @fishfree - depending on the kind of text, you may not have to do much.

For example, if I use an extraction template like this:

id: http://w3id.org/ontogpt/food
name: food
title: Food Extraction Template
description: >-
  A template for extracting food names and terms from text.
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  foodon: http://purl.obolibrary.org/obo/foodon_
  GO: http://purl.obolibrary.org/obo/GO_
  food: http://w3id.org/ontogpt/food
  linkml: https://w3id.org/linkml/

default_prefix: food
default_range: string

imports:
  - linkml:types
  - core

classes:
  FoodSet:
    tree_root: true
    is_a: NamedEntity
    attributes:
      terms:
        range: FoodTerm
        multivalued: true
        description: >-
          A semicolon-separated list of any names of foods.

  FoodTerm:
    is_a: NamedEntity
    id_prefixes:
      - FOODON
    annotations:
      annotators: sqlite:obo:foodon
      prompt: >-
        The name of a food.
        Examples include: apple juice,
        okra pod, chocolate substitute,
        breakfast cereal, tuna (flaked, canned),
        beef chuck roast

And an input document like this:

My Shopping List
apples
bananas
canned soup
carrot cake
flour
cocoa powder
coffee

then the extraction result with llama3.1 405B is:

extracted_object:
  id: 5feb617e-1866-4f3d-818f-c2cdf54a839e
  label: My Shopping List
  terms:
    - FOODON:00002473
    - FOODON:00004183
    - AUTO:canned%20soup
    - FOODON:00002515
    - FOODON:03301116
    - FOODON:03301072
    - FOODON:03301036
named_entities:
  - id: FOODON:00002473
    label: apples
  - id: FOODON:00004183
    label: bananas
  - id: AUTO:canned%20soup
    label: canned soup
  - id: FOODON:00002515
    label: carrot cake
  - id: FOODON:03301116
    label: flour
  - id: FOODON:03301072
    label: cocoa powder
  - id: FOODON:03301036
    label: coffee

Great so far, but that's just English. In French, the input is:

  Ma liste de courses
  pommes
  bananes
  soupe en conserve
  gâteau aux carottes
  farine
  poudre de cacao
  café

And unsurprisingly, we get good extraction but poor grounding, because the source ontology (FOODON) is all English:

extracted_object:
  id: d648384e-9b66-4789-9f2d-a731e6fcf614
  label: Ma liste de courses
  terms:
    - AUTO:pommes
    - AUTO:bananes
    - AUTO:soupe%20en%20conserve
    - AUTO:g%C3%A2teau%20aux%20carottes
    - AUTO:farine
    - AUTO:poudre%20de%20cacao
    - AUTO:caf%C3%A9
named_entities:
  - id: AUTO:pommes
    label: pommes
  - id: AUTO:bananes
    label: bananes
  - id: AUTO:soupe%20en%20conserve
    label: soupe en conserve
  - id: AUTO:g%C3%A2teau%20aux%20carottes
    label: gâteau aux carottes
  - id: AUTO:farine
    label: farine
  - id: AUTO:poudre%20de%20cacao
    label: poudre de cacao
  - id: AUTO:caf%C3%A9
    label: café
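To make "good extraction but poor grounding" concrete, here is a minimal Python sketch that splits the term IDs from the output above into grounded CURIEs and ungrounded AUTO: placeholders, decoding the percent-encoded text along the way. The helper name split_by_grounding is hypothetical and not part of OntoGPT; it only assumes that ungrounded terms carry the AUTO: prefix, as they do in the results shown here.

```python
from urllib.parse import unquote


def split_by_grounding(term_ids):
    """Separate grounded CURIEs from ungrounded AUTO: placeholders.

    Hypothetical helper for illustration; not part of OntoGPT.
    """
    grounded, ungrounded = [], []
    for term_id in term_ids:
        prefix, _, local = term_id.partition(":")
        if prefix == "AUTO":
            # AUTO: IDs hold the percent-encoded source text, so decode it.
            ungrounded.append(unquote(local))
        else:
            grounded.append(term_id)
    return grounded, ungrounded


# Term IDs copied from the French extraction result above.
french_terms = [
    "AUTO:pommes",
    "AUTO:bananes",
    "AUTO:soupe%20en%20conserve",
    "AUTO:g%C3%A2teau%20aux%20carottes",
    "AUTO:farine",
    "AUTO:poudre%20de%20cacao",
    "AUTO:caf%C3%A9",
]

grounded, ungrounded = split_by_grounding(french_terms)
print(f"{len(grounded)} grounded, {len(ungrounded)} ungrounded")
print(ungrounded)
```

Running the same helper on the English result would report six FOODON IDs as grounded and only "canned soup" as ungrounded.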

So this time, if I add --system-message "Please translate the input text to English before performing any further operations." to the extract command, I get both good extraction and good grounding:

extracted_object:
  id: d353a257-2db0-4697-8c6f-7f62d0162874
  label: My Shopping List
  terms:
    - FOODON:00002473
    - FOODON:00004183
    - AUTO:canned%20soup
    - FOODON:00002515
    - FOODON:03301116
    - FOODON:03301072
    - FOODON:03301036
named_entities:
  - id: FOODON:00002473
    label: apples
  - id: FOODON:00004183
    label: bananas
  - id: AUTO:canned%20soup
    label: canned soup
  - id: FOODON:00002515
    label: carrot cake
  - id: FOODON:03301116
    label: flour
  - id: FOODON:03301072
    label: cocoa powder
  - id: FOODON:03301036
    label: coffee
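For anyone scripting this, the command can be assembled roughly as in the Python sketch below. The --system-message option and the food template name come from this thread; the -t and -i options are assumed to select the template and the input file, and the file name liste_de_courses.txt is a made-up placeholder, so check ontogpt extract --help for your installed version.

```python
import shlex

# System message quoted from the comment above.
TRANSLATE_FIRST = (
    "Please translate the input text to English "
    "before performing any further operations."
)


def build_extract_command(template, input_path, system_message=None):
    """Build an argv list for `ontogpt extract` (hypothetical helper).

    Assumes -t selects the template and -i the input file.
    """
    argv = ["ontogpt", "extract", "-t", template, "-i", input_path]
    if system_message:
        argv += ["--system-message", system_message]
    return argv


# The input file name is a placeholder for illustration.
cmd = build_extract_command("food", "liste_de_courses.txt", TRANSLATE_FIRST)
print(shlex.join(cmd))
```

The sketch only prints the command. To actually run it, pass cmd to subprocess.run(cmd, check=True) in an environment where OntoGPT and a model backend are configured.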

This instruction could also go in the prompt itself, which may be preferable if you want to keep some terms in their original language but translate others. You can also write the prompt instructions in the target (non-English) language. That may help in some cases, but it is not guaranteed to improve the final results.

caufieldjh commented 3 weeks ago

For reference, the above template is here: https://github.com/monarch-initiative/ontogpt/blob/main/src/ontogpt/templates/food.yaml (it's just named food)

fishfree commented 3 weeks ago

@caufieldjh Thank you very much! I will give it a try.

caufieldjh commented 2 weeks ago

Closing as we have a solution - feel free to open a new issue if any problems arise with non-English extractions.