Hi @fishfree - depending on the kind of text, you may not have to do much.
For example, if I use an extraction template like this:
id: http://w3id.org/ontogpt/food
name: food
title: Food Extraction Template
description: >-
  A template for extracting food names and terms from text.
license: https://creativecommons.org/publicdomain/zero/1.0/
prefixes:
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  foodon: http://purl.obolibrary.org/obo/foodon_
  GO: http://purl.obolibrary.org/obo/GO_
  food: http://w3id.org/ontogpt/food
  linkml: https://w3id.org/linkml/
default_prefix: food
default_range: string
imports:
  - linkml:types
  - core
classes:
  FoodSet:
    tree_root: true
    is_a: NamedEntity
    attributes:
      terms:
        range: FoodTerm
        multivalued: true
        description: >-
          A semicolon-separated list of any names of foods.
  FoodTerm:
    is_a: NamedEntity
    id_prefixes:
      - FOODON
    annotations:
      annotators: sqlite:obo:foodon
      prompt: >-
        The name of a food.
        Examples include: apple juice,
        okra pod, chocolate substitute,
        breakfast cereal, tuna (flaked, canned),
        beef chuck roast
And an input document like this:
My Shopping List
apples
bananas
canned soup
carrot cake
flour
cocoa powder
coffee
then the extraction result with llama3.1 405B is:
extracted_object:
  id: 5feb617e-1866-4f3d-818f-c2cdf54a839e
  label: My Shopping List
  terms:
    - FOODON:00002473
    - FOODON:00004183
    - AUTO:canned%20soup
    - FOODON:00002515
    - FOODON:03301116
    - FOODON:03301072
    - FOODON:03301036
named_entities:
  - id: FOODON:00002473
    label: apples
  - id: FOODON:00004183
    label: bananas
  - id: AUTO:canned%20soup
    label: canned soup
  - id: FOODON:00002515
    label: carrot cake
  - id: FOODON:03301116
    label: flour
  - id: FOODON:03301072
    label: cocoa powder
  - id: FOODON:03301036
    label: coffee
Great so far, but that's just English. In French, the input is:
Ma liste de courses
pommes
bananes
soupe en conserve
gâteau aux carottes
farine
poudre de cacao
café
And unsurprisingly, we get good extraction but poor grounding, because the source ontology (FOODON) is all English.
extracted_object:
  id: d648384e-9b66-4789-9f2d-a731e6fcf614
  label: Ma liste de courses
  terms:
    - AUTO:pommes
    - AUTO:bananes
    - AUTO:soupe%20en%20conserve
    - AUTO:g%C3%A2teau%20aux%20carottes
    - AUTO:farine
    - AUTO:poudre%20de%20cacao
    - AUTO:caf%C3%A9
named_entities:
  - id: AUTO:pommes
    label: pommes
  - id: AUTO:bananes
    label: bananes
  - id: AUTO:soupe%20en%20conserve
    label: soupe en conserve
  - id: AUTO:g%C3%A2teau%20aux%20carottes
    label: gâteau aux carottes
  - id: AUTO:farine
    label: farine
  - id: AUTO:poudre%20de%20cacao
    label: poudre de cacao
  - id: AUTO:caf%C3%A9
    label: café
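To make the difference between the two runs concrete, here is a minimal sketch that splits a list of term IDs into grounded ones and `AUTO:` fallbacks. The function name `split_by_grounding` is hypothetical and not part of OntoGPT; the ID lists are copied by hand from the outputs above, and the sketch assumes that an `AUTO:` ID is simply the percent-encoded source text.

```python
from urllib.parse import unquote


def split_by_grounding(term_ids):
    """Separate grounded IDs from AUTO: fallbacks (hypothetical helper)."""
    grounded, ungrounded = [], []
    for term_id in term_ids:
        prefix, _, local = term_id.partition(":")
        if prefix == "AUTO":
            # Assumption: the local part is the percent-encoded input text.
            ungrounded.append(unquote(local))
        else:
            grounded.append(term_id)
    return grounded, ungrounded


# Term IDs copied from the English and French results above.
english_terms = [
    "FOODON:00002473",
    "FOODON:00004183",
    "AUTO:canned%20soup",
    "FOODON:00002515",
    "FOODON:03301116",
    "FOODON:03301072",
    "FOODON:03301036",
]
french_terms = [
    "AUTO:pommes",
    "AUTO:bananes",
    "AUTO:soupe%20en%20conserve",
    "AUTO:g%C3%A2teau%20aux%20carottes",
    "AUTO:farine",
    "AUTO:poudre%20de%20cacao",
    "AUTO:caf%C3%A9",
]

for name, terms in [("English", english_terms), ("French", french_terms)]:
    grounded, ungrounded = split_by_grounding(terms)
    print(f"{name}: {len(grounded)} grounded, ungrounded = {ungrounded}")
```

A check like this is only a rough indicator: it counts fallbacks, but it says nothing about whether a grounded ID points at the right FOODON class.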
So this time, if I include --system-message "Please translate the input text to English before performing any further operations." along with the extract command, I get good extraction and grounding:
extracted_object:
  id: d353a257-2db0-4697-8c6f-7f62d0162874
  label: My Shopping List
  terms:
    - FOODON:00002473
    - FOODON:00004183
    - AUTO:canned%20soup
    - FOODON:00002515
    - FOODON:03301116
    - FOODON:03301072
    - FOODON:03301036
named_entities:
  - id: FOODON:00002473
    label: apples
  - id: FOODON:00004183
    label: bananas
  - id: AUTO:canned%20soup
    label: canned soup
  - id: FOODON:00002515
    label: carrot cake
  - id: FOODON:03301116
    label: flour
  - id: FOODON:03301072
    label: cocoa powder
  - id: FOODON:03301036
    label: coffee
This instruction could also be included in the prompt itself, which may be preferable if you want to retain some terms in their original language but translate others. Another option is to write the prompt instructions in the target (non-English) language; that may help in some cases, but it is not guaranteed to improve the final results.
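For anyone scripting this, a rough sketch of assembling the extract call with the translation instruction might look like the following. The helper `build_extract_command` and the file name `liste_de_courses.txt` are hypothetical; `-t` and `-i` are assumed here to be the template and input-file options of `ontogpt extract`, and model selection is left out, so check `ontogpt extract --help` for your installed version.

```python
import shlex

TRANSLATE_FIRST = (
    "Please translate the input text to English "
    "before performing any further operations."
)


def build_extract_command(template, input_path, system_message=None):
    """Build the argument list for an `ontogpt extract` call (hypothetical helper)."""
    # Assumption: -t selects the template and -i the input file.
    cmd = ["ontogpt", "extract", "-t", template, "-i", input_path]
    if system_message:
        # --system-message is the option used in the comment above.
        cmd += ["--system-message", system_message]
    return cmd


cmd = build_extract_command("food", "liste_de_courses.txt", TRANSLATE_FIRST)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; keeping the arguments as a list avoids quoting problems with the multi-word system message.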
For reference, the above template is here: https://github.com/monarch-initiative/ontogpt/blob/main/src/ontogpt/templates/food.yaml (it's just named food)
@caufieldjh Thank you very much! I will give it a try.
Closing as we have a solution - feel free to open a new issue if any problems arise with non-English extractions.
Because I'd like to extract ontologies from non-English texts. Many thanks!