monarch-initiative / curate-gpt

LLM-driven curation assist tool (pre-alpha)
https://monarch-initiative.github.io/curate-gpt/

Support triple extraction use case #33

Open caufieldjh opened 4 months ago

caufieldjh commented 4 months ago

In discussion with the RNA-KG group (Marco Mesiti, Elena Casiraghi, Emanuele Cavalleri) and @justaddcoffee - we would like to be able to extract triples (s, p, o) from a provided text, using graph embeddings to guide the process. The goal is to find additional content for RNA-KG. Using OntoGPT has worked well for this so far, but it does not take advantage of the relations already present in the KG.

This would involve:

Integrating some process for comparing the extracted triples would also be ideal (e.g., the relation between A and B appears in 20 documents, 15 of them from different sources, etc.) - something like the rough sketch below.
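A minimal sketch of that kind of aggregation, assuming extraction results come back as (triple, document_id, source) tuples (all names here are hypothetical, not part of curate-gpt):

```python
from collections import defaultdict

def summarize_triple_support(extractions):
    """Aggregate support for each extracted (s, p, o) triple across documents.

    `extractions` is an iterable of (triple, document_id, source) tuples,
    where `triple` is itself an (s, p, o) tuple of CURIEs.
    """
    docs_by_triple = defaultdict(set)
    sources_by_triple = defaultdict(set)
    for triple, doc_id, source in extractions:
        docs_by_triple[triple].add(doc_id)
        sources_by_triple[triple].add(source)
    return {
        triple: {
            "n_documents": len(docs),
            "n_distinct_sources": len(sources_by_triple[triple]),
        }
        for triple, docs in docs_by_triple.items()
    }
```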

The RNA-KG group has also suggested trying LlamaIndex (https://www.llamaindex.ai/) as an alternative vector store / retrieval layer, to see if it works better for RAG with KG data.
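For reference, a minimal LlamaIndex retrieval test might look like the sketch below; the import path is for llama-index >= 0.10 (older releases import directly from `llama_index`), the example documents are made up, and the default embedding/LLM backends (OpenAI, requiring an API key) are assumed unless configured otherwise:

```python
# Minimal LlamaIndex sketch: index KG-derived text snippets and query them.
from llama_index.core import Document, VectorStoreIndex

# Hypothetical KG-derived passages (e.g. verbalized RNA-KG triples or abstracts).
docs = [
    Document(text="miR-21 represses PTEN in hepatocellular carcinoma."),
    Document(text="HOTAIR interacts with PRC2 to silence HOXD genes."),
]

index = VectorStoreIndex.from_documents(docs)  # builds embeddings + in-memory vector store
query_engine = index.as_query_engine(similarity_top_k=2)
print(query_engine.query("Which genes does miR-21 regulate?"))
```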

cmungall commented 4 months ago

I'm not following the part about KG embeddings. I don't think we'd want a dependency on GRAPE here, but we do want to support people providing their own embeddings, e.g. via venomx. However, I don't see how GRAPE/node2vec-style embeddings would work with RAG.

Good suggestion to explore llamaindex. But I think this is orthogonal. See #34

justaddcoffee commented 4 months ago

Not sure exactly what Marco had in mind for using KG embeddings with RAG, but possibly something like: read in abstracts that may contain relations of interest, do NER/grounding to get IDs/CURIEs of interest from the text, then use the KG embeddings to pull those nodes and any related nodes, and send them along as context? Not sure.
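Roughly something like this sketch, assuming the grounding step and a CURIE-to-vector embedding dict are supplied by the caller (all helper names are hypothetical):

```python
import numpy as np

def rag_context_from_abstract(abstract, ground_entities, node_embeddings, node_labels, k=5):
    """Build RAG context for triple extraction from one abstract.

    `ground_entities(text)` -> list of grounded CURIEs (NER + grounding step).
    `node_embeddings`       -> dict mapping CURIE to a KG embedding vector (e.g. node2vec).
    `node_labels`           -> dict mapping CURIE to a human-readable label.
    """
    context_curies = set()
    for curie in ground_entities(abstract):
        if curie not in node_embeddings:
            continue
        context_curies.add(curie)
        # nearest neighbours in KG-embedding space = "related nodes"
        query = node_embeddings[curie]
        sims = {
            other: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
            for other, vec in node_embeddings.items()
            if other != curie
        }
        context_curies.update(sorted(sims, key=sims.get, reverse=True)[:k])
    # verbalize the nodes so they can be prepended to the extraction prompt
    return [f"{curie} ({node_labels.get(curie, 'unknown')})" for curie in sorted(context_curies)]
```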

justaddcoffee commented 4 months ago

Also, I agree that a GRAPE dependency might not be what we want here. I've made a draft PR #36 to support pulling embeddings from Hugging Face or any other URL.
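For illustration, loading precomputed embeddings from the Hub or a plain URL could look roughly like this; the `.npz` layout with `ids` and `vectors` arrays is an assumption for the sketch, not what #36 actually implements:

```python
import io

import numpy as np
import requests
from huggingface_hub import hf_hub_download

def load_embeddings_from_hf(repo_id, filename):
    """Download a precomputed embedding file from the Hugging Face Hub."""
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    data = np.load(path)  # assumed .npz with 'ids' and 'vectors' arrays
    return dict(zip(data["ids"].tolist(), data["vectors"]))

def load_embeddings_from_url(url):
    """Download the same kind of file from an arbitrary URL."""
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    data = np.load(io.BytesIO(resp.content))
    return dict(zip(data["ids"].tolist(), data["vectors"]))
```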