Integration of SPARQL with large language model related functionality

fsasaki commented 9 months ago

Why

Several vendors are looking into this space already, sometimes in relation to extended (vector based) search capabilities, sometimes in relation to more general large language model features like summarization or knowledge graph generation from unstructured text.

Previous work

Franz (RDF) https://franz.com/agraph/support/documentation/current/neuro-symbolic-llm-intro.html , for vector indexing with LLMs see https://franz.com/agraph/support/documentation/current/llm.html
Neptune (property graph part) https://docs.aws.amazon.com/neptune-analytics/latest/userguide/vector-index.html
neo4j (property graphs) https://github.com/neo4j/NaLLM?tab=readme-ov-file
Related discussions in https://github.com/w3c/sparql-dev/issues/40 and https://github.com/w3c/sparql-dev/issues/163

Proposed solution

Nothing concrete yet, currently gathering related work.

Considerations for backward compatibility

Too early to discuss.

hartig commented 9 months ago

In the context of a tutorial that I gave a few years ago, I collected information about the full-text search features provided by several triple store vendors (BlazeGraph, Virtuoso, AllegroGraph, Stardog, GraphDB). The latest version of my slides with this information can be found at the following address, where slides 24 to 41 are the relevant ones. https://www.ida.liu.se/research/semanticweb/events/SemWebCourse2019/TripleStores.pdf

ktk commented 9 months ago

@hartig great one, tnx

VladimirAlexiev commented 9 months ago

GraphDB supports the following: https://graphdb.ontotext.com/documentation/10.4/gpt-queries.html

magic predicates to ask an LLM for text, list or table using data from your KG:
query explanation
result explanation, summarization, rephrasing, translation

https://graphdb.ontotext.com/documentation/10.4/retrieval-graphdb-connector.html

Indexing of KG entities in a vector database
Supports any text embedding algorithm and vector database. We've played with Weaviate, Elastic, etc
Uses the same powerful connector (indexing) language that we use for Elastic, Solr, Lucene
Automatic synchronization of changes in RDF data to the KG entity index
Supports nested objects (but not yet in the UI)
Serializes KG entities to text like this:
```
Franvino:
is a RedWine.
made from grape Merlo.
made from grape Cabernet Franc.
has sugar dry.
has year 2012.
```
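As a rough illustration of this kind of entity-to-text serialization (not GraphDB's actual connector code), the sketch below turns a list of facts into the text shown above. The function name `serialize_entity` and the `(phrase, value)` input shape are assumptions made for the example.

```python
def serialize_entity(label, facts):
    """Render one KG entity as plain text, one short sentence per fact."""
    lines = [f"{label}:"]
    for phrase, value in facts:
        lines.append(f"{phrase} {value}.")
    return "\n".join(lines)


# Hypothetical input shape: (phrase, value) pairs derived from the entity's triples.
franvino_facts = [
    ("is a", "RedWine"),
    ("made from grape", "Merlo"),
    ("made from grape", "Cabernet Franc"),
    ("has sugar", "dry"),
    ("has year", 2012),
]

text = serialize_entity("Franvino", franvino_facts)
print(text)
```

The resulting string is what would be handed to a text embedding model before the vector is written to the index.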

https://graphdb.ontotext.com/documentation/10.4/talk-to-graph.html

A simple chatbot using a defined KG entity index

We are working on natural language querying (NLQ) aka knowledge graph question answering (KGQA). Cheers!

jpmccu commented 9 months ago

Interesting! I've started a plug-in for integrating vectors into SPARQL by using registered IRIs as defined vector spaces, and rdf:JSON literals as objects. Haven't made progress on the search side yet, but this is super relevant to many of our research projects.

Jamie McCusker (she/her/hers)

Director, Data Operations, Tetherless World Constellation, Rensselaer Polytechnic Institute, http://tw.rpi.edu

ericprud commented 9 months ago

@jpmccu , very cool. Any idea whether standardized value sets for vector spaces would allow "composition" of machines? Specific example: given a set of synaptic weights for diagnosing an ischemic stroke and another set for traffic patterns in a city, could one combine independently-trained machines in order to optimize stroke patient care (e.g. decide between close hospital or one further away that's good at angioplasty)? Sounds like you might be playing with stuff like that. Testing that in SPARQL would be very interesting indeed.

jpmccu commented 9 months ago

We assume that each vector space dimension is consistent (and is enforced before storage in the vector DB). One could concatenate vectors into a vector union in a new space, but we haven't really thought about doing multi-space comparisons.
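A minimal sketch of the two ideas in this paragraph, under assumptions: each vector space has a fixed dimension that is checked before storage, and two vectors can be concatenated into a new, larger space. The class `VectorSpace`, the function `concatenate` and the `urn:example:` IRIs are hypothetical and not taken from the plug-in.

```python
class VectorSpace:
    """A vector space identified by an IRI, with a fixed dimension."""

    def __init__(self, iri, dimension):
        self.iri = iri
        self.dimension = dimension

    def check(self, vector):
        """Return the vector as a list, or raise if its length does not fit."""
        if len(vector) != self.dimension:
            raise ValueError(
                f"{self.iri} expects {self.dimension} values, got {len(vector)}"
            )
        return list(vector)


def concatenate(space_a, vec_a, space_b, vec_b, new_iri):
    """Build a 'vector union': a new space whose vectors are vec_a followed by vec_b."""
    combined = space_a.check(vec_a) + space_b.check(vec_b)
    new_space = VectorSpace(new_iri, space_a.dimension + space_b.dimension)
    return new_space, combined


# Hypothetical spaces, loosely following the stroke/traffic example in the thread.
stroke_space = VectorSpace("urn:example:space/stroke", 3)
traffic_space = VectorSpace("urn:example:space/traffic", 2)

union_space, union_vec = concatenate(
    stroke_space, [0.1, 0.2, 0.3],
    traffic_space, [0.7, 0.9],
    "urn:example:space/stroke-traffic",
)
```

Concatenation only places the two vectors side by side; it does not make distances in one space comparable with distances in the other, which is the multi-space question left open above.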

Right now we just have the representation and a plug-in for whyis that intercepts the vectors as they're being published. We haven't done much more than brainstorm what the SPARQL would look like, beyond the BGPs for access looking like the RDF (using Jena PropertyFunctions) and ANN search using a PropertyFunction similar to the full text search module.

VladimirAlexiev commented 9 months ago

But @jpmccu and @ericprud, is it appropriate to store tensors in JSON? Shouldn't we think of appropriate binary formats like HDF5 or stores like TensorStore? There are also Data Abstraction Layers (eg GDAL) to isolate data access from the specific binary format/storage used.

Under https://accordproject.eu/ (automated compliance checking of architectural designs and urban planning) we're thinking about a binary data connector for GraphDB.

There's also https://github.com/schemaorg/schemaorg/issues/3140

jpmccu commented 9 months ago

They aren't actually stored in JSON, just represented that way. And within my system, we can add loaders for any useful format. JSON is useful because it can be embedded in Turtle easily, and I was able to create an RDFlib handler for it that didn't require serialization and deserialization, so they remain Python objects when put in memory graphs.
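To illustrate why a JSON representation embeds easily in Turtle, here is a small sketch that writes a Python list as an `rdf:JSON` typed literal. It is not the RDFlib handler mentioned above; the helper names, the `urn:example:` IRIs, and the choice to use the vector space IRI as the predicate are assumptions for the example.

```python
import json

# Datatype IRI for JSON literals (rdf:JSON).
RDF_JSON = "http://www.w3.org/1999/02/22-rdf-syntax-ns#JSON"


def vector_literal(vector):
    """Serialize a Python list as a Turtle literal typed rdf:JSON."""
    body = json.dumps(vector)
    # Escape backslashes and double quotes for a Turtle short string.
    escaped = body.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{RDF_JSON}>'


def vector_triple(subject_iri, space_iri, vector):
    """One Turtle statement linking an entity to its vector.

    Assumption: the vector space IRI is used as the predicate.
    """
    return f"<{subject_iri}> <{space_iri}> {vector_literal(vector)} ."


triple = vector_triple(
    "urn:example:Franvino",        # hypothetical entity IRI
    "urn:example:space/wine-v1",   # hypothetical vector space IRI
    [0.1, 0.2, 0.3],
)
print(triple)
```

Because the lexical form is plain JSON, a loader can call `json.loads` on it and keep the result as an ordinary Python object in an in-memory graph.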

fsasaki commented 9 months ago

Thanks a lot to all for the discussion so far. Let me try to structure this.

A lot of the discussion seems to be focused on how to encode vectors and how to use them for search.

And there are existing issues like https://github.com/w3c/sparql-dev/issues/40 related to that.

There has been almost no discussion of capabilities to query LLMs and to generate graphs from their output, e.g. the magic predicates mentioned by @VladimirAlexiev or the ones from Franz that I mentioned at the top of this issue. Do people see a use case for standardising these?

Also, how about use cases that go beyond querying but build on vector-based similarity? One could use this, for example, for KG construction (which could be built on top of queries via SPARQL CONSTRUCT) or validation ("check if everything which is skos:related is really semantically related").

fsasaki commented 4 months ago

Some updates on this topic with newer developments.

@rdfguy mentioned in the KGC panel discussions on KG standards that the combination of symbolic and statistical reasoning would be a potential future direction for graph technologies.

At Data Week Leipzig 2024, Lisa Wenige gave a 15-minute presentation on what this may look like, showing SPARQL extensions for LLMs; see https://www.youtube.com/watch?v=QfPCU8RiNhA&list=PLiyYYLqA8v5NBcAZJy6CpLVnDMrU4Y4yL&t=8344s

At the Knowledge Graph Conference, LLM support was shown by nearly all knowledge graph vendors. A few steps seem to be common to Graph RAG patterns:

Storing vectors for (parts of) a graph, see e.g. https://github.com/w3c/sparql-dev/issues/163
Providing vector generation capabilities. Common patterns seem to be: vector generation based on node descriptions or Concise Bounded Descriptions, or based on custom functions.
Provide similarity search based on vectors.
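The three steps above can be sketched end to end in a few lines. This is only an illustration under assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is an in-memory dict rather than a vector store, and `ex:Blanquito` is a made-up entity.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace(".", " ").replace(":", " ")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index, query, k=1):
    """Return the k entity IRIs whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda iri: cosine(index[iri], q), reverse=True)
    return ranked[:k]


# Step 2: generate vectors from node descriptions (hypothetical data).
descriptions = {
    "ex:Franvino": "Franvino: is a RedWine. made from grape Merlo. has sugar dry.",
    "ex:Blanquito": "Blanquito: is a WhiteWine. made from grape Riesling. has sugar sweet.",
}
# Step 1: store one vector per entity.
index = {iri: embed(text) for iri, text in descriptions.items()}
# Step 3: similarity search.
top = search(index, "dry red wine from Merlo")
print(top)
```

In a real system the vectors would come from an embedding model and be stored alongside or inside the graph, but the division of labour between the three steps stays the same.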

As pointed out previously in this issue, many of these topics are related to search. Now there seem to be functionalities beyond search, e.g.

Generate content using LLMs. Content can be textual content but also further graph structures
Validate based on statistical inferences, e.g. have SHACL constraints that a skos:matches relation can be justified by a statistical inference.
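The validation idea can be sketched as a similarity threshold applied to asserted links. This is an assumption-laden illustration, not a SHACL extension: the vectors are invented, the threshold of 0.5 is arbitrary, and `unsupported_links` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def unsupported_links(links, vectors, threshold=0.5):
    """Return asserted links whose endpoints are less similar than the threshold."""
    return [
        (s, o) for s, o in links
        if cosine(vectors[s], vectors[o]) < threshold
    ]


# Hypothetical embeddings for three concepts.
vectors = {
    "ex:wine": [0.9, 0.1, 0.0],
    "ex:grape": [0.8, 0.3, 0.1],
    "ex:tractor": [0.0, 0.2, 0.9],
}
# Pairs asserted as related in the graph.
links = [("ex:wine", "ex:grape"), ("ex:wine", "ex:tractor")]

flagged = unsupported_links(links, vectors)
print(flagged)
```

A flagged pair is a candidate for review rather than a proven error, since the outcome depends on the embedding model and the chosen threshold.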

I am wondering if there is now a critical mass for starting work on this topic.

Both RDF and property graph vendors are quite active in this space now. My perception is that property graph vendors are more recognized in the communities that need such functionalities, esp. AI.
Waiting too long to pick this up for RDF may mean losing the attention of e.g. AI developers who are now starting to look into graphs.

ktk commented 4 months ago

As a quick reminder on how this group works: Everyone can pick up one of the topics and make a concrete proposal in the form of a SEP, see https://github.com/w3c/sparql-dev/tree/main/SEP.

But from experience I can say that it needs at least 1-2 people per SEP who really want to get it done and spend the time on it. We have a few successful examples, such as when @afs and @Tpt created a SEP and then worked on implementations in both Jena and Oxigraph.

fsasaki commented 4 months ago

@ktk thanks for the reminder. My question was meant to see if somebody wants to pick this up (potentially jointly) :)

afs commented 4 months ago

Interested!

There are several dimensions for SPARQL enhancements.

One part of this may be to work on the standardization of call-out extensibility.

Free text search is an example here. There is a common general sense of what a text search involves, while each text search system has particular features and syntax details. So the choice is either to define a (another!) free text search syntax or to provide a flexible way to pass requests to text search systems.

What would be the requirements on a call-out interface to support LLMs? What about call-in?
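One way to make the call-out question concrete is a registry that maps a service IRI to a function from an incoming solution row to zero or more extra bindings. Everything below is a hypothetical sketch for discussion: `CallOutRegistry`, the `urn:example:textSearch` IRI and the fake search service do not correspond to any existing SPARQL engine API.

```python
from typing import Callable, Dict, Iterable, List

Binding = Dict[str, str]                       # variable name -> value
Service = Callable[[Binding], Iterable[Binding]]


class CallOutRegistry:
    """Maps service IRIs to callables that extend solution rows."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, iri: str, service: Service) -> None:
        self._services[iri] = service

    def call(self, iri: str, bindings: Iterable[Binding]) -> List[Binding]:
        """Join each input row with every row the service returns for it."""
        service = self._services[iri]
        results: List[Binding] = []
        for row in bindings:
            for extra in service(row):
                results.append({**row, **extra})
        return results


def fake_text_search(row: Binding) -> Iterable[Binding]:
    """Stand-in for an external text search or LLM service (made-up corpus)."""
    corpus = {"ex:Franvino": "dry red wine", "ex:Blanquito": "sweet white wine"}
    for iri, text in corpus.items():
        if row["query"] in text:
            yield {"hit": iri}


registry = CallOutRegistry()
registry.register("urn:example:textSearch", fake_text_search)
rows = registry.call("urn:example:textSearch", [{"query": "red"}])
print(rows)
```

The same shape would cover a text search engine, a vector index or an LLM endpoint; what a standard would need to pin down is how arguments and results map to variables, plus error and timeout behaviour.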
