langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

RdfGraph schema retrieval queries for the relation types are not linked by the correct comment variable #8907

Closed LorenzBuehmann closed 10 months ago

LorenzBuehmann commented 1 year ago

System Info

langchain = 0.0.251
Python = 3.10.11

Reproduction

  1. Create an OWL ontology called dbpedia_sample.ttl with the following:
    
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix wikidata: <http://www.wikidata.org/entity/> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix dul: <http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#> .
    @prefix : <http://dbpedia.org/ontology/> .

    :Actor a owl:Class ;
        rdfs:comment "An actor or actress is a person who acts in a dramatic production and who works in film, television, theatre, or radio in that capacity."@en ;
        rdfs:label "actor"@en ;
        rdfs:subClassOf :Artist ;
        owl:equivalentClass wikidata:Q33999 ;
        prov:wasDerivedFrom <http://mappings.dbpedia.org/index.php/OntologyClass:Actor> .

    :AdministrativeRegion a owl:Class ;
        rdfs:comment "A PopulatedPlace under the jurisdiction of an administrative body. This body may administer either a whole region or one or more adjacent Settlements (town administration)"@en ;
        rdfs:label "administrative region"@en ;
        rdfs:subClassOf :Region ;
        owl:equivalentClass <http://schema.org/AdministrativeArea>, wikidata:Q3455524 ;
        prov:wasDerivedFrom <http://mappings.dbpedia.org/index.php/OntologyClass:AdministrativeRegion> .

    :birthPlace a rdf:Property, owl:ObjectProperty ;
        rdfs:comment "where the person was born"@en ;
        rdfs:domain :Animal ;
        rdfs:label "birth place"@en ;
        rdfs:range :Place ;
        rdfs:subPropertyOf dul:hasLocation ;
        owl:equivalentProperty <http://schema.org/birthPlace>, wikidata:P19 ;
        prov:wasDerivedFrom <http://mappings.dbpedia.org/index.php/OntologyProperty:birthPlace> .


  2. Run

``` python
from langchain.graphs import RdfGraph

graph = RdfGraph(
    source_file="dbpedia_sample.ttl",
    serialization="ttl",
    standard="owl"
)

print(graph.get_schema)
```

  3. Output

    In the following, each IRI is followed by the local name and optionally its description in parentheses. 
    The OWL graph supports the following node types:
    <http://dbpedia.org/ontology/Actor> (Actor, An actor or actress is a person who acts in a dramatic production and who works in film, television, theatre, or radio in that capacity.),
    <http://dbpedia.org/ontology/AdministrativeRegion> (AdministrativeRegion, A PopulatedPlace under the jurisdiction of an administrative body. This body may administer either a whole region or one or more adjacent Settlements (town administration))
    The OWL graph supports the following object properties, i.e., relationships between objects:
    <http://dbpedia.org/ontology/birthPlace> (birthPlace, An actor or actress is a person who acts in a dramatic production and who works in film, television, theatre, or radio in that capacity.),
    <http://dbpedia.org/ontology/birthPlace> (birthPlace, A PopulatedPlace under the jurisdiction of an administrative body. This body may administer either a whole region or one or more adjacent Settlements (town administration)), <http://dbpedia.org/ontology/birthPlace> (birthPlace, where the person was born)
    The OWL graph supports the following data properties, i.e., relationships between objects and literals:

Expected behavior

The issue is that, in the SPARQL queries that retrieve the properties, the rdfs:comment triple pattern always refers to the variable ?cls, which obviously comes from copy/pasted code for the class queries.

For example, getting the RDFS properties via

    rel_query_rdf = prefixes["rdfs"] + (
        """SELECT DISTINCT ?rel ?com\n"""
        """WHERE { \n"""
        """    ?subj ?rel ?obj . \n"""
        """    OPTIONAL { ?cls rdfs:comment ?com } \n"""
        """}"""
    )

you can see that the OPTIONAL clause refers to ?cls, but it should be ?rel.

The same holds for all other queries regarding properties.
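For illustration, a corrected version of the RDFS query above might look like the following. This is only a sketch of the change described here, not necessarily how the fix was implemented; the `prefixes` dict is a minimal stand-in defined so that the snippet runs on its own.

``` python
# Stand-in for the module-level prefix table (assumption: only the entry
# needed for this query is defined here).
prefixes = {"rdfs": "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"}

# Same query as above, except that the OPTIONAL pattern is bound to ?rel,
# so a comment is only returned for the property it is attached to.
rel_query_rdf = prefixes["rdfs"] + (
    """SELECT DISTINCT ?rel ?com\n"""
    """WHERE { \n"""
    """    ?subj ?rel ?obj . \n"""
    """    OPTIONAL { ?rel rdfs:comment ?com } \n"""
    """}"""
)

print(rel_query_rdf)
```

The property queries for the other standards would need the same one-variable change, using whatever variable name each query selects for the property.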

The current behavior leads to a Cartesian product of the properties and all rdfs:comment values in the dataset, which can be very large and leads to misleading and huge prompts (see the output of my sample in the "Reproduction" section).
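The effect can be reproduced without a SPARQL engine. The sketch below mimics the two OPTIONAL patterns in plain Python on a simplified stand-in for the sample ontology; the tuple representation, the shortened comment strings, and the helper names are assumptions made for this illustration only.

``` python
# Simplified stand-in for dbpedia_sample.ttl as (subject, predicate, object) tuples.
TRIPLES = [
    (":Actor", "rdf:type", "owl:Class"),
    (":Actor", "rdfs:comment", "An actor or actress ..."),
    (":AdministrativeRegion", "rdf:type", "owl:Class"),
    (":AdministrativeRegion", "rdfs:comment", "A PopulatedPlace ..."),
    (":birthPlace", "rdf:type", "owl:ObjectProperty"),
    (":birthPlace", "rdfs:comment", "where the person was born"),
]


def object_properties(triples):
    """Subjects typed as owl:ObjectProperty, like `?op a owl:ObjectProperty`."""
    return sorted(
        {s for s, p, o in triples if p == "rdf:type" and o == "owl:ObjectProperty"}
    )


def buggy_rows(triples):
    """Mimics OPTIONAL { ?cls rdfs:comment ?com }: ?cls is not tied to the property."""
    all_comments = [o for _, p, o in triples if p == "rdfs:comment"] or [None]
    # Every property is paired with every comment in the dataset.
    return [(op, com) for op in object_properties(triples) for com in all_comments]


def fixed_rows(triples):
    """Mimics OPTIONAL { ?op rdfs:comment ?com }: only the property's own comments."""
    rows = []
    for op in object_properties(triples):
        own = [o for s, p, o in triples if s == op and p == "rdfs:comment"] or [None]
        rows.extend((op, com) for com in own)
    return rows


print(len(buggy_rows(TRIPLES)))  # 3: one row per comment in the dataset
print(fixed_rows(TRIPLES))       # [(':birthPlace', 'where the person was born')]
```

With one object property and three comments, the unbound variable yields three rows, which matches the three birthPlace entries in the output above. In general the row count grows as (number of properties) × (number of comments).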

felixocker commented 1 year ago

Thanks for creating the issue. PR #9136 should fix this.

I initially focused on RDF in the tests; it would be sensible to extend them to RDFS and OWL, though. Would you like to give this a go, e.g., based on the example OWL file you used? If yes, I am happy to support/review.
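One possible shape for such a regression check, sketched here without depending on langchain itself: scan the schema text for IRIs that are listed more than once. The helper name `duplicated_iris` and the shortened sample strings are hypothetical. The check is also only a heuristic, since a property that legitimately carries several comments would still be listed several times.

``` python
import re
from collections import Counter


def duplicated_iris(schema_text: str) -> dict[str, int]:
    """Return every <IRI> that occurs more than once in a schema string."""
    counts = Counter(re.findall(r"<([^<>\s]+)>", schema_text))
    return {iri: n for iri, n in counts.items() if n > 1}


# Shortened version of the object-property section reported in this issue.
buggy_schema = (
    "The OWL graph supports the following object properties:\n"
    "<http://dbpedia.org/ontology/birthPlace> (birthPlace, An actor or actress ...),\n"
    "<http://dbpedia.org/ontology/birthPlace> (birthPlace, A PopulatedPlace ...), "
    "<http://dbpedia.org/ontology/birthPlace> (birthPlace, where the person was born)\n"
)

# What the section would be expected to contain once the comment is bound
# to the property itself.
expected_schema = (
    "The OWL graph supports the following object properties:\n"
    "<http://dbpedia.org/ontology/birthPlace> (birthPlace, where the person was born)\n"
)

print(duplicated_iris(buggy_schema))     # {'http://dbpedia.org/ontology/birthPlace': 3}
print(duplicated_iris(expected_schema))  # {}
```

In an actual test, the schema string would come from an RdfGraph built on the example OWL file rather than from hard-coded text.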