What is the canonical way to write queries to knowledge graph

astroseger commented 3 months ago

Let's we have the following "knowledge graph"

(parent Tom Bob)
(parent Pam Bob)
(parent Tom Liz)
(parent Bob Ann)
(parent Bob Pat)
(parent Pat Jim)
(female Pam)
(male Tom)
(male Bob)
(female Liz)
(female Pat)
(female Ann)
(male Jim)

What is the canonical way to write the following request to this knowledge graph:

Who is parents of $x
Who is mother of $x
Who is sister of $x
Who is aunt of $x
etc...

(it would be great if aunt is somehow be defined via sister and parent_of)

It is possible to write the following implementation inspired by Prolog (@mvpeterson ):

(= (_parent $x $y) (match &self (parent $x $y) $x))
(= (_male $x) (match &self (male $x) $x))
(= (_female $x) (match &self (female $x) $x))

(= (_different $x $y) 
        ( if (== $x $y) 
            Empty
            $x
        )
)
(= (_sister $x $y) 
        ( let* ( ($q (_female $x)) 
                ($r (_parent (_parent $z $q) $y))
             ) 
             (_different $q $y) 
        )
)

(= (_aunt $x $y) (_sister $x (_parent $q $y)))

!(_aunt $x Jim)
; It will return correct answer [Ann]

But this implementation is extremely slow. So what would be the metta way here? @Necr0x0Der @vsbogd

astroseger commented 3 months ago

additional request: how to efficiently write the function "predecessor", which might be more difficult to wrap in a single match in comparison to "aunt".

The non-efficient, and rather slow implementation:

(parent Tom Bob)
(parent Pam Bob)
(parent Tom Liz)
(parent Bob Ann)
(parent Bob Pat)
(parent Pat Jim)
(parent Jim Lil)

(= (_parent $x $y) (match &self (parent $x $y) $x))
(= (_predecessor $x $z) (_parent $x $z))
(= (_predecessor $x $z) (_predecessor $x (_parent $y $z)))

!(_predecessor $x Lil)

noskill commented 3 months ago

This works much faster:

(= (_aunt $aunt $y) (match &self (, (parent $q $y)
                                 (female $aunt)
                                 (parent $p $q)
                                 (parent $p $aunt)) 

                                 (_different $aunt $q )
                                 )
                                 )

astroseger commented 3 months ago

This works much faster:

(= (_aunt $aunt $y) (match &self (, (parent $q $y)
                                 (female $aunt)
                                 (parent $p $q)
                                 (parent $p $aunt)) 

                                 (_different $aunt $q )
                                 )
                                 )

Yes. The single match should be much faster. The remaining questions are as following:

How to write _predecessor in this way?
Is it possible to somehow define "aunt" via "sister" and "parent"? To have some sort of composition?
How to most gracefully remove the unnecessary second argument for "aunt"?
And last but not least. Is it a canonical and proposed way to write such queries? I think this question is for @Necr0x0Der and @vsbogd .

noskill commented 3 months ago

@astroseger what is _predecessor?

mvpeterson commented 3 months ago

How to most gracefully remove the unnecessary second argument for "aunt"?

Why do you think it's unnecessary? How will you define who's aunt you're going to get without it? Or you mean the first argument?

astroseger commented 3 months ago

@astroseger what is _predecessor?

@noskill please see the second comment in this issue.

astroseger commented 3 months ago

How to most gracefully remove the unnecessary second argument for "aunt"?

Why do you think it's unnecessary? How will you define who's aunt you're going to get without it? Or you mean the first argument?

I meant that probably we should have one argument function get_aunt($x) which return the list of aunts for person $x.

mvpeterson commented 3 months ago

(= (_sister $x $y) ( let* ( ($q (_female $x)) ($r (_parent (_parent $z $q) $y)) ) (_different $q $y) ) )

The idea of using two arguments was to create a kind of relation between objects, i.e. a function should work in both ways:

;Who are Jim's aunts? (_aunt $x Jim)

;Who are Ann's nephews? (_aunt Ann $x)

Though it's not working in the current implementation

astroseger commented 3 months ago

(= (_sister $x $y) ( let* ( ($q (_female $x)) ($r (_parent (_parent $z $q) $y)) ) (_different $q $y) ) )

The idea of using two arguments was to create a kind of relation between objects, i.e. a function should work in both ways:

;Who are Jim's aunts? (_aunt $x Jim)

;Who are Ann's nephews? (_aunt Ann $x)

Though it's not working in the current implementation

Yes, I understand. But I think it is slightly misleading. For example what do we expect for _aunt(Bill_a_real_man_not_a_female $x) ? I think it is better to have two one argument functions.

mvpeterson commented 3 months ago

_aunt(Bill_a_real_man_not_a_female $x)

This should not reduce, or return Empty atom

astroseger commented 3 months ago

_aunt(Bill_a_real_man_not_a_female $x)

This should not reduce, or return Empty atom

I understand. But I think it is sligthly misleading. I propose to return to the main questions:

How to write _predecessor in this way?

Is it possible to somehow define "aunt" via "sister" and "parent"? To have some sort of composition?

How to most gracefully remove the unnecessary second argument for "aunt"? (@mvpeterson thinks it is not required)

And last but not least. Is it a canonical and proposed way to write such queries? I think this question is for Alexey and Vitaly.

mvpeterson commented 3 months ago

3. How to most gracefully remove the unnecessary second argument for "aunt"? (@mvpeterson thinks it is not required)

I would propose this

(= (_get_aunt $x) (match &self (, (parent $q $x) (parent $z $q) (parent $z $a) (female $a) ) (_different $a $q) ))

vsbogd commented 3 months ago

How to most gracefully remove the unnecessary second argument for "aunt"? (@mvpeterson thinks it is not required)

(= (_aunt $y) ...)

vsbogd commented 3 months ago

Is it possible to somehow define "aunt" via "sister" and "parent"? To have some sort of composition?

Nice idea! It is not possible at the moment. It can be emulated by introducing additional ...-query definitions and composing them in a final query. But it is not elegant and in this simple example looks excessive:

(= (parent-query $x $y) (parent $x $y))
(= (_parent $x $y) (match &self (parent-query $x $y) $x))
(= (_aunt $aunt $y) (match &self (, (parent-query $q $y) ...)

vsbogd commented 3 months ago

How to write _predecessor in this way?

There was plan to add "union" into query language then predecessor would look like (union (_parent $y $z) (_predecessor $x (_parent $y $z)))

vsbogd commented 3 months ago

And last but not least. Is it a canonical and proposed way to write such queries? I think this question is for Alexey and Vitaly.

In context of MeTTa "canonical" to me is a way to right down some knowledge in a way which allows converting it into any form which can be useful for the further inference. To me it looks like both implementations yours and @mvpeterson write the knowledge in a good enough way. Analyzing queries a bit simple than analyzing matches but difference is not critical to me.

astroseger commented 3 months ago

And last but not least. Is it a canonical and proposed way to write such queries? I think this question is for Alexey and Vitaly.

In context of MeTTa "canonical" to me is a way to right down some knowledge in a way which allows converting it into any form which can be useful for the further inference. To me it looks like both implementations yours and @mvpeterson write the knowledge in a good enough way. Analyzing queries a bit simple than analyzing matches but difference is not critical to me.

But the implementation with nested match (see first comment in this issue) is highly inefficient with the current metta interpreter. So I assume we should not put it into tutorial.

astroseger commented 3 months ago

How to write _predecessor in this way?

There was plan to add "union" into query language then predecessor would look like (union (_parent $y $z) (_predecessor $x (_parent $y $z)))

I'm not sure, but it seems it will have multiple matches as well, so it might have the same performance issue as the initial implementation (see the second comment to this issue).

In order to efficiently implement _predecessor we need some sort of graph walk. So very roughly speaking the complexity to go from person to his/her parents should be O(1), not O(N) as in implementation with multiple matches.

noskill commented 3 months ago

@astroseger

I rewritten the code into more compositional form, please check if it's good enough. Works reasonably fast on a KB with 100 families (5 seconds on my computer)

(: and-seq (-> Atom Atom Bool))
(= (and-seq $first $second)
    (if $first $second False))

(= (_parent $x $y) (unify &self (parent $x $y) True False))
(= (_male $x) (unify &self (male $x) True False))
(= (_female $x) (unify &self (female $x) True False))
(= (_mother $x $y) (and-seq
                            (_parent $x $y)
                            (_female $x)))
(= (_father $x $y) (and-seq
                            (_parent $x $y)
                            (_male $x)))

(= (_sister $x $y) 
    (and-seq
            (_female $x)
            (and-seq 
                (_parent $p $y)
                (and-seq (_parent $p $x)
                         (not (eq $x $y))
                )
            )
    )
)

(= (same $X $X) True)
(= (eq $X $Y)
    (let $C (same $X $Y) (if (== $C True) True False)))

(= (_aunt $aunt $y)
    (and-seq 
        (_parent $p $y)
        (_sister $aunt $p))
)

!(let $r (_aunt $x Jim0) (if $r $x (empty)))

noskill commented 3 months ago

predecessor predicate

(= (_predecessor $x $z) (_parent $x $z))
(= (_predecessor $x $z) (and-seq (_parent $p $z)
                                 (_predecessor $x $p)))

astroseger commented 3 months ago

@astroseger

I rewritten the code into more compositional form, please check if it's good enough. Works reasonably fast on a KB with 100 families (5 seconds on my computer)

(: and-seq (-> Atom Atom Bool))
(= (and-seq $first $second)
    (if $first $second False))

(= (_parent $x $y) (unify &self (parent $x $y) True False))
(= (_male $x) (unify &self (male $x) True False))
(= (_female $x) (unify &self (female $x) True False))
(= (_mother $x $y) (and-seq
                            (_parent $x $y)
                            (_female $x)))
(= (_father $x $y) (and-seq
                            (_parent $x $y)
                            (_male $x)))

(= (_sister $x $y) 
    (and-seq
            (_female $x)
            (and-seq 
                (_parent $p $y)
                (and-seq (_parent $p $x)
                         (not (eq $x $y))
                )
            )
    )
)

(= (same $X $X) True)
(= (eq $X $Y)
    (let $C (same $X $Y) (if (== $C True) True False)))

(= (_aunt $aunt $y)
    (and-seq 
        (_parent $p $y)
        (_sister $aunt $p))
)

!(let $r (_aunt $x Jim0) (if $r $x (empty)))

predecessor predicate

(= (_predecessor $x $z) (_parent $x $z))
(= (_predecessor $x $z) (and-seq (_parent $p $z)
                                 (_predecessor $x $p)))

This almost for sure will have the same performance issue as an initial code. Any implementation with multiply matches will have problem problems, if I understand correctly. @vsbogd am I right?

vsbogd commented 3 months ago

Any implementation with multiply matches will have problem problems, if I understand correctly. @vsbogd am I right?

I don't see any performance issue with Anatoly's code above. First query will find all parents, second fill find all parents of the parents. It is a graph traversal implementation.

astroseger commented 3 months ago

@vsbogd @noskill about performance issues. I think there is some misunderstanding. I'm not sure exactly how the match is implemented, but in the worst case each match in the query has O(N) complexity. If it is true then the query with 4 matches will have O(N^4) complexity. It is difficult to verify just because execution time is way too long. Let me demonstrate my point with experiments.

Let's take the following base:

(parent Pam0 Bob0)
(parent Tom0 Liz0)
(parent Bob0 Ann0)
(parent Bob0 Pat0)
(parent Pat0 Jim0)
....
(parent Tom9 Bob9)
(parent Pam9 Bob9)
(parent Tom9 Liz9)
(parent Bob9 Ann9)
(parent Bob9 Pat9)
(parent Pat9 Jim9)
(female Pam9)
(male Tom9)
(male Bob9)
(female Liz9)
(female Pat9)
(female Ann9)
(male Jim9)

So we simply take the initial knowledge base 10 times. We calculate the following expression

!(_aunt $x Jim0)

Let's calculate the execution time for three implementation of aunt function.

Initial implementation from @mvpeterson (https://github.com/trueagi-io/metta-examples/issues/40#issue-2183861018) - 24.610s
First Implementation from @noskill with one match (https://github.com/trueagi-io/metta-examples/issues/40#issuecomment-1994615594) - 0.499s
Second implementation from @noskill with compositionality but with mulitply matches (https://github.com/trueagi-io/metta-examples/issues/40#issuecomment-1999152522) - 36.029s

I'm afraid for large knowledge graphs the time for "1" and "3" will grow as O(N^3) (because it contains 3 matches).

@vsbogd I think it also explains why there is a performance issue with _predecessor. See also my comments in here, about what I think we need for efficient implementation of _predesessor https://github.com/trueagi-io/metta-examples/issues/40#issuecomment-1997715328

noskill commented 3 months ago

@astroseger

16 families = 0,942 32 families = 0m1,801s 64 families = 3,545s 128 families = 7,583s 256 families = 15,412s

So 2x data = 2x more computations, it's O(1) complexity, no?

astroseger commented 3 months ago

@astroseger

16 families = 0,942 32 families = 0m1,801s 64 families = 3,545s 128 families = 7,583s 256 families = 15,412s

So 2x data = 2x more computations, it's O(1) complexity, no?

What are you testing? From number it is your implementation with one match. I'm talking about problem with multiple matches.

noskill commented 3 months ago

I am testing on files generated with this code:

def generate_parents(N):
    for i in range(N):
        print(f"(parent Tom{i} Bob{i})")
        print(f"(parent Pam{i} Bob{i})")
        print(f"(parent Tom{i} Liz{i})")
        print(f"(parent Bob{i} Ann{i})")
        print(f"(parent Bob{i} Pat{i})")
        print(f"(parent Pat{i} Jim{i})")
        print(f"(female Pam{i})")
        print(f"(male Tom{i})")
        print(f"(male Bob{i})")
        print(f"(female Liz{i})")
        print(f"(female Pat{i})")
        print(f"(female Ann{i})")
        print(f"(male Jim{i})")

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('N', type=int, help='Size')
args = parser.parse_args()

N = args.N
generate_parents(N)
s = """

(: and-seq (-> Atom Atom Bool))
(= (and-seq $first $second)
    (if $first $second False))

(= (_parent $x $y) (unify &self (parent $x $y) True False))
(= (_male $x) (unify &self (male $x) True False))
(= (_female $x) (unify &self (female $x) True False))
(= (_mother $x $y) (and-seq
                            (_parent $x $y)
                            (_female $x)))
(= (_father $x $y) (and-seq
                            (_parent $x $y)
                            (_male $x)))

(= (_sister $x $y) 
    (and-seq
            (_female $x)
            (and-seq 
                (_parent $p $y)
                (and-seq (_parent $p $x)
                         (not (eq $x $y))
                )
            )
    )
)

(= (same $X $X) True)
(= (eq $X $Y)
    (let $C (same $X $Y) (if (== $C True) True False)))

(= (_aunt $aunt $y)
    (and-seq 
        (_parent $p $y)
        (_sister $aunt $p))
)

!(let $r (_aunt $x Jim0) (if $r $x (empty)))
"""

print(s)

noskill commented 3 months ago

I generated aunt2.metta ..aunt256.metta

python3 generate_parents.py 256 > aunt256.metta

and measured the execution time.

astroseger commented 3 months ago

I am testing on files generated with this code:

def generate_parents(N):
    for i in range(N):
        print(f"(parent Tom{i} Bob{i})")
        print(f"(parent Pam{i} Bob{i})")
        print(f"(parent Tom{i} Liz{i})")
        print(f"(parent Bob{i} Ann{i})")
        print(f"(parent Bob{i} Pat{i})")
        print(f"(parent Pat{i} Jim{i})")
        print(f"(female Pam{i})")
        print(f"(male Tom{i})")
        print(f"(male Bob{i})")
        print(f"(female Liz{i})")
        print(f"(female Pat{i})")
        print(f"(female Ann{i})")
        print(f"(male Jim{i})")

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('N', type=int, help='Size')
args = parser.parse_args()

N = args.N
generate_parents(N)
s = """

(: and-seq (-> Atom Atom Bool))
(= (and-seq $first $second)
    (if $first $second False))

(= (_parent $x $y) (unify &self (parent $x $y) True False))
(= (_male $x) (unify &self (male $x) True False))
(= (_female $x) (unify &self (female $x) True False))
(= (_mother $x $y) (and-seq
                            (_parent $x $y)
                            (_female $x)))
(= (_father $x $y) (and-seq
                            (_parent $x $y)
                            (_male $x)))

(= (_sister $x $y) 
    (and-seq
            (_female $x)
            (and-seq 
                (_parent $p $y)
                (and-seq (_parent $p $x)
                         (not (eq $x $y))
                )
            )
    )
)

(= (same $X $X) True)
(= (eq $X $Y)
    (let $C (same $X $Y) (if (== $C True) True False)))

(= (_aunt $aunt $y)
    (and-seq 
        (_parent $p $y)
        (_sister $aunt $p))
)

!(let $r (_aunt $x Jim0) (if $r $x (empty)))
"""

print(s)

It seems you are using old rust implementation. It is true that old metta does not have this issue. I'm not sure how it manages to execute multiple matches in linear time, maybe because of caching (which causes some problems by the way). But what I'm talking about is the minimal metta, sorry for misunderstanding.

astroseger commented 3 months ago

@noskill @vsbogd I agree that in principle the complexity of query with three nested matches might be only O(3*N) not O(N^3), for such a structure for a graph. But execution time tells another story.

It is execution times for the first implementation of aunt, for different number of families (numbers for size 10 were discussed here https://github.com/trueagi-io/metta-examples/issues/40#issuecomment-1999350917).

1 6.863
2 11.313
3 15.766
4 20.521
10 52.280
20 118.611
40 304.049
80 1311.03
160 3407.684

You see that it is not quite O(N^3) but not linear for large N as well. And N=160 is not a huge size of graph.

I'm talking about minimal metta. For rust version numbers are much smaller.

noskill commented 3 months ago

i got with minima metta:

16 10 32 26 64 75 128 248 256 1025

noskill commented 3 months ago

I fit a parabola with these numbers 14.15686 - 0.2071869x + 0.01622045 x**2

with x=512 estimation is 4160 and real time 4142

noskill commented 3 months ago

Also metta took 8GB of RAM on 512 families

astroseger commented 3 months ago

I fit a parabola with these numbers 14.15686 - 0.2071869x + 0.01622045 x**2

with x=512 estimation is 4160 and real time 4142

So good news that it is not O(N^3) :)

astroseger commented 3 months ago

@vsbogd @noskill I have a very surprising result. Now I can say that I completely do not understand the reason why the “_aunt” (and "_sister" as well) query takes so long. I've tried to measure time for recursive _predecessor defined in https://github.com/trueagi-io/metta-examples/issues/40#issuecomment-1994512403 . And it is not quadratic, it is not linear it seems to be close to O(1) ! So the performance problem is not directly caused by the nested match.

I have the following metta file (example for N=10):

(parent g1_0 g2_0)
(parent g2_0 g3_0)
(parent g3_0 g4_0)
(parent g4_0 g5_0)
....
(parent g1_9 g2_9)
(parent g2_9 g3_9)
(parent g3_9 g4_9)
(parent g4_9 g5_9)

(= (_parent $x $y) (match &self (parent $x $y) $x))
(= (_predecessor $x $z) (_parent $x $z))
(= (_predecessor $x $z) (_predecessor $x (_parent $y $z)))

!(_predecessor $x g5_0)

The execution time for different N:

1  1.351
10 1.358
100 1.357
1000 1.562
10000 4.033
100000 33.215

If we calculate the time for simply reading metta file with N=100000 (by replacing the call of _predessesor with print) we will get 28.968...

Adam-Vandervorst commented 3 months ago

Some real ancestor graphs

I took the JSONs, converted them in a generic fashion to MeTTa files, used a MeTTa script to convert that into the above format, added @astroseger implementation, and added some simplified implementations: metta-examples/aunt-kg

Note this is similar to my work in metta-examples/traverse

This basic stuff also works in FormalMeTTa: FormalMeTTa/aunt-kg

CZ2 is built for this kind of stuff, and it executes every query on every person in the popular royal92 ancestral graph in 100ms. It took some more effort to write the queries, but it is able to mimic custom graph databases with optimal traversals, and in theory, we can generate that from a match statement: CZ2 spacewide ops code and complexity analysis CZ2 benchmark results

vsbogd commented 3 months ago

I've tried to measure time for recursive _predecessor defined in #40 (comment) . And it is not quadratic, it is not linear it seems to be close to O(1)

The time of the match doesn't depend on the size of the base in this simplistic case when query result can be found using prefix. Thus it is expected that time of the _predecessor doesn't depend on a size of irrelevant data but depend on a number of predecessors in a family tree.

Complexity of the original example https://github.com/trueagi-io/metta-examples/issues/40#issue-2183861018 is big because of complexity of the _sister:

(= (_sister $x $y) 
  (let* (
    ($q (_female $x))
    ($r (_parent (_parent $z $q) $y))
  ) (_different $q $y) ))

It is called with assigning specific value to $y. First condition in let is (_female $x) it means we are collecting all females from knowledge base and it gives us * N to complexity. Then (_parent $z $q) gives us * 2, and finally we checking all parents of all females against parents of the person under the question.

For example re-writing it in this way:

(= (_parent $x $y) (match &self (parent $x $y) $x))
(= (_child $x $y) (match &self (parent $x $y) $y))
(= (_female $x) (match &self (female $x) $x))
(= (_different $x $y) ( if (== $x $y) Empty $x))
(= (_sister $x $y)
  (let* (
    ($r (_parent $_ $y))
    ($q (_child $r $__))
    ($q (_female $q))
  ) (_different $q $y) ))

!(_aunt $x Jim0)

gives us complexity which depends only on a number of siblings of $y:

N=1 time=2.639s
N=10 time=2.634s

vsbogd commented 3 months ago

CZ2 benchmark results

@Adam-Vandervorst do you have benchmark results for the sergey_rodionov_formulation.metta on the databases of different sizes?

vsbogd commented 3 months ago

I really appreciate idea to represent such metta code as a composition of the queries and thus be able to optimize it on the atomspace level. In general to make querying code more effective one should prefer the order of queries which decreases amount of data to be processed after each query. E.g. _child | _female instead of _female | _child.

Adam-Vandervorst commented 3 months ago

@vsbogd I let it run for over a day, but then I canceled the HE benchmark of royal92. On the Lord of the Rings graph, HE took 55 minutes to complete the basic_formulation, IIRC. I was running these tests manually, but I will slightly change the tests so they're easier to do automatically.

Specifically, I was looping over all people and all queries, but I can rewrite it in a compound query that says (= (person) (match &self (male $x) $x)); (= (person) (match &self (female $x) $x)); and chain that into the different ancestor relations.

vsbogd commented 3 months ago

For example re-writing it in this way:

One more suggestion for the code in the description of the issue. Using unify one can use single predicate _parent to return both parents and children via arguments. For example _aunt version using unify:

(= (_parent $x $y) (unify &self (parent $x $y) $x Empty))
(= (_female $x) (unify &self (female $x) $x $Empty))
(= (_different $x $y) ( if (== $x $y) Empty $x))
(= (_sister $x $y) ( let* (($r (_parent $r $y)) ($r (_parent $r $q)) ($q (_female $q)) ) (_different $q $y) ))
(= (_aunt $x $y)  (_sister $x (_parent $q $y)))

!(_aunt $x Jim0)

vsbogd commented 3 months ago

@vsbogd I let it run for over a day, but then I canceled the HE benchmark of royal92. On the Lord of the Rings graph, HE took 55 minutes to complete the basic_formulation, IIRC.

Thanks @Adam-Vandervorst , I asked because it is interesting if CZ2 can optimize complexity in case of ineffective query like _female filtered by _parent. I think the simplest way to find out is to run examples provided by @astroseger and see how time changes with N growing.

noskill commented 3 months ago

indeed moving (_female $x) to the last condition helps:

(= (_sister $x $y) 
    (and-seq
            ;  (_female $x)
            (and-seq 
                (_parent $p $y)
                (and-seq (_parent $p $x)
                         (not (eq $x $y))
                )
            )
            (_female $x)
    ))

Adam-Vandervorst commented 3 months ago

Working on it @vsbogd ! The intuition for why CZ2 is able to optimize (beyond just the complexity, also taking into account the effective occurrences) is that it can access the counts of prefix compressed occurrences. This is definitely not what differentiates it. In fact, a good optimization for HE would be to exploit the Trie index to provide the same knowledge. E.g. getting the total count of the star * in (female *) is "free".

Adam-Vandervorst commented 3 months ago

Three benchmarks are still running, but here are the results that are already in @vsbogd https://github.com/Adam-Vandervorst/metta-examples/tree/main/aunt-kg

trueagi-io / metta-examples

What is the canonical way to write queries to knowledge graph #40