opencog / atomspace

The OpenCog (hyper-)graph database and graph rewriting system
https://wiki.opencog.org/w/AtomSpace
Other
812 stars 228 forks source link

Stop using SetLink for search results! #1502

Closed linas closed 2 years ago

linas commented 6 years ago

SetLink has a number of technical issues with respect to the core concept of atoms and how atoms should be used. Basically, SetLink is a bad citizen. The only answer I can find is to simply excommunicate, banish SetLink from the Kingdom. The correct solution is to use MemberLink instead. (See however, other ideas below.)

Unfortunately, BindLink and GetLink, when executed, return SetLink's containing the search results. This needs to be replaced by MemberLink, and specifically, I think the following form might work best:

MemberLink
    SomeSearchResultAtom
    AnchorNode "pattern-9a30bd4cbdcd6"

and then have cog-execute! return the AnchorNode -- all search results can then be found, attached to this anchor.

Q: Why is SetLink bad? A: See the wiki page for it. https://wiki.opencog.org/w/SetLink

Q: Why AnchorNode? A: It seems like the best kind of node for this job.

Q: What other benefits (besides avoiding SetLink)? A1: Search results can be reported incrementally, as the search proceeds, instead of waiting for one big glob to appear at the end.

A2: The SetLinks that get returned glom up the atomspace, and people usually forget to delete them. They are hard to delete - you have to actually have them in your hand to delete them - searching for them later on is really hard or impossible. By contrast, finding and deleting the AnchoreNodes is easy, and can be done at any time.

A3: Its a baby-step to converting the AtomSpace itself to a special kind of MemberLink and/or ContextLink. Having to pass the AtomSpace around in various methods is a real pain, and causes a lot of programming issues. It would be nice to just get rid of the atomspace entirely, and instead declare membership as:

AtomSpaceLink
    AtomSpaceNode "primary atomspace"
    SomeAtom

This or something like this could solve a lot of technical problems, and nuking SetLink is a natural first step.

I believe that this will simplify creating a distributed atomspace. We could layer everything on top of (for example Redis or Riak and have that do all the work of distributing atoms. I don't believe this any longer. The StorageNode is a superior solution for distributed atomspaces.

A4: If we do the above or something like it, we could merge MapLink/BindLink, since they both do almost the same thing, just that MapLink does it for its arguments, while BindLink does it for the entire atomspace. Similarly GetLink/FilterLink...

Q: Is this a lot of work? A: You bet. BindLinks are used all over the place.

ngeiswei commented 6 years ago

👍 coincidentally I could use "the AtomSpace itself to a special kind of MemberLink" for implementing an elegant as well as efficient URE-based pattern miner (EDIT: along with outputting pattern matcher results with member links, of course).

linas commented 6 years ago

List of distributed DB alternatives:

Ignite vs Redis vs Riak: https://db-engines.com/en/system/Ignite%3BRedis%3BRiak+KV

After a super-quick skim, Ignite seems to have a better feature-set, Redis is more popular and thus stable.

This now appears to be off-topic. The StorageNode provides a superior solution. See the wiki entry for it.

linas commented 6 years ago

See also #1507

linas commented 5 years ago

See #1967 for details on AtomSpaceLink

linas commented 5 years ago

To avoid cluttering the atomspace with "useless" anonymous SetLinks, one could implement a SetLink so that, when it has no incoming links, i.e. has been orphaned, then it is also removed from the atomspace. This will avoid a lot of the general clutter associated with anonymous SetLinks (they are anonymous, because they are un-named. Since there is no name, there is no way to refer to them. If they don't have any incoming set, then they really are unreferenced, and thus kind-of dead.)

noskill commented 5 years ago

It is possible to pass child atomspace to execute_atom thus avoiding cluttering

linas commented 5 years ago

There's no mechanism for that. If one has (cog-evaluate! (SequentialAnd ... (Put .... (State ... (Get ... (SequentialOr ... (Delete ... (Bind ... there's no way to fiddle in some other atomspace into that sequence. If the top-level SequentialAnd is tail-recursive, its an infinite loop, and never returns. There's no chance to sneak in there and insert some other atomsapce, or clean things up.

linas commented 5 years ago

Just to be clear: the above example actually is "realistic"; its how the robot behaviors were being done some years ago. Its actually quite natural, and fits well with time-line editors, behavior trees, drag-n-drop GUI's etc. You just "connect the dots" to get things done, and all of that nesting was just a way of flowing data through the graph.

Now that we have Values, we have an alternative way of flowing data through the graph...

noskill commented 5 years ago

There is mismatch between scheme and python api. Python api requires to pass atomspace to cog-execute and cog-evaluate. We are using it a lot. I made context manager to conveniently create and set as default child atomspaces to define query in temporary atomspace too: https://github.com/singnet/semantic-vision/blob/2681f93cf12ee0e5c1a53faef28a815d9e9c857c/experiments/opencog/cog_module/module.py#L138-L148

linas commented 5 years ago

The python api is abandonware and I'm exhausted trying to fix the infinite number of bugs in it. I want to punch the people who created it, they have caused me huge amounts of pain.

You can certainly do temporary work in child atomspaces - that is what they were created for. However, at some point, you have to move the desired results back into the main atomspace. (since the main atomspace is what you are saving to disk, when you power-off; its also typically the atomspace you share with other machines, when doing distributed processing.) ... In principle you could save child atomspaces also, but that is currently not implemented in the postgres backend. (and since the distributed atomspace depends on postgres, you also cannot have a distributed-child-atomspace)

Anyway, the whole point of the (cog-evaluate! (SequentialAnd ... (Put .... (State ... (Get ... (SequentialOr ... (Delete ... (Bind ... example is to remind you that, in general, you should NOT be coding in python. And you should NOT be coding in scheme. You should, in general, think "how can I represent my data flow with atoms, so that all data and flows and processing happen in the atomspace". or even better: "how can I represent my data flow with atoms, in such a way that PLN and the chainers, and the pattern miner and moses can understand my data flow, learn it, adjust it, alter it". Since PLN, the chainers, the pattern miner don't know how to read/write python code, any python code that you write is "invisible" or "inaccessible" to these subsystems.

noskill commented 5 years ago

Just to be clear: the above example actually is "realistic"; its how the robot behaviors were being done some years ago.

Do you still use this approach? Is there example of implementing main loop in atomese? I would certainly look more into it to use this approach in the future.

Code that i linked is intended to be released for general machine learning community. It is kind-of code generator for building neural networks during inference along the lines of https://github.com/opencog/atomspace/issues/1970. It's quite hard to write in atomese, at least harder than in general purpose programming language. We don't even have a language tutorial yet. That's why main loop is in python and not in atomese. There is example of training on sum of two digits from mnist https://github.com/singnet/semantic-vision/blob/2681f93cf12ee0e5c1a53faef28a815d9e9c857c/experiments/opencog/cog_module/mnist.py atomese describes probability of getting correct sum.

bottom line I think code generation is the way to use atomese, at least for wider audience. And perhaps i can make something like scan operator in theano http://deeplearning.net/software/theano/library/scan.html to make loops over datasets.

noskill commented 5 years ago

It seems we need atomese operators to manipulate atomspaces e.g. create and delete child atomspaces. Then it will be easier to port some code from python/schema to atomese

linas commented 5 years ago

Do you still use this approach?

Yes. Hit bug opencog/opencog#3433 only a week ago, when I broke tail recursion during the cleanup.

Is there example of implementing main loop in atomese?

Yes. /examples/atomspace/recursive-loop.scm

linas commented 5 years ago

Code that i linked is intended to be released for general machine learning community. It is kind-of code generator for building neural networks during inference along the lines of #1970.

Well, as I stated repeatedly there, I think that using the grounded schema node is a deep and fundamental mistake. Last summer, during very long and detailed email conversations with @Necr0x0Der Alexey Potapov, we worked out, in considerable detail, exactly how to map neural nets into the atomspace. I was rather unhappy that you decided to throw these plans away, without any discussion, without any warning, for no particular reason, and create something completely new, different and incompatible with any sort of long-term roadmaps. I still do not understand why the original plan for neural nets was thrown away. I think it was a decent plan, a good plan. And I'm still deeply unhappy with the GroundedSchema design, for all the reasons that I already explained.

It's quite hard to write in atomese, at least harder than in general purpose programming language.

Atomese is not supposed to be written by human beings. That's the whole point. Only code generators should read and write Atomese. Not humans.

This means that all of the code generators need to be compatible with each-other. The original neural-net design was compatible with the existing code generators (the chainers, the miner, PLN, openpsi, ghost) The grounded-schema design is incompatible with the code generators.

We don't even have a language tutorial yet

For which language? There are 27 tutorials for atomese in /examples/atomspace and 25 more in /examples/pattern-matcher If you actually go through and study all fifty tutorials for Atomese, you shoul have a pretty good idea of how things work. If you skip the tutorials .. then, of course, things will be confusing.

linas commented 5 years ago

It seems we need atomese operators to manipulate atomspaces e.g. create and delete child atomspaces. Then it will be easier to port some code from python/schema to atomese

Can you open a new issue to discuss this topic? I don't understand why this is needed.

noskill commented 5 years ago

https://github.com/opencog/atomspace/issues/2105 - new issue

linas commented 5 years ago

OK, I'm thinking that now is a good time to move forward on this. (Yes, @noskill the temp-atomspace idea works too; but there are also other reasons to change the API too -- e.g. parallelism) I think I found a simple solution. Create a new link, call it QueryLink (or SearchLink ??) that is almost identical to the current BindLink. The difference is that it returns nothing.

To provide backwards-compat, every (BindLink P Q) will be treated as (QueryLink P (MemberLink Q (AnchorNode "results"))). Then, when the QueryLink completes, the results will be gathered up in a SetLink and returned in backwards-compat fashion.

New users will be encouraged to use QueryLink instead of BindLink

By running QueryLink in it's own thread i.e. (ParallelLink (QueryLink ...)) the results will accumulate in real-time, so the user does not have to wait for all results before starting work on the first one... If user wants to know when all results have been found, then (JoinLink (QueryLink ...)) can be used.

noskill commented 5 years ago

examples and language tutorial/introduction are different things. Otherwise you would name that directory "tutorial" and not examples. I personally look for information in opencog wiki, and update it sometimes.

And actually our design is somewhat compatible with pln. I need to rewrite Implication and fuzzy conjunction rules yet. This in turn will give up learnable weights for links. Currently there is not way to backpropagate error to these weights.

linas commented 5 years ago

examples and language tutorial/introduction are different things

The examples directory is the de-facto best, clearest and most complete language tutorial. We could trivially rename that directory "tutorials", if that makes you happier. The "tutorials" on the wiki are in terrible condition. First, out of the 50-or-so examples/tutorials in the examples directory, the wiki pages cover maybe 3 or 4 of them. Those were written by someone who was trying to learn the atomspace for the first time ever. And worse -- they had no long-term plan to actually do anything with the atomspace, so no motivation to make the tutorials "meaningful", illustrating how certain kinds of problems can be solved in certain ways in the atomspace. The net result is that the quality of the "tutorials" on the wiki is very low.

linas commented 5 years ago

our design is somewhat compatible with pln.

You've consistently evaded answering the question: why did you not follow the original design? Why did you create a completely different design? There would seem to be two reasons: (1) someone told you to explicitly not follow the original design, and to do it differently, or (2) no one provided you any advice or guidance at all, threw you in the deep end, and hoped you learned how to swim.

Right now, I'm thinking it's option (2), which is a bit unfortunate.

noskill commented 5 years ago

I know about original design only from your description in issue https://github.com/opencog/atomspace/issues/1970. Not sure if Alexey knows more, he said that there was some discussion concerning task of semantic retrieval https://arxiv.org/abs/1806.06946.. Currently we have a bit different desiderata e.g. backpropagation of error to link weights, which leads to different design choices.

Anyway now i think it's too early to make some final design. It's better to keep opencog easily extendible, so that we could easily try out different ways to integrate differentiable models with opencog. You said that @misgeatgit is working on the implementation of your design. Is there any progress?

linas commented 5 years ago

Hmm. OK, not sure where to go from here.

keep opencog easily extendible,

I want to keep "opencog" and "atomspace" distinct. For opencog, anything goes, I don't really care. For the atomspace, I want to keep it far more focused and under control. Like, when you're on an airplane, they don't let you into the cockpit to randomly experiment with turning knobs and pulling levers.

For the atomspace, I really want to think of it as a "knowledge representation system", and explore the best ways of doing that. The biggest recent innovation is making a clear-cut distinction between atoms and values. The atoms represent graphs with "the usual" term-algebra-type-theoretic trimmings. They're necessarily a bit slow and bloated for what they do. The "values" are meant for fast, rapidly-changing data.

The key idea of values was to do computation-but-not-in-the-atomspace. Thus, some graph, held in the atomspace, represents the "general knowledge of what we know about the world" -- for example, how different layers in some neural net are hooked to one-another. (which is "knowledge" that is approximately static and non-changing) The actual computations -- the back-propagation, the hill-climbing -- all that happens on some GPU's, somewhere else. Not in the atomspace. If you want to take a sample of what the neural net is doing, a kind of "what's been happening recently?" you'd do it with a Value. The value would have some code that knows how to talk to some GPU or whatever (or talk to tensorflow, etc.) and say "what are you doing? Give me your latest data"

A simpler example would be a live camera video stream -- we are never going to be shoving pixel data into the atomspace. But if, for some reason, you need to ask, "what is the color of pixel (539, 847)?" you'd go to (ConceptNode "my video camera") and then use the key (PredicateNode "raw pixel feed") to find a FloatValue that has "color of pixel (539, 847)" in it. As long as you do this less than a few (ten-)thousand times a second, it can work. That's because the 50-mega-bit-per-second video-feed is happening somewhere else, not in the atomspace. The Values are like URL's to find that data, when needed.

The example /examples/atomspace/stream.scm was created to explicitly show how to do that.

What Misgana was doing/partly did was to implement this idea for 3D positions of physical objects in the spacetime server. So, 3D data comes from some external source: typically, some ROS node hooked up to video cameras doing some kind of object detection/recognition/whatever. The resulting data stream, now filtered down to 60-frames-per-second of floating-point (x,y,z) data for a few dozen/few hundred objects -- that data goes into the spacetime server (its still too much data for the atomspace, the bandwidth is still too high.). But that data can be queried with Values.

So, if you ask the robot "Is Anatoly standing to the left, or to the right of Vitaly?", that sentence gets converted into a bunch of Atoms, and somewhere there is a (ReferenceLink (Concept "Anatoly") (Concept "object 42 in the spacetime server")) (this is pseudo-code), There are also some atoms like (EvaluationLink (PredicateNode "is to the left of") (Variable X) (Variable Y)) that Misgana has created. When evaluated, they obtain FloatValue of x,y,z position data from the space server, do some math, and return true/false. The natural language subsystem converts this back to a spoken answer.

Since you are only asking "Is Anatoly standing to the left, or to the right of Vitaly?" no faster than once every few seconds, the compute power needed to do this is reasonable. You only need to evaluate (PredicateNode "is to the left of") once every few seconds...

OK, its more complicated than that, and I skimmed over the details, so the above is kind-of wrong, but it should give a flavor. I'm not sure where the detailed design is, just right now. The general point is that the atomspace knows language, and knows things about (Concept "Anatoly") but it is not trying to do 60-frames-a-second updates of Anatoly's physical position, right now. The Values do that. I was hoping that we would do something similar with the neural nets

Yes, some (many?) of the design details are not clear. I was hoping to be able to discuss and clarify these over time. Yes, the atomspace can mostly almost do this now (see the /examples/atomspace/stream.scm example) but more sophisticated prototypes are needed. Yes, we will need to fix/enhance/change things in the atomspace to get this working right. But, just like the airplane cockpit, I don't want random uncontrolled experiments in the atomspace.

Anyway, all this belongs in some other issue, not this one.

linas commented 5 years ago

concerning task of semantic retrieval https://arxiv.org/abs/1806.06946.. Currently we have a bit different desiderata e.g. backpropagation of error to link weights, which leads to different design choices.

Can you open a distinct issue to discuss this?

noskill commented 5 years ago

Thanks Linas, we'll get in contact with Misgana for knowledge sharing.

noskill commented 5 years ago

I agree on what are you saying about distinction between values and atoms. We did not use any new atom types yet. Vitaly implemented ptrvalue, so that pyobject can be attached to atom. Everything neural network related is stored in values. This value type is incompatible with other atomese datatype, but that is the price of backpropagation. It can be converted to float if needed.

linas commented 5 years ago

that is the price of backpropagation

Well, I think you understand what I've been trying to say. Perhaps its time for me to review the ptrvalue thing again. Are you saying that you're actually doing backpropagation in the atomspace itself ? I was assuming that you're just wrapping code that does computation elsewhere (i.e. on gpu's) anyway, we should have this discussion on some other github issue, or mailing list, not here.

noskill commented 5 years ago

All that is discussed here is related to the issue https://github.com/opencog/atomspace/issues/1970, i'll update it instead of creating new one

stellarspot commented 5 years ago

I just want to consider the option where SetLink is replaced by MemeberLink with AnchorNode. Here is an example which finds nodes using GetLink and call LambdaLink using PutLink:

(Inheritance (Concept "B") (Concept "A"))
(Inheritance (Concept "C") (Concept "A"))
(Inheritance (Concept "D") (Concept "E"))

(display
 (cog-execute!
  (Get
   (Inheritance (Variable "$X") (Concept "A")))
  ))

(display
 (cog-execute!
  (Put
   (Lambda
    (Variable "X")
    (Inheritance
     (Variable "X")
     (Concept "W")))
   (Get
    (Inheritance (Variable "$X") (Concept "A"))))))

The current result for GetLink is:

(SetLink
   (ConceptNode "B")
   (ConceptNode "C")
)

And the result of PutLink is:

(SetLink
   (InheritanceLink
      (ConceptNode "C")
      (ConceptNode "W")
   )
   (InheritanceLink
      (ConceptNode "B")
      (ConceptNode "W")
   )
)

If I understand correctly the new proposed design suggests: GetLink:

MemberLink ConceptNode "ball-3" AnchorNode "pattern-#12345"



What is expected behavior of PutLink in this case?
Should it look at the argument and if it is  AnchorNode then retrieve the first atom from MemberLink outgoing set with the given AnchorNode and use it as input?

Could such functionality be added in two steps: 
1. introduce something like ParallelGetLink and ParallelPutLink which do not use SetLink at all 
1. when it works replace  GetLink and PutLink by their new implementation 
linas commented 5 years ago

AnchorNode

Yes, more or less. AnchorNode is also an imperfect solution; thus I'm interested in a publish-subscribe system. #1750 which is meant to be better than AnchorNodes. One problem with Anchor nodes is that if there are two users (two threads) each grabbing one atom (dettaching it from the Anchor Node, and then "processing" it) then there are race conditions where the two threads might both get the same atom. Maybe we can fix this with a LockAnchorNode ? How would that work? What if I want the threads to sleep, waiting for data to arrive at the LockAnchorNode? Do I need a SemphoreNode?

There are other possible solutions: maybe the results of a Get/Bind should be attached to a Value, somewhere, instead of a MemberLink. So maybe have a ResultsValue that provides locking, semaphores, messaging? Or is that a bad idea?

What is expected behavior of PutLink in this case?

Backwards-compatible, in some way. It should "do the right thing", whatever that is. It is not always clear what the "right thing" is, and so some careful thinking and design and prototyping is needed. I like things that are small, simple, and have little or no negative impact on the performance of existing systems. I don't want 99% of the users to run slower, just so that 1% of the users can use some new whiz-bang feature.

Introduce something like ParallelGetLink and ParallelPutLink

Maybe. If you are really interested in this, put together a detailed design (in some other github issue) that shows exactly how this might work. I'm kind-of interested in solutions that solve big problems -- i.e. not just get/put, but also bindlink, also the queueing/dequeueing problem. The SetLink alone is itself not all that terrible -- its really a symptom of the bigger problem of "how does the data flow around?" that needs to be solved, that needs to have better solutions.

stellarspot commented 5 years ago

I just want to summarize problems related to SetLink to have a deeper view on this question. GetLink and PutLink are just used below as examples of systems which produce and consumer some results.

  1. Problems with SetLink as a set concept. It is difficult to add/remove elements to/from SetLink and find a set by pattern (present/absent elements).

Possible solution could be using MemberLink. For example using AnchorNode:

MemberLink
    ConceptNode "ball-1"
    AnchorNode "pattern-#12345"

or a set can be marked by some specialized node like SetNode:

MemberLink
    ConceptNode "ball-1"
    SetNode "set-1"
  1. GetLink can generate results in parallel. PutLink can use results in parallel. There are several dimensions from which the problem can be viewed.

Control flows:

PutLink can work as busy loop which checks for new results or this can be implemented as publisher/subscriber system where GetLink notifies subscribers that a new result is available

Results of GetLink can be stored in AtomSpace as MemberLink or it can be internally implemented program objects which are directly used by PutLink.

Possible solution where results are stored in MemberLinks: GetLink can have SetNode as a parameter where to store results:

GetLink
  SetNode "S"
  InheritanceLink
    VariableNode "X"
    ConceptNode "A"

results:

MemberLink
    ConceptNode "a-1"
    SetNode "S"

PutLink uses SetNode as argument to know where to retrieve results from:

PutLink
  LambdaLink
    ...
  SetNode "S"

(Just a curious question. Why PutLink mentions its argument as the last atom after the body? Is not more naturally to place the argument after the variable list and before the body? )

SetNode can contain its control information as values:

When GetLink finishes it sets SetNode "done" value to true. GetLink reads "canceled" value from SetNode and if it is true stops the execution. When new value is generated GetLink notifies one or all subscribers from "subscribers" value.

MemberLink has "lock" value. If someone needs to delete an element of the set it needs to lock the MemberLink link to not allow to read/delete it by other subscribers. If someone needs to read a set value it must obtain the read lock form the MemberLink has "lock" value.

Another option is to not put results of GetLink into AtomSpace. In this case different program primitive like promise/future, queue or publish/subscribe systems can be used.

  1. AtomSpace cleanup. One of the aim to get rid of SetLink is that it pollutes the AtomSpace and it is difficult to clean it up. It looks like it is a general problem not only related to SetLink. For example ListLink can pollute the AtomSpaces as well:
    PutLink
    LambdaLink
      ...
    ListLink
    Concept "A"
    Concept "B"
    SomeLink ...

There should be a unified way to remove links which are used only to pass results from one place to another.

May be some ArgumentLink which means that PutLink passes values from ArgumentLink to LambdaLink and then removes the LambdaLink.

  1. Description of PutLink declares that it is a notation for beta redex. It can mean that PutLink first puts its arguments to body and makes execution after that.

The current PutLink implementation first executes its arguments and then puts it to body. It seems as an internal property of PutLink. New design of Get/PutLink which uses MemberLinks can't avoid it. The reason is that it is not clear before substitution if result of PutLink argument execution contains only one element or many. So it is necessary to execute the PutLink argument first.

In this case PutLink description should be revisited so it is not just a beta redex.

linas commented 4 years ago

FYI: please note: issue #1750 describes both a QueueValue and a publish-subscribe system for distributing processing results.

linas commented 4 years ago

Pull req #2500 implements the initial idea discussed at the very top, that began this long conversation. It's simple (very simple, actually) and it works. ...

... but in some senses, its incomplete. The above discusses the need for a QueueValue where search results could be placed on a queue, and/or some kind of promises/futures system, that would allow search results to be pulled from the pattern matcher as needed. This has not been done. However, I think these are now quite doable: they can be slotted into the code in the same way that the AnchorLink is slotted in with pull req #2500

To provide more details: the pull req #2500 allows this:

(define get-link
   (Get
      (VariableList
         (TypedVariable (Variable "$x") (Type 'ConceptNode))
         (Anchor "get-results"))
      (Present
         (Evaluation (Predicate "foo")
            (List (Variable "$x") (Concept "alpha"))))))

and the search results will be attached with MemberLinks to the specified AnchorNode. In the same way, one could instead have a FIFONode or a QueueNode or a PromiseNode, instead ot the AnchorNode above, and search results could then be attached to QueueValue on that node, instead of using the rather wasteful MemberLinks.

linas commented 4 years ago

FYI, The following updates are relevant:

linas commented 2 years ago

Summary and state of affairs.

linas commented 2 years ago

Closing. Replaced by #2911