noskill opened this issue 5 years ago
See also my arguments on caching vs. deterministic grounded schema nodes: https://github.com/opencog/atomspace/issues/1907#issuecomment-439340307
Seems reasonable. However: unless there is a current, actual need for this, I suggest that it's a poor idea to implement it right now. Why?
- It adds implementation complexity to an already-complex system.
- It incurs extra overhead costs for all users, during both the pattern compilation and the pattern runtime steps, that makes all users run slightly slower. If there is no one user who will run a lot faster as a result, the change is not worth it.
- Users with inefficient grounded schema do have a work-around, so this is not a critical change. That is, they can get a comparable performance boost in other fairly simple ways.
So: are there use cases, e.g. in PLN, where this would make a difference?
@noskill, are you suggesting to add some optional argument to `cog-execute!`, something like `(cog-execute! my-bind-link #:cache #t)`?
The other thing I had suggested, besides having the caching take place within the grounded code, is to define a `CachedGroundedSchemaNode` type and implement a generic cache within its factory. Or, more abstractly, a `StatelessGroundedSchemaNode` (as @linas might have suggested, IIRC).
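For concreteness, here is a minimal sketch of how such a type might be used, assuming a hypothetical `CachedGroundedSchema` node type existed (the schema name `scm: expensive-fn` is likewise made up for illustration):

```scheme
;; Hypothetical: a drop-in replacement for GroundedSchema whose factory
;; memoizes results, keyed on the argument ListLink.
(ExecutionOutput
  (CachedGroundedSchema "scm: expensive-fn")
  (List (Concept "banana") (Concept "apple")))
```

The idea would be that the first execution calls the Scheme function, while repeated executions with the same arguments return the memoized value without calling out to Scheme again.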
@linas said
> It adds implementation complexity to an already-complex system.
Agreed, but I also think it's a fairly reusable piece of functionality.
> It incurs extra overhead costs for all users, during both the pattern compilation and the pattern runtime steps, that makes all users run slightly slower.
Why? Such a feature should be optional. It would only be costly if enabled, right? Or maybe you mean in case it is the default. Then yes, users who do not want this feature would pay the extra cost. I would bet that most users who do not want caching care more about its effects than its cost, though. Think for instance of a pattern composed of multiple random-generator calls: caching then makes a big difference in behavior.
> Users with inefficient grounded schema do have a work-around, so this is not a critical change. That is, they can get a comparable performance boost in other fairly simple ways.
Agreed as well.
All to say, I'm fairly neutral on that issue: on one hand it seems like something many users would want, on the other it might not be the most critical thing to add ATM. I'd be totally open to adding a `StatelessGroundedSchemaLink` and `StatelessGroundedPredicateLink`; it's pretty clear to me these would be used time and time again. There could probably be a `StatelessLink` that any stateless specialized link could inherit, which would lead to an elegant implementation. Now, maybe one doesn't want to equate stateless with caching, so maybe some extra option might need to be introduced, I don't know. I don't mind equating stateless with caching and seeing how it unfolds.
I have another idea for implementing it; maybe it was already evaluated. In regex queries one can specify the part of the query to return in the result, something like `s/(.+)/\1/`, where `\1` returns the first matched group `(.+)`.

In a BindLink, the query clauses could contain an atom that saves the result of a calculation in a variable, and the rewriting clauses could then use this variable to return exactly what was calculated during query matching.
```scheme
(define bind-link
  (BindLink
    (VariableList
      (Variable "$X")
      (Variable "$Y")
      (Variable "$R"))
    (And
      (Inheritance
        (Variable "$X")
        (Concept "color"))
      (Inheritance
        (Variable "$Y")
        (Concept "color"))
      (Evaluation
        (GroundedPredicate "scm: classify")
        (Let
          (Variable "$R")
          (ExecutionOutput
            (GroundedSchema "scm: schema2")
            (List
              (Variable "$X")
              (Variable "$Y"))))))
    (Variable "$R")))
```
I misunderstood when recalculation of the evaluation link happens. So there are two cases: 1) in a BindLink rewrite clause, 2) during consecutive calls by the rule engine. Case (2) is addressable by external caching; case (1) is addressable by Vitaly's proposal or by stateless/caching links.
There already is an equivalent of `Let`; it's called `StateLink`. However:

1) No clue what happens if you use it in a BindLink like that; it might mis-behave.
2) `StateLink` is global -- it always lives in the atomspace, and everyone sees it. By contrast, I assume the `LetLink` goes out of scope once the search is done. Thus, StateLink cannot be safely used in parallel searches in different threads.

Overall, I think I like the idea of a locally-scoped `LetLink`. See below.
Answering @ngeiswei's question:

> extra overhead costs for all users
Let's assume that there is no `LetLink`. Then during pattern compilation, we'd have to look at every clause, look for every grounded schema, then look at every other clause, find any schema there, and then try to decide if it is of the same kind or a different one. If it's the same kind, then somewhere in `class Pattern` we record this (in some `std::map`, for example), with the `std::map` acting as a private, internal `let` to hold the two (or more) locations. The above search happens even if there is no grounded-anything in the pattern. It's a small overhead, but it's still overhead. (We still don't have a BindLink benchmark...)
Next, during pattern execution, every time a Grounded-something is encountered, that `std::map` in `class Pattern` has to be consulted, to see if that Grounded-something needs to be memoized. The `std::map` has to be consulted every time, even if it's empty.
OK, now please join me on the following wild ride.
It seems like the `LetLink` is a good idea. It is easy to implement. Since it is of local scope, it can be safely used when performing searches in parallel in multiple threads. Note that, since it is purely local, it only has to memorize one thing; it does NOT have to memorize N things for N values! This makes it simple, fast, low-cost, with no risk of exploding RAM usage (cache misses, etc.)
Since a `LetLink` is locally-scoped, it cannot be used across different terms in the rule engine.
`StateLink` is globally scoped. However, it can only remember one thing at a time, and has no awareness of "arguments", or execution-anythings, or Grounded-anythings.
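To illustrate the "one thing at a time" behavior (the anchor and concept names below are made up for the example):

```scheme
;; StateLink keeps only one value per key atom in the atomspace; asserting
;; a new value for the same key replaces the previous one.
(StateLink (Anchor "traffic-light") (Concept "red"))

;; After this, only the "green" state remains attached to the anchor.
(StateLink (Anchor "traffic-light") (Concept "green"))
```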
So, let's imagine a `CacheLink`, which might look like the following:
```scheme
(CacheLink
  (VariableList (Variable "$X") (Variable "$Y"))
  (ExecutionOutputLink
    (GroundedSchema "py:foobar")
    (List (Variable "$X") (Variable "$Y"))))
```
The semantics of this is that, if the return value for X,Y is already known, then it is returned. If it is not known, then the GroundedSchema would be executed, and the resulting return value would be memorized.
So: every time that a `GroundedSchema` is evaluated, we are supposed to put an ordinary schema in the atomspace. Viz. every time we run
```scheme
(ExecutionOutputLink
  (GroundedSchema "py:foobar")
  (List (Concept "banana") (Concept "apple")))
```
we are supposed to create the following, in the atomspace:
```scheme
(ExecutionLink
  (Schema "foobar")
  (List (Concept "banana") (Concept "apple"))
  (Concept "banapple"))
```
And also the converse: every time that we encounter a `GroundedSchema`, we are supposed to first look and see if there already exists an ordinary `ExecutionLink` that holds the desired answer, use that cached value first, and only call the `GroundedSchema` if the `ExecutionLink` does not exist. See https://wiki.opencog.org/w/ExecutionLink
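A rough sketch of what that "check the atomspace first" lookup could look like, reusing the banana/apple example above (this is just an illustration of the idea, not an existing mechanism):

```scheme
;; Search for a previously-recorded ExecutionLink holding the answer.
(cog-execute!
  (Get
    (Variable "$result")
    (ExecutionLink
      (Schema "foobar")
      (List (Concept "banana") (Concept "apple"))
      (Variable "$result"))))
;; If the returned SetLink is non-empty, use the cached value; otherwise
;; run the GroundedSchema and record a new ExecutionLink as shown above.
```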
Now, for some critiques and comments.
A lot of stuff has happened in ten years. Back then, using ExecutionLinks to cache the results of ExecutionOutputLinks made sense. Back then, the pattern-matcher did not exist. Question: does this old idea still make sense? Or is it a bad idea?
I kind-of like the `CacheLink` idea. But if atomspaces are caches, maybe we should formalize that: rename `CacheLink` to `AtomSpaceLink`? That way, we really can have multiple atomspaces in a controllable fashion. Question: what are the details of making this work? (Please note: there are other reasons for having an AtomSpaceLink; it simplifies certain distributed-processing scenarios.)
I would separate the following issues:

(1) how to return the same thing which was calculated by `ExecutionOutputLink` during the pattern matcher search (this issue)
(2) how to not calculate an `ExecutionOutputLink` with the same parameters twice, if that is required (issue https://github.com/opencog/atomspace/issues/1907)

There are solutions for (2) which also solve (1). For instance `AtomspaceLink`, if I got the idea right (my understanding is that `AtomspaceLink` would cause `ExecutionOutputLink` results to be added to the atomspace in the form of `ExecutionLink`s).
I think `CacheLink` is not declarative enough to be a part of Atomese. The `LetLink` I meant should not cache, but just designate some part of the query, to make sure the same part of the query is used in a few different places. And it cannot be used to solve problem (2); it solves only (1). Yes, it is more similar to `StateLink`.
One of the side effects of the `AtomspaceLink` I see is that it adds an `ExecutionLink` to the atomspace each time it is calculated, not only when all clauses match. @linas, do you mean that an `AtomspaceLink` within a pattern matcher clause would keep results in an internal pattern-matcher fork of the atomspace?
Looking at the ExecutionOutputLink wiki I found that an `ExecutionLink` can be used to calculate the result of the `ExecutionOutputLink` and return it as a new `ExecutionLink` instance. It sounds similar to `AtomspaceLink` as I understand it.
@vsbogd as far as I understand it, `AtomspaceLink` would insert the atom in a designated atomspace, not necessarily the existing one. The advantage of inserting in a new atomspace is that it would be really fast to retrieve and wouldn't clobber the existing atomspace(s).
The following issue mentions using `ExecutionLink` while executing `ExecutionOutputLink`: #1795.
BTW, there is no `ExecutionLink::do_execute()` as mentioned in https://wiki.opencog.org/w/ExecutionOutputLink#Execution; I think @linas means `ExecutionOutputLink::do_execute`.
@vsbogd yes, separating this into two issues is a good idea. I do like `LetLink` because it seems simple, "obvious", not hard to code up, and has a minimal performance impact on non-users. So if you really, really want to have it, go for it. It seems harmless. (I cannot resist a mean joke at this point. Or a warning: Nil will find some way of combining it with a Quote/UnquoteLink and it will drive you crazy.)
The second part is this muddy mish-mash of Execution vs. ExecutionOutput vs CacheLink vs. AtomSpaceLink and it is quite unclear exactly how these should work and interact. When I start thinking about the details, I get a mess that I don't quite like. There are unpleasant side-effects, unpleasant implications. I could write a long post where I ponder the pros and cons of different implementations for "all this". But it would be hard to write, and maybe not-fun to read. So I encourage: maybe you should think about the best way to do something like that, and what it would mean. Then we can all compare notes?
I mean, I could spend a day or two or more, pondering the question of Execution vs. ExecutionOutput vs CacheLink vs. AtomSpaceLink and how that would all work. And perhaps, after ten years of futzing with other things, the time has come for this. ... Hmmm.
Below is my attempt to understand, from the very beginning, whether the pattern matcher should treat `ExecutionOutputLink` as a function without side effects or not.
I can see three cases:

1. `ExecutionOutputLink` is like an arithmetic operation: it has no side effects and returns the same result each time.
2. `ExecutionOutputLink` learns something as it goes and returns a more and more precise result each time it is called.
3. `ExecutionOutputLink` is like measuring temperature (or another environmental condition), which may differ from measurement to measurement.

The only case which is really broken by unconditional caching of `ExecutionOutputLink` results is case (2). But it is questionable whether case (2) makes sense, as many searches can be performed to improve the result instead of making many calls during one search.
The natural solution for case (3) is doing the measurement before the search, adding the results to the atomspace, and using them in the search query. The same technique can be applied to make the search in case (1) more effective. If the `ExecutionOutputLink` parameters have variables which should be grounded during the search, then the query can be split in two: a first query to ground the variables, then execution of the `ExecutionOutputLink`, then a second search to find the result of the original query.
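A rough sketch of that split, reusing the color/classify example from earlier in this thread (this is just an illustration of the idea, not tested code):

```scheme
;; Phase 1: ground $X and $Y without calling the grounded schema.
(define groundings
  (cog-execute!
    (Get
      (VariableList (Variable "$X") (Variable "$Y"))
      (And
        (Inheritance (Variable "$X") (Concept "color"))
        (Inheritance (Variable "$Y") (Concept "color"))))))

;; Phase 2 (not shown): for each pair of groundings, run
;; (GroundedSchema "scm: schema2") once and record the result as an
;; ordinary ExecutionLink; the second, original query then matches those
;; recorded ExecutionLinks instead of re-running the schema.
```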
In other words, I don't see a reason for the pattern matcher to treat `ExecutionOutputLink` as a function which has side effects while the pattern matcher performs a search. Doing so means that each time, the search returns different results. But that makes sense only for case (2), and that can be solved by using more searches. Maybe I don't quite understand the set of problems which the pattern matcher solves.
I didn't think about performance here. I would like to understand whether we have a use case where an `ExecutionOutputLink` used within a pattern matcher query has side effects by design.
> I didn't think about performance here. I would like to understand whether we have a use case where an ExecutionOutputLink used within a pattern matcher query has side effects by design.
The only useful case I can think of is calling the same random generator at different places in the clauses. I've never needed that, but I think @linas has, when experimenting with controlling Sophia.
I can think of scenarios where every time some schema is executed it increases the strength of some TV, and some virtual link also depends on that TV; then it would make a difference. But do we need that?
Generally speaking, stateless is preferable when possible, I think.
If it were to become the default I don't think it would negatively impact my work for instance, but that may differ for others.
I like the `LetLink` idea.

I also like the `StatelessLink` idea and I don't see anything detrimental about it. Surely the overhead would be negligible for non-users. On top of that, if it speeds up stateless usages (which is probably most cases) then it's a win. I'm not even arguing for it being the default; prepending `Stateless` to a link type is not much to ask anyway.
To answer @vsbogd's question: `ExecutionOutputLink` is supposed to call the `GroundedSchemaNode`, which causes scheme or python code to run. It is impossible to know what that code might do, so we have to assume it can do anything, including turning on the lights in a room on the other side of the planet.
In the distant past, the Sophia robot made extensive use of `GroundedSchemaNode` and `GroundedPredicateNode` to attach ROS interfaces to both sensory inputs and motor outputs. Sensory inputs are things like "face 42 is now visible", and outputs include things like "smile and blink three times". The ROS interfaces are not in the opencog repo, mostly because I did not want to make ROS a prerequisite for building opencog. I'm not sure what repo it is in, and I am not sure if `GroundedSchemaNode` is still used (but it probably is?)
I don't think the robot code uses `GroundedSchemaNode` in patterns (but I could be wrong; the ghost people would know for sure). Mostly, the `ExecutionOutputLink`s were used with `PutLink`: we would use `GetLink` (i.e. the pattern matcher) to figure out if the robot should smile or frown, and then use `PutLink` with an `ExecutionOutputLink` to send the smile/frown command to the robot.
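Roughly, the combination looks like the sketch below; the predicate and schema names (`visible face`, `py: do-smile`) are invented here for illustration, not taken from the actual robot code:

```scheme
;; Get finds the groundings; Put substitutes each one into the
;; ExecutionOutput body, whose execution calls the grounded python schema.
(cog-execute!
  (Put
    (ExecutionOutput
      (GroundedSchema "py: do-smile")
      (List (Variable "$face")))
    (Get
      (Variable "$face")
      (Evaluation
        (Predicate "visible face")
        (List (Variable "$face"))))))
```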
Also: some of the demos explicitly show stateful behavior. One example involves a drag-race light-pole: three blinks of red, one blink yellow, one blink green. I don't recall if this uses GroundedSchema or GroundedPredicate. Another example builds a finite state machine in atomese. Both examples are also unit tests.
@linas, yes, sure, a `GroundedSchemaNode` can call some stateful python code. And I have tried to cover your Sophia example (where a `GroundedSchemaNode` returns the state of Sophia's sensors) in my comment above, as case (3).
The question is whether it is acceptable for the pattern matcher functionality to pre-cache the result of a `GroundedSchemaNode` before running a query. My opinion is that it is acceptable, because a pattern matcher query is not some endless process: it gets an input and returns a result. When the state of the environment changes, the system makes another call to the pattern matcher to get a new result after the change. And this processing happens outside of the pattern matcher implementation.
I remember the finite state machine example; I will look at it more closely.
Yes, it is acceptable for the pattern matcher to cache the result of a `GroundedSchemaNode` during the query.
To "pre-cache before the query" doesn't make sense: there are presumably one or more variables, and you can't know what values those variables might take, unless you run the query.
Caching anything "before the query" brings us back to the tangle of ideas about recording (saving) results of `ExecutionOutput` into `ExecutionLink`, and/or some kind of `CacheLink`, and/or some kind of `AtomSpaceLink`.
The reason that it is a tangle is that there are no clear-cut use cases to aid thinking about how these should work. I don't want to invent a solution that is looking for a problem to solve. The initial example from @noskill is clear-cut, and seems to have a fairly nice, clear-cut solution with `LetLink`, but it's not a strong enough example to drive deeper, more complex changes.
@linas, you said
> Yes, it is acceptable for the pattern matcher to cache the result of GroundedSchemaNode during the query.
but you also said
> but it's not a strong enough example to drive deeper, more complex changes
So what is the conclusion? Are you OK with modifying the pattern matcher to cache results within a query (not across queries, we all agree on that)?
I'm asking because, in that case, I don't think `LetLink` is necessary anymore. I think within-query caching is actually the simplest thing to implement, no need to introduce `LetLink` or `StatelessLink`, and I don't anticipate problems with it. The only problem I can think of is wanting to reuse, say, the same random generator across different clauses within the same query. It's not a problem for me, but I wonder if it would be a problem for someone else...
> Are you OK with modifying the pattern matcher to cache results within a query
These were my concerns. Two of them. One is that I'm pretty sure this will break existing unit tests and existing examples, e.g. the red-light, green-light demo. A second reason is that I think it is computationally expensive to figure out what can be cached, when, and where. The reason I like LetLink is that it removes the computational overhead: it tells you exactly what is allowed to be cached.
Maybe I'm wrong about both. There are many stateful GroundedPredicate demos and unit tests. Maybe none of them are inside of BindLink/GetLink. I don't recall. So maybe this change can be made without breaking unit tests. I don't know.
Re: computational complexity: perhaps the implementation can be trivial. Maybe the results are already cached in the C++ structures in PatternLink and/or the PatternMatchEngine. If they are not yet there, I can imagine that maybe they are easy to add. So maybe this is actually easy to implement. So, go try it. If it's easy, do it.
Re: the random generator, it's got nothing to do with that. For the robot, we use GroundedSchemas for obtaining the locations of people, interfacing to the face-recognition subsystem, etc., which are very definitely stateful. However, they can be treated as being stateless within a single run of the pattern matcher.
> One is that I'm pretty sure this will break existing unit tests and existing examples, e.g. the red-light, green-light demo.
As far as I know, @noskill looked into this example recently; maybe he can comment on this particular case. But I agree that implementing it may break the unit tests, and we should probably try to find examples which require respecting statefulness.
> A second reason is that I think it is computationally expensive to figure out what can be cached, when, and where.
Agreed, that should be taken into account.
> For the robot, we use GroundedSchemas for obtaining the locations of people, interfacing to the face-recognition subsystem, etc., which are very definitely stateful. However, they can be treated as being stateless within a single run of the pattern matcher.
Thanks, that is one of the things I wanted to know your opinion about.
> they can be treated as being stateless within a single run of the pattern matcher.
>
> Thanks, that is one of the things I wanted to know your opinion about.
I would like to think of a single run of the pattern matcher as being very much like a single, indivisible, atomic run where time stops, everything is frozen, nothing changes, and then a result is magically produced, and time starts running again. In this sense, it should be a lot like a database query, and so should resemble the ACID style as much as possible. At this time, I want to avoid an ACID vs. BASE debate, at least in this issue (we can have a different issue for an ACID vs. BASE debate). Ground rules would be to read the Microsoft paper that explains that NoSQL is actually coSQL (viz. the opposite: the direction of all arrows is reversed). For opencog, the debate is harder, because we have arrows going in both directions, so taking the co- to get BASE is confusing. So we cannot, right now, have perfect ACID or BASE. Like I say, this is a different, confusing debate.
> time stops
This could literally be enforced, by taking a global lock on the atomspace, thus preventing atoms from being inserted or deleted, until the query is finished. However, this would be a rather extreme step at this point; I would not want to take it. (locks prevent parallelism) So this is an example where BASE ideas are nicer.
My proposal is to add a new flag to the bindlink function, to reuse matched execution/evaluation links in the pattern matcher search:
Example:
Here we want to return from bindlink exactly what was matched and used in the AndLink matching. The current implementation would recompute the ExecutionOutput link.
One way to implement it is to treat GroundedSchema as a deterministic function and store evaluations in a temporary atomspace. That would allow the use of many heuristics for reducing the search space, like the minimum-remaining-values heuristic.
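A minimal sketch of what invoking such a flag might look like at the Scheme level, assuming a hypothetical keyword argument (the name `#:reuse-matched` is made up here, echoing the `#:cache` suggestion earlier in the thread) and reusing the `bind-link` defined above:

```scheme
;; Hypothetical: ask the pattern matcher to reuse, in the rewrite, the very
;; ExecutionOutput/Evaluation results it computed while matching the clauses,
;; instead of recomputing them.
(cog-execute! bind-link #:reuse-matched #t)
```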