Open vsbogd opened 5 years ago
I think I like this. But first some general commentary, just so that things are clear.
methods to copy atoms
There is one and only one valid place where it is acceptable to copy an atom: this is during atom-table insertion, so that the AtomTable can have it's own private copy. It is currently "illegal" to make copies of atoms; it should never be done. (There might exist bizarro corner cases, but I cannot think of any) The goal is to always have only one copy, ever, of any atom. So please to NOT add a _copier()
The atom factory should only be called only once, ever, and that is when an atom is inserted into the atomspace. It is used for only one purpose: to create the one-and-only, single, final version of the atom. The factory should never-ever be called from anywhere else.
Some atom constructors are very expensive. For example, the PatternLink
does a HUGE amount of calculation. That's OK, because that calculation is only done once, ever, when the PatternLink is inserted into the atomspace. We want to be careful, and not accidentally cause a PatternLink costrutor to run more than once.
The optimization problem is to move atom-style-data as fast as possible from the user-API (scheme, python, haskell) into the atomspace, where the factory runs once, and returns. By "atom-style-data", I mean: a type, a node-string, a link-handle-seq and maybe a TV. I don't care how it is moved into the atomspace, as long as it's fast. Today, the easiest, fastest way to do this is to create an "empty shell of an atom", with createNode/createLink, and pass it around as needed. Simple, easy, uniform. But it's just an empty shell, until the factory runs and creates "the real one" (the one-and-only, true, globally-unique real one).
One performance optimzation might be to make the shell "even emptier", somehow. Maybe by not keeping the incoming set for the non-final atoms? So that any atom that is not in the atomspace also does not have an incoming set! Huh. I like that idea...
OK that's the story of factories. The stories of validators are different.
The idea of validators is still kind-of an experimental prototype, and I am not yet convinced that it is a good idea. Maybe it is, and maybe there is a better way. It kind-of-works, mostly, but its ugly and it has problems that I don't know how to solve, yet. The experiment is described here: atom_types.script
lines 66 to 107
Those six special link types are tightly coupled to the validators, to force validation.
Some atoms have factories that differ from validators. For example, ARITHMETIC_LINK
has a factory, because there is a C++ class called ArithmeticLink
. However, (see atom_types.script line 591) is also a NUMERIC_LINK
and a NUMERIC_OUTPUT_LINK
, and so those will trigger validators that are completely unlrelated to the C++ class. Those validators come from a different place.
So I think I like your proposal. Some point by point comments:
DEFINE_LINK_FACTORY
DEFINE_NODE_FACTORY
to the ClassFactory::factory
methodSure, fine, don't care.
validating_factory
in ClassServer::addValidator
method, just call proper validator instead in ClassFactory::factory
Uh sure, I guess ?? I'm not sure what this solves.
ClassServer::spliceFactory
method to splice not only factories but validators as wellYes. I think you discovered this in your unit test. I wrote splceFactory()
, and although it is only 20 lines long, it took me two or three very long days to write it; I tried half a dozen other approaches, some of which were extremely complex, and took many hundreds of lines of impossible code. I was very happy when I finally found a simple way to create spliceFactory
. By the end, I was so disgusted and tired, that I did not want to do it again (i.e. I did not feel like creating spliceValidator
) . But your unit test discovered the bug that results from my laziness.
OK
ClassServer::_copier
collection to keep methods to copy atomscopyAtom
to use it in AtomTable::add
method for atoms copyingAvoid copies, per above.
Handle (*)(const Handle& )
factory method signature by method with vararg like in ValueFactory
Careful, I had to fix a bug in ValueFactory just a few weeks ago. It works, I like the idea, but the use of the static is ... interesting, non-obvious, subtle, and error prone.
createNode
and createLink
replace new atom creation by arguments forwardingUhh, sure I guess. I don't understand what that means. Keep in mind, most atoms should be "empty shells", unborn seeds, just passing through, temporary holders for a type, a node-string and a handle-seq. it should be simple and fast to create them, and then let them die as soon as they've moved thier atom-style-date from the API to the atomspace.
OK, so here:
There a bad bug there. createNode
should not call the classserver().factory(tmp->get_handle())
That is seriously fucked up. I don't know how that happened: that is an outright bug. It needs to be fixed as soon as possible. As explained above: createNode should create a small, simple "empty shell" of an atom, as it moves from scheme/python to its final home inside the atomspace. Factory should run once and only once!
The above is a bug. Apparently, I am the one who added this bug, 16 months ago: https://github.com/opencog/atomspace/commit/7b709229e3f4ccd8156f02d585341df10615a076 I just now tried to fix it and 84 unit tests failed out of 143; most segfaulted. That's just crazy. 17 months ago, things worked fine, so what happened? Yecchhhh.
And here: https://github.com/opencog/atomspace/commit/44b212bad7efa3c92b9fc1a43103810a8cda1c57 It seems that I was trying to remove the factory from the atom table, so that the atom table code would be simpler. But after this change, the "empty shells" are no longer empty. So that was a mistake.
This stuff is hard. We don't have any good performance tests to catch these kinds of mistakes.
It's also possible that I am making a mountain out of a molehill. Maybe the main performance bottlenecks are in a completely different place...
This comment:
This is a symptom of a general confusion about when to use temporaries/empty-shells, and when one has the "real" atom. Inserting an atom into the atomspace is expensive (its slow, there is lock contention) and so there is a lot of effort made to avoid putting atoms into the atomspace, until we are really sure we want to. As a result, there are a lot of temporaries and "empty shells" floating around. But sometimes, the empty shell is not good enough, because we need to call some method on the "real" c++ class. This results in a lot of hackery and confusion.
Maybe the correct long-term fix is that, whenever we need an actual C++ instance, and the FooAtomCast
fails, then we should insert the atom into the atomspace, and try FooAtomCast
, again.
Avoid copies, per above.
I agree that only place to make an atom copy is AtomTable. And may be it is better to not copy atoms at all, just use same instance in multiple atomspaces as you suggested in https://github.com/opencog/atomspace/issues/1967 . On the other hand I would allow copying in AtomTable itself till #1967 is not implemented.
There a bad bug there. createNode should not call the classserver().factory(tmp->get_handle())
As I see previous AtomTable
code called _classserver.factory(Handle(createLink(closet, atom_type)));
, effectively chain of calls was the same. If factory method signature will be changed from Handle (*)(const Handle& )
to Handle (*)(...)
it will be possible to implement createNode
and createLink
methods by call of the factory itself.
Working on PR https://github.com/opencog/atomspace/pull/2029 I have realized that in case of multiple inheritance (for example NODE_C <- NODE_B, NODE_A) and when both NODE_A and NODE_B have validators then only validator of NODE_B will be called. Should we make a chain of validators calls to run all of type validators not only one?
Avoid copies, per above.
1967
Lets ignore that, for now. #1967 is complicated, and I kind of want newcomers to not get confused about copying.
it will be possible to implement createNode and createLink methods by call of the factory itself.
Yes, but what I am saying is that createNode and createLink almost surely should not be calling the factory. (?) They should be creating temporary "empty shells". The idea is that the "empty shell" avoids the CPU over-head of calling the ctor. The goal is that the "empty shell" is a small, light-weight, non-functional way to temporarily hold a type, a name-string, an handle-seq for just a little while.
They should be creating temporary "empty shells"
But I didn't see in code examples when empty shell is created and used before calling factory on it. And from my perspective it makes sense because all that you can do with empty shell is persist it. To call any method on it you need to convert empty shell to the instance of the right class.
Should we make a chain of validators calls to run all of type validators
Yes.
If you get the urge to rename ClassServer
to something else, that would be OK too.
I'm not sure either why you'd absolutely need an atomspace to fully instantiate and use an atom.
What is the problem with filling the shell before inserting to atomspace?
Inserting has a cost, sometimes and you'd rather keep redundant copies and avoid that cost. Well, I guess you can always have an atomspace per copy but that's an overhead.
But I didn't see in code examples when empty shell is created and used before calling factory on it.
OK. It probably doesn't matter, at this time. What I'm trying to say is that if you put a printf into some atom constructor, you will see it print twice: once, when the atom is created outside the atomspace, and once when the private-copy is created. I'm mostly just torturing myself (and torturing you) over this; there is no obvious or easy fix for this, and I should just stop talking. But now you know: most atoms get created twice.
implement createNode and createLink methods by call of the factory itself.
Yes, go ahead and do this. Sorry to have derailed the conversation so much. #2029 looks good. Is everything here now done?
Hi @ngeiswei
I'm not sure either why you'd absolutely need an atomspace to fully instantiate and use an atom.
Well, the classic example is banking. If there is one and only one globally-unique atom per bank-account, then you can safely store money in it. But if I have two atoms for one bank account, and I withdraw $500 from one of them, does that mean I still have $500 in the other?
For this reason, pretty much all databases since the dawn of time have insisted that atoms are always globally unique; chaos ensues when they are not. That said, there are 1001 exceptions where data does not have to be absolutely, totally unique.
Is everything here now done?
Not yet, will work on this further
This issue is to discuss plan of removing unnecessary double atoms creation which were discussed at issue #1965, especially in comments:
At the moment
createNode
andcreateLink
functions create instance of unspecific instance ofNode
orLink
class and passes them intoClassServer::factory
which creates another atoms with proper class.Suggested plan:
DEFINE_LINK_FACTORY
DEFINE_NODE_FACTORY
to theClassFactory::factory
methodvalidating_factory
inClassServer::addValidator
method, just call proper validator instead inClassFactory::factory
ClassServer::spliceFactory
method to splice not only factories but validators as wellClassServer::_copier
collection to keep methods to copy atomscopyAtom
to use it inAtomTable::add
method for atoms copyingHandle (*)(const Handle& )
factory method signature by method with vararg like inValueFactory
createNode
andcreateLink
replace new atom creation by arguments forwarding