Split atoms copying and atom creation factory methods and generalize ClassFactory

vsbogd commented 5 years ago

This issue is to discuss plan of removing unnecessary double atoms creation which were discussed at issue #1965, especially in comments:

At the moment createNode and createLink functions create instance of unspecific instance of Node or Link class and passes them into ClassServer::factory which creates another atoms with proper class.

Suggested plan:

move validator call from DEFINE_LINK_FACTORY DEFINE_NODE_FACTORY to the ClassFactory::factory method
stop replacing factories by validating_factory in ClassServer::addValidator method, just call proper validator instead in ClassFactory::factory
reuse ClassServer::spliceFactory method to splice not only factories but validators as well
as side effect it should fix unit tests from PR https://github.com/opencog/atomspace/pull/2023#issue-249267032
add additional ClassServer::_copier collection to keep methods to copy atoms
add additional method copyAtom to use it in AtomTable::add method for atoms copying
replace Handle (*)(const Handle& ) factory method signature by method with vararg like in ValueFactory
in createNode and createLink replace new atom creation by arguments forwarding

linas commented 5 years ago

I think I like this. But first some general commentary, just so that things are clear.

methods to copy atoms There is one and only one valid place where it is acceptable to copy an atom: this is during atom-table insertion, so that the AtomTable can have it's own private copy. It is currently "illegal" to make copies of atoms; it should never be done. (There might exist bizarro corner cases, but I cannot think of any) The goal is to always have only one copy, ever, of any atom. So please to NOT add a _copier()
The atom factory should only be called only once, ever, and that is when an atom is inserted into the atomspace. It is used for only one purpose: to create the one-and-only, single, final version of the atom. The factory should never-ever be called from anywhere else.
Some atom constructors are very expensive. For example, the PatternLink does a HUGE amount of calculation. That's OK, because that calculation is only done once, ever, when the PatternLink is inserted into the atomspace. We want to be careful, and not accidentally cause a PatternLink costrutor to run more than once.
The optimization problem is to move atom-style-data as fast as possible from the user-API (scheme, python, haskell) into the atomspace, where the factory runs once, and returns. By "atom-style-data", I mean: a type, a node-string, a link-handle-seq and maybe a TV. I don't care how it is moved into the atomspace, as long as it's fast. Today, the easiest, fastest way to do this is to create an "empty shell of an atom", with createNode/createLink, and pass it around as needed. Simple, easy, uniform. But it's just an empty shell, until the factory runs and creates "the real one" (the one-and-only, true, globally-unique real one).
One performance optimzation might be to make the shell "even emptier", somehow. Maybe by not keeping the incoming set for the non-final atoms? So that any atom that is not in the atomspace also does not have an incoming set! Huh. I like that idea...

OK that's the story of factories. The stories of validators are different.

The idea of validators is still kind-of an experimental prototype, and I am not yet convinced that it is a good idea. Maybe it is, and maybe there is a better way. It kind-of-works, mostly, but its ugly and it has problems that I don't know how to solve, yet. The experiment is described here: atom_types.script lines 66 to 107
Those six special link types are tightly coupled to the validators, to force validation.
Some atoms have factories that differ from validators. For example, ARITHMETIC_LINK has a factory, because there is a C++ class called ArithmeticLink. However, (see atom_types.script line 591) is also a NUMERIC_LINK and a NUMERIC_OUTPUT_LINK, and so those will trigger validators that are completely unlrelated to the C++ class. Those validators come from a different place.

So I think I like your proposal. Some point by point comments:

move validator call from DEFINE_LINK_FACTORY DEFINE_NODE_FACTORY to the ClassFactory::factory method

Sure, fine, don't care.

stop replacing factories by validating_factory in ClassServer::addValidator method, just call proper validator instead in ClassFactory::factory

Uh sure, I guess ?? I'm not sure what this solves.

reuse ClassServer::spliceFactory method to splice not only factories but validators as well

Yes. I think you discovered this in your unit test. I wrote splceFactory(), and although it is only 20 lines long, it took me two or three very long days to write it; I tried half a dozen other approaches, some of which were extremely complex, and took many hundreds of lines of impossible code. I was very happy when I finally found a simple way to create spliceFactory. By the end, I was so disgusted and tired, that I did not want to do it again (i.e. I did not feel like creating spliceValidator) . But your unit test discovered the bug that results from my laziness.

as side effect it should fix unit tests from PR https://github.com/opencog/atomspace/pull/2023#issue-249267032

OK

add additional ClassServer::_copier collection to keep methods to copy atoms
add additional method copyAtom to use it in AtomTable::add method for atoms copying

Avoid copies, per above.

replace Handle (*)(const Handle& ) factory method signature by method with vararg like in ValueFactory

Careful, I had to fix a bug in ValueFactory just a few weeks ago. It works, I like the idea, but the use of the static is ... interesting, non-obvious, subtle, and error prone.

in createNode and createLink replace new atom creation by arguments forwarding

Uhh, sure I guess. I don't understand what that means. Keep in mind, most atoms should be "empty shells", unborn seeds, just passing through, temporary holders for a type, a node-string and a handle-seq. it should be simple and fast to create them, and then let them die as soon as they've moved thier atom-style-date from the API to the atomspace.

linas commented 5 years ago

OK, so here:

https://github.com/opencog/atomspace/issues/1965#issue-393676171

There a bad bug there. createNode should not call the classserver().factory(tmp->get_handle()) That is seriously fucked up. I don't know how that happened: that is an outright bug. It needs to be fixed as soon as possible. As explained above: createNode should create a small, simple "empty shell" of an atom, as it moves from scheme/python to its final home inside the atomspace. Factory should run once and only once!

https://github.com/opencog/atomspace/blob/e47dc46daa278c22cf3ba0c503e12b0e1e9b025d/opencog/atoms/base/Node.h#L128-L134

The above is a bug. Apparently, I am the one who added this bug, 16 months ago: https://github.com/opencog/atomspace/commit/7b709229e3f4ccd8156f02d585341df10615a076 I just now tried to fix it and 84 unit tests failed out of 143; most segfaulted. That's just crazy. 17 months ago, things worked fine, so what happened? Yecchhhh.

And here: https://github.com/opencog/atomspace/commit/44b212bad7efa3c92b9fc1a43103810a8cda1c57 It seems that I was trying to remove the factory from the atom table, so that the atom table code would be simpler. But after this change, the "empty shells" are no longer empty. So that was a mistake.

This stuff is hard. We don't have any good performance tests to catch these kinds of mistakes.

It's also possible that I am making a mountain out of a molehill. Maybe the main performance bottlenecks are in a completely different place...

linas commented 5 years ago

This comment:

https://github.com/opencog/atomspace/issues/1965#issuecomment-449588049

This is a symptom of a general confusion about when to use temporaries/empty-shells, and when one has the "real" atom. Inserting an atom into the atomspace is expensive (its slow, there is lock contention) and so there is a lot of effort made to avoid putting atoms into the atomspace, until we are really sure we want to. As a result, there are a lot of temporaries and "empty shells" floating around. But sometimes, the empty shell is not good enough, because we need to call some method on the "real" c++ class. This results in a lot of hackery and confusion.

Maybe the correct long-term fix is that, whenever we need an actual C++ instance, and the FooAtomCast fails, then we should insert the atom into the atomspace, and try FooAtomCast, again.

vsbogd commented 5 years ago

Avoid copies, per above.

I agree that only place to make an atom copy is AtomTable. And may be it is better to not copy atoms at all, just use same instance in multiple atomspaces as you suggested in https://github.com/opencog/atomspace/issues/1967 . On the other hand I would allow copying in AtomTable itself till #1967 is not implemented.

vsbogd commented 5 years ago

There a bad bug there. createNode should not call the classserver().factory(tmp->get_handle())

As I see previous AtomTable code called _classserver.factory(Handle(createLink(closet, atom_type)));, effectively chain of calls was the same. If factory method signature will be changed from Handle (*)(const Handle& ) to Handle (*)(...) it will be possible to implement createNode and createLink methods by call of the factory itself.

vsbogd commented 5 years ago

Working on PR https://github.com/opencog/atomspace/pull/2029 I have realized that in case of multiple inheritance (for example NODE_C <- NODE_B, NODE_A) and when both NODE_A and NODE_B have validators then only validator of NODE_B will be called. Should we make a chain of validators calls to run all of type validators not only one?

linas commented 5 years ago

Avoid copies, per above.

1967

Lets ignore that, for now. #1967 is complicated, and I kind of want newcomers to not get confused about copying.

linas commented 5 years ago

it will be possible to implement createNode and createLink methods by call of the factory itself.

Yes, but what I am saying is that createNode and createLink almost surely should not be calling the factory. (?) They should be creating temporary "empty shells". The idea is that the "empty shell" avoids the CPU over-head of calling the ctor. The goal is that the "empty shell" is a small, light-weight, non-functional way to temporarily hold a type, a name-string, an handle-seq for just a little while.

vsbogd commented 5 years ago

They should be creating temporary "empty shells"

But I didn't see in code examples when empty shell is created and used before calling factory on it. And from my perspective it makes sense because all that you can do with empty shell is persist it. To call any method on it you need to convert empty shell to the instance of the right class.

linas commented 5 years ago

Should we make a chain of validators calls to run all of type validators

Yes.

linas commented 5 years ago

If you get the urge to rename ClassServer to something else, that would be OK too.

ngeiswei commented 5 years ago

I'm not sure either why you'd absolutely need an atomspace to fully instantiate and use an atom.

What is the problem with filling the shell before inserting to atomspace?

Inserting has a cost, sometimes and you'd rather keep redundant copies and avoid that cost. Well, I guess you can always have an atomspace per copy but that's an overhead.

linas commented 5 years ago

But I didn't see in code examples when empty shell is created and used before calling factory on it.

OK. It probably doesn't matter, at this time. What I'm trying to say is that if you put a printf into some atom constructor, you will see it print twice: once, when the atom is created outside the atomspace, and once when the private-copy is created. I'm mostly just torturing myself (and torturing you) over this; there is no obvious or easy fix for this, and I should just stop talking. But now you know: most atoms get created twice.

implement createNode and createLink methods by call of the factory itself.

Yes, go ahead and do this. Sorry to have derailed the conversation so much. #2029 looks good. Is everything here now done?

linas commented 5 years ago

Hi @ngeiswei

I'm not sure either why you'd absolutely need an atomspace to fully instantiate and use an atom.

Well, the classic example is banking. If there is one and only one globally-unique atom per bank-account, then you can safely store money in it. But if I have two atoms for one bank account, and I withdraw $500 from one of them, does that mean I still have $500 in the other?

For this reason, pretty much all databases since the dawn of time have insisted that atoms are always globally unique; chaos ensues when they are not. That said, there are 1001 exceptions where data does not have to be absolutely, totally unique.

vsbogd commented 5 years ago

Is everything here now done?

Not yet, will work on this further

opencog / atomspace

Split atoms copying and atom creation factory methods and generalize ClassFactory #2024

1967