Create atom type at run-time

ngeiswei commented 5 years ago

Overview

Currently the only way to create new atom types is at compile time via C++. It could be useful to allow the creation of new types at run-time via the binding languages, Scheme, Python and Haskell. It should be possible to define their inheritance relationship with the other types, as well as their associated executions if any.

Pro

More user friendly than C++ compile-time.
May open up new possibilities if Atomese itself can create atom types.

Cons

According to @linas here is a list of cons (extracted from https://github.com/opencog/atomspace/pull/2180#issuecomment-495770974):

So far, no one has really needed them. ~~In the name of minimalism, one should not implement features that no one wants.~~ (except it's already implemented, with C++ bindings; all that is missing are python and scheme bindings for the existing code)
~~Obviously, the C++-backed atoms need to be defined at compile time. Currently, maybe half, or more are defined at compile time.~~ (Not an issue)
~~There is a performance hit: AtomTable holds a std::vector of length NUM_TYPES - That would need to be a hash table, and it could slow down atom-space references by 2x maybe.~~ (Not a problem; see below)
~~Restoring atoms from SQL whose definitions have been deleted is .. troublesome~~ (not a problem; see below)
~~Python and scheme bindings are troublesome...~~ (not an issue, see below)
~~Value types and Atom types are contiguous. This MUST be fixed.~~ (fixed in #2192)
Naive users should be discouraged from creating a million new Atom Types. (1K or 2K is OK; 1M is not)

ngeiswei commented 5 years ago

@linas, regarding

There is a performance hit: AtomTable holds a std::vector of length NUM_TYPES - That would need to be a hash table, and it could slow down atom-space references by 2x maybe. AtomSpace access is already too slow...

Why would you need a hash table, can't the new types be simply pushed on that vector?

Hmm, or maybe it would require that their number ID be content generated (given their names and inheritance relationship with the other types) which indeed would require a hash...
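The push-on-the-vector idea can be sketched as follows. This is only an illustration, not the AtomTable code: TypeIndex, TypeBucket, add_type and the 16-bit Type alias are all hypothetical names. It assumes IDs are handed out in declaration order, so the per-type table can stay a vector and only name lookups need a hash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Type = std::uint16_t;  // hypothetical stand-in for the real Type

// Hypothetical per-type bucket, e.g. the atoms of one type.
struct TypeBucket {
    std::string name;
    std::vector<int> handles;  // placeholder for real atom handles
};

class TypeIndex {
public:
    // Appending a type is one push_back; the new ID is the old size.
    Type add_type(const std::string& name) {
        assert(buckets_.size() < 65535 && "Type is 16 bits wide in this sketch");
        Type id = static_cast<Type>(buckets_.size());
        buckets_.push_back(TypeBucket{name, {}});
        by_name_[name] = id;
        return id;
    }

    // Hot path: plain (bounds-checked) array indexing, no hashing.
    TypeBucket& bucket(Type t) { return buckets_.at(t); }

    // Cold path: hashing is only needed to resolve a name to an ID.
    bool lookup(const std::string& name, Type& out) const {
        auto it = by_name_.find(name);
        if (it == by_name_.end()) return false;
        out = it->second;
        return true;
    }

    std::size_t size() const { return buckets_.size(); }

private:
    std::vector<TypeBucket> buckets_;
    std::unordered_map<std::string, Type> by_name_;
};
```

With this layout, a hash would only become necessary on the hot path if IDs had to be derived from content rather than assigned sequentially.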

vsbogd commented 5 years ago

There is a performance hit: AtomTable holds a std::vector of length NUM_TYPES - That would need to be a hash table, and it could slow down atom-space references by 2x maybe. AtomSpace access is already too slow...

@linas, do you mean this vector: https://github.com/opencog/atomspace/blob/7b060552a0d7bf2ba04988aaca929380b576dbdc/opencog/atomspace/AtomTable.h#L88-L89

It looks like it is accessed only in the fairly complex add and extract methods, and the most affected method would be get_handles_by_type, which is used in InitiateSearchCB.h. So it is not clear that the impact would be serious.

vsbogd commented 5 years ago

I would add that adding additional C++ Atom types in external libraries without modifying atomspace code would also be useful.

A common way of supporting such custom extensions is to reserve a range of type IDs for them. Anyone who wants to add a new type should then use a type ID greater than 1000000.

Restoring atoms from SQL whose definitions have been deleted is .. troublesome

If a user decides to use custom extensions, they are responsible for loading the extension code before loading the atomspace from the database. The atomspace could provide an additional check, for example that the extension name equals the expected one, to prevent mixing extensions with the same ID.

ngeiswei commented 5 years ago

I would add that adding additional C++ Atom types in external libraries without modifying atomspace code would also be useful.

Depending on what you mean I believe it is already possible, see for instance https://github.com/opencog/opencog/tree/master/opencog/spacetime/atom-types

vsbogd commented 5 years ago

Depending on what you mean I believe it is already possible, see for instance https://github.com/opencog/opencog/tree/master/opencog/spacetime/atom-types

Yes you are right, but I am not sure that id correctness doesn't depend on library loading sequence.

linas commented 5 years ago

Why would you need a hash table,

Right. It could stay as a vector table

linas commented 5 years ago

adding additional C++ Atom types in external libraries without modifying atomspace code would also be useful.

This is already possible. There are already half a dozen different little libraries that define custom atom types -- for NLP, for the spacetime server, for agi-bio, others.

linas commented 5 years ago

greater than 1000000.

Currently, Type is a short int, so there can be at most 65536 types. The nameserver and classserver allocate type IDs one after another, in linear order.

linas commented 5 years ago

SQL -- prevent mixing extensions with same ID.

This is currently not a problem. The SQL storage stores types by name, and not by ID. So this already avoids almost all naming conflicts. I think there's even a unit test for this. Not sure.

Restoring atoms from SQL whose definitions have been deleted is .. troublesome

I don't know what I was thinking when I wrote that. It's not a problem; it's already automatically handled.
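To illustrate why storing types by name sidesteps ID conflicts, here is a small sketch. It is not the SQL backend: Registry, StoredRow, store and load are hypothetical names, and declaring an unknown type name on load is just one possible policy assumed for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Type = std::uint16_t;  // hypothetical stand-in for the real Type

// Hypothetical session-local registry: IDs depend on declaration order.
class Registry {
public:
    Type declare(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;  // already known
        Type id = static_cast<Type>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    const std::string& name_of(Type t) const {
        assert(t < names_.size() && "unknown type ID");
        return names_[t];
    }

private:
    std::vector<std::string> names_;
    std::unordered_map<std::string, Type> ids_;
};

struct Atom {
    Type type;
    std::string name;
};

// What goes to storage: (type name, atom name), never the numeric ID.
using StoredRow = std::pair<std::string, std::string>;

StoredRow store(const Registry& r, const Atom& a) {
    return {r.name_of(a.type), a.name};
}

// On load, the type name is resolved against the current session;
// assumption: an unknown name is simply declared on the spot.
Atom load(Registry& r, const StoredRow& row) {
    return Atom{r.declare(row.first), row.second};
}
```

Two sessions that declare their types in a different order assign different numeric IDs to the same type, yet a row written by one still resolves to the right type in the other.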

linas commented 5 years ago

I am not sure that id correctness doesn't depend on library loading sequence.

There is no concept of "id correctness". Libraries that define different atom types can be loaded in arbitrary order (well, core has to be loaded first). As long as the libraries are loaded in dependency order, there are no known issues.

linas commented 5 years ago

OK, I crossed out most of the cons -- they aren't issues after all. I'm not sure what I was thinking. In a certain sense, it should not be hard -- de facto, one can already define run-time types using the existing C++ interfaces, with no changes at all. Basically, all you need to do is to call the various methods in NameServer.h and ClassServer.h - call them in the right order, provide the correct arguments, and that's it. You're done.

All that's missing are Scheme and python bindings to NameServer.h, ClassServer.h

There is one big remaining gotcha that needs to be fixed first, however, see next post.

linas commented 5 years ago

There is one big remaining gotcha that needs to be fixed first: Currently, both Values and Atoms are stored in sequential order, and, since Atom is-a kind-of Value, the numeric ID's issued to these are issued in sequential order. There is no gap between the highest Value-type-number, and the lowest Atom-type-number (which is ATOM) This prevents new Values from being inserted into the middle of the sequence. This is already a problem for the SpaceServer Values...

I see two fixes. I prefer the first, because it's lower complexity, and it's safer.

1. At compile time, reserve a dead spot of un-issued ID's between the highest Value-type-number, and the lowest Atom-type-number. Reserving 100 slots for new, user-defined values should be much more than enough, for the current rate of consumption. And if, in the future, 100 is not enough .. change it to 1K and recompile. It's really no big deal.
2. Split the array of Values and the array of Atoms into two. Many intellectual purists will believe that this is somehow "better" or "cleaner", but in fact, this will have a large number of troublesome side-effects all through the code. I don't like it at all. For a while, I was very concerned about this, until I realized that option 1. above is not only a good solution, but is the better solution.
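The first option, reserving a gap, might look roughly like this. It is a sketch under assumptions: GappedAllocator and its methods are hypothetical, the counts are made up, and it is not meant to describe what any actual pull request does.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

using Type = std::uint16_t;  // hypothetical stand-in for the real Type

class GappedAllocator {
public:
    // builtin_values: number of Value types known at compile time.
    // reserved_slots: size of the dead spot left for run-time Values.
    GappedAllocator(Type builtin_values, Type reserved_slots)
        : next_value_(builtin_values),
          atom_base_(static_cast<Type>(builtin_values + reserved_slots)),
          next_atom_(atom_base_) {}

    // New Value types are issued from the reserved gap only.
    Type new_value_type() {
        if (next_value_ >= atom_base_)
            throw std::length_error("gap exhausted; raise the reservation and recompile");
        return next_value_++;
    }

    // Atom types are issued from atom_base_ upward; the first one
    // issued plays the role of ATOM in this sketch.
    Type new_atom_type() {
        assert(next_atom_ < 65535 && "Type is 16 bits wide in this sketch");
        return next_atom_++;
    }

    // The invariant existing code relies on: pure Values sort below ATOM.
    bool is_value_only(Type t) const { return t < atom_base_; }

    Type atom_base() const { return atom_base_; }

private:
    Type next_value_;
    Type atom_base_;
    Type next_atom_;
};
```

The point of the design is that code which tests "less than ATOM" keeps working unchanged, at the cost of a hard upper bound on run-time Value types.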

linas commented 5 years ago

To recap: you can already add new types at run-time, at any time. The core code for this is already written and already works: it's the code in NameServer.h and ClassServer.h. All that is missing are python and scheme bindings to the existing API.

I added one additional concern to the list: we need to discourage naive users from the idea that they will create a million new atom types. So, 1K or 2K new atom types is OK, a million is not. What I fear is that someone clever but naive will decide that it's a good idea to create a new type for each and every protein, or DNA sequence, or human being on the planet, every word in the dictionary, or something like that, and my knee-jerk reaction would be that this is a mis-use of the type system.

I mostly want to restrict new types to those types that actually need to have C++ classes behind them (i.e. "continuations", as the currently fashionable term). If you don't need the C++ class, then you probably don't need the type, and should just use PredicateNode or ConceptNode instead. That's my knee-jerk reaction. Someone would need to invent an entertaining and convincing story for why it should be otherwise...

linas commented 5 years ago

Heh. With NameServer::typeAddedSignal() you can even be notified when someone else adds a new atom type at run-time. I forgot about that. It seems there are only three calls total, for adding run-time types:

    bool beginTypeDecls(const char * module_name);
    void endTypeDecls(void);
    /**
     * Adds a new atom type with the given name and parent type.
     * Return a numeric value that is assigned to the new type.
     */
    Type declType(const Type parent, const std::string& name);

That's it. This is a nearly trivial interface. So, except for making some nice, elegant python and scheme APIs for this, it's, uhh, code-complete already. So there. (I kind-a forgot that this was the case.)
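A self-contained mock can show the call order. Only the three signatures are taken from the snippet above; everything else is assumed for illustration: MockNameServer, onTypeAdded (a plain callback standing in for typeAddedSignal), isA, single inheritance, and the meaning given to the return value of beginTypeDecls.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Type = std::uint16_t;  // hypothetical stand-in for the real Type

class MockNameServer {
public:
    // Assumption: returns true if this module was declared before,
    // so the caller knows it can skip its declarations.
    bool beginTypeDecls(const char* module_name) {
        bool seen_before = !modules_.insert(module_name).second;
        open_ = !seen_before;
        return seen_before;
    }

    void endTypeDecls(void) { open_ = false; }

    // Adds a type under the given parent and returns its numeric ID.
    // IDs are issued sequentially; a root type is its own parent.
    Type declType(const Type parent, const std::string& name) {
        assert(open_ && "declType called outside begin/endTypeDecls");
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        Type id = static_cast<Type>(parents_.size());
        parents_.push_back(parent);
        ids_.emplace(name, id);
        for (const auto& cb : listeners_) cb(id, name);  // notify observers
        return id;
    }

    // Hypothetical stand-in for a type-added signal.
    void onTypeAdded(std::function<void(Type, const std::string&)> cb) {
        listeners_.push_back(std::move(cb));
    }

    // Walks the (single) parent chain up to the root.
    bool isA(Type t, Type ancestor) const {
        while (true) {
            if (t == ancestor) return true;
            Type p = parents_.at(t);
            if (p == t) return false;  // reached the root
            t = p;
        }
    }

private:
    bool open_ = false;
    std::set<std::string> modules_;
    std::vector<Type> parents_;
    std::unordered_map<std::string, Type> ids_;
    std::vector<std::function<void(Type, const std::string&)>> listeners_;
};
```

A caller would bracket its declarations with beginTypeDecls and endTypeDecls, declaring parents before children, for instance an Atom root, then Node under it, then a custom node type under Node.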

vsbogd commented 5 years ago

The SQL storage stores types by name, and not by ID.

Ok, so type id is just temporary id for current running session. Then it should not conflict with anything at all.

There is no gap between the highest Value-type-number, and the lowest Atom-type-number (which is ATOM) This prevents new Values from being inserted into the middle of the sequence.

Why we cannot add next new type of Value at the end of the sequence?

linas commented 5 years ago

Ok, so type id is just temporary id for current running session.

Yes.

Why we cannot add next new type of Value at the end of the sequence?

Good question. I'm not sure. There are a handful of locations in the code which assume that all Values have a numeric code that is less than ATOM. It's some historical precedent that made sense at the time. I do not currently recall the reasoning for this. Maybe all of these can be fixed. Anyway, I just now merged pull req #2192 which fixes this. So this should no longer be an issue.

linas commented 5 years ago

@ngeiswei I think this has a green light. I suggest just closing this issue, and opening two new ones:

Create scheme bindings for Nameserver
Create python bindings for Nameserver

buj commented 5 years ago

I would like to try my hand at this, it does not look too complex.

linas commented 5 years ago

@buj - Go for it.

Although now I am getting cold feet again: if you create new atoms at run-time in one language, none of the other three will have visibility to them (e.g. we have C++, scheme, python, haskell) the current system auto-generates atoms for all four. But if you create new atoms, say, in python only, then they will not be available to scheme, unless you also write some magic code to create the required scheme wrappers. They will not be available to C++, ever, since C++ requires compile-time knowledge.

Finally, if you are defining new atoms in python, and you do this in some python snippet that has to always be run, before you do anything else, then .. how is this better than what we have today?

So, I'm thinking -- maybe this is "premature optimization" -- you are creating a subsystem that no one actually needs. After its creation, that subsystem will require ongoing maintenance, for years, for a decade. Who is going to do the maintenance? Do we need to maintain a system, if there are no users for it?

The long-term maintenance is the scariest thing about all this. The long-term maintenance has been the #1 hardest thing about the atomspace.

buj commented 5 years ago

Although now I am getting cold feet again: if you create new atoms at run-time in one language, none of the other three will have visibility to them (e.g. we have C++, scheme, python, haskell) the current system auto-generates atoms for all four. But if you create new atoms, say, in python only, then they will not be available to scheme, unless you also write some magic code to create the required scheme wrappers. They will not be available to C++, ever, since C++ requires compile-time knowledge.

I'm not sure which problem you're addressing:

Is it that when one application "hooks up" to a run-time (is it even possible?), it will not see (be able to manipulate/create/...) the custom-typed atoms unless the app was written in the same language (and presumably imported some libraries)?
Or is it that the custom defined type will only be usable in that one language? ("one cannot import python module from C++")

Regarding the latter: for small projects this should not be a problem. For large projects... you can still do it the standard way (through C++).

Finally, if you are defining new atoms in python, and you do this in some python snippet that has to always be run, before you do anything else, then .. how is this better than what we have today?

Not in many ways, it is merely convenience, as Nil has mentioned. One way of looking at it is that the type hierarchy is such a core part of atomspace, it does not make sense to be able to add new types through scripting languages---the type system should be rich enough that one does not have to do that. However, I'm not sure we're there yet.

Currently, if one needs some custom defined type, one has to implement it into the C++ core, which (as I understand it) means forking atomspace... This is definitely a problem, especially for smaller projects/prototyping---it would be nicer if one could define these custom types directly from the scripting language.

So, I'm thinking -- maybe this is "premature optimization" -- you are creating a subsystem that no one actually needs.

About the userbase. I think that if a feature is available, it will naturally be picked up by some users. Without the feature, many users would instead find some workaround for the problem---an alternative solution, and not bother bugging the developers with "hey, this feature would be cool, care to add it?". (Though I might be wrong, I'm not sure about the culture here at opencog.) So having useful features for which there is no explicit demand is good.

The question is, is this feature useful? I think I have made some arguments for it in the above paragraphs, but I suppose it would be nice to collectively agree on this before anything is done.

After its creation, that subsystem will require ongoing maintenance, for years, for a decade. Who is going to do the maintenance? Do we need to maintain a system, if there are no users for it?

The long-term maintenance is the scariest thing about all this. The long-term maintenance has been the #1 hardest thing about the atomspace.

That is the cost of implementing it (though I do not think it is too large...). The benefits should outweigh the costs (do they? can we even estimate these two?).

linas commented 5 years ago

I'm not sure which problem you're addressing:

If you are diligent with the API, then you could set it up so that, when a user dynamically creates a new type in python, then that same type also shows up with the correct bindings in guile. This is doable; the problem with it is that it adds complexity.

which (as I understand it) means forking atomspace...

No it doesn't. There are 4 or 5 different projects that add custom types; it's simple, it's easy, see for example https://github.com/opencog/agi-bio/tree/master/bioscience/types which requires maybe 10 lines of code, total.

many users would instead find some workaround for the problem

What's the problem? No one has articulated what the problem actually is.

The question is, is this feature useful?

Ah .. that's what the question really is. And also the next few questions.

That is the cost of implementing it (though I do not think it is too large...). The benefits should outweigh the costs (do they? can we even estimate these two?).

The cost of implementing for scheme-only is fairly low. Maybe 100 or 200 LOC (without unit tests). But if you want types defined in scheme to also automatically show up in python, then it's a lot more complex, requiring listening to and catching signals, possibly redefining the python type subsystem. Definitely needs sophisticated unit tests. So maybe a factor of 10x more lines of code.
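The signal-listening part can be pictured with a toy model. This is a sketch only: TypeHub and Binding are hypothetical, and the replay of earlier declarations to late subscribers is an assumption about one source of the extra complexity, not a description of existing code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical hub that every language binding subscribes to.
class TypeHub {
public:
    using Listener = std::function<void(const std::string&)>;

    // A late subscriber first receives all types declared so far,
    // otherwise it would miss types created before it was loaded.
    void subscribe(Listener l) {
        for (const auto& n : names_) l(n);
        listeners_.push_back(std::move(l));
    }

    // Declaring a type in any language notifies every binding.
    void declare(const std::string& name) {
        assert(!name.empty() && "type name must not be empty");
        names_.push_back(name);
        for (const auto& l : listeners_) l(name);
    }

private:
    std::vector<std::string> names_;
    std::vector<Listener> listeners_;
};

// Hypothetical binding layer: records the wrapper names it would expose.
// The binding must stay at a fixed address for as long as the hub may
// call back, because the listener captures `this`.
struct Binding {
    std::string prefix;
    std::vector<std::string> exposed;

    void attach(TypeHub& hub) {
        hub.subscribe([this](const std::string& n) {
            exposed.push_back(prefix + n);
        });
    }
};
```

Even in this toy form, each binding needs its own wrapper generation, replay handling and lifetime rules, which gives a feel for where the extra lines of code and unit tests would go.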

I mean, if you write this code, and it's reasonably clean and appears to work, I'll probably merge it. It's not really that big a deal. I'm just trying to recommend a design/engineering process that embraces "clean design", a kind of minimalism. Software with Baroque Rococo stylings is ... hard to use, hard to understand, hard to maintain. Always ask yourself "is this really needed?"

ngeiswei commented 5 years ago

Even without auto-generated helpers to access in one language the types created in another, I still think it's a useful feature. But it's certainly not urgent; we can give it some more thought before deciding whether to implement it.

linas commented 1 year ago

See also #2901 as a possible duplicate of this. Also, a specific Atomese API is proposed in #2901

opencog / atomspace

Create atom type at run-time #2190
