Discussion: Stable and unique identifier / label for each Grounded Atom

trueagi-io / hyperon-experimental

MeTTa programming language implementation

https://metta-lang.dev

MIT License

Discussion: Stable and unique identifier / label for each Grounded Atom #427

Open luketpeterson opened 1 year ago

luketpeterson commented 1 year ago

For indexing within a Space, having a unique identifier for each atom is desirable. However, there are a number of additional considerations when assigning these identifiers to grounded atoms.

In a discussion between myself and @vsbogd, we uncovered a tension about what this identifier fundamentally represents; and I believe that tension is rooted in a deeper split between two different conceptual kinds of grounded atom types.

1.) Some grounded atoms represent simple values, for example wrappers around primitive types in the host language. These are like the OpenCog Classic atoms; they're conceptually immutable, and changing them is functionally equivalent to replacing them with a different atom of the same type but with a different value.

2.) Other grounded atoms represent complex mutable data structures, for example atom types that wrap Spaces. We may wish to expose monads to mutate these atoms so MeTTa will be a more "functional" language (see https://github.com/trueagi-io/hyperon-experimental/issues/390) but that's aside from this discussion. Irrespective of how we track and integrate updates, I don't think we want a change to a complex atom to cause a change to its identifier, because this will create considerable bookkeeping complexity and cost.

I believe therefore that the desiderata for the identifiers for the different atom types are different.

For type 1. grounded atoms:

I think we want the identifiers to be computable from each atom's contents.
We'd like the identifiers to be stable regardless of the machine they're computed on (for a future distributed atom space).
This type of identifier is conceptually like a hash

For type 2. atoms:

I think we want the identifiers to be assigned when the atom is created, and not changed by mutating the atom.
These identifiers are volatile, and shouldn't be relied upon across machines, or even runs of the program.

One option is to conceptually separate these types of identifiers in two, by adding a new interface to the Grounded trait, along the lines of:

fn stable_id(&self) -> Option<Some64BitType>;

Therefore, grounded atoms that return a value here can be assumed to be of type 1, and atoms that return None will be assumed to be volatile data structures.
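A minimal sketch of how the two kinds of atom could answer the proposed method. Everything here is an assumption for illustration: the `Grounded` trait below is a stand-in with only the proposed method (not the real hyperon trait), `u64` stands in for `Some64BitType`, and `Number`, `SpaceLike` and the hand-written FNV-1a hash are hypothetical choices.

```rust
/// Hypothetical stand-in for the real `Grounded` trait; only the proposed method is shown.
trait Grounded {
    /// `Some(id)` for value-like atoms (type 1), `None` for mutable structures (type 2).
    fn stable_id(&self) -> Option<u64>;
}

/// FNV-1a, written out by hand so the result does not depend on the machine,
/// the process, or the standard library's unspecified default hasher.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Type 1: an immutable wrapper around a primitive value.
struct Number(i64);

impl Grounded for Number {
    fn stable_id(&self) -> Option<u64> {
        // A fixed byte order keeps the id identical across architectures.
        Some(fnv1a_64(&self.0.to_le_bytes()))
    }
}

/// Type 2: a mutable container, standing in for an atom that wraps a Space.
struct SpaceLike {
    atoms: Vec<String>,
}

impl Grounded for SpaceLike {
    fn stable_id(&self) -> Option<u64> {
        // Identity would be assigned at creation elsewhere, never derived from contents.
        None
    }
}
```

In this sketch two different value types with the same bytes would get the same id; mixing a per-type tag into the hashed bytes is one way to avoid that.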

What do you think, @Necr0x0Der and @Adam-Vandervorst?

Adam-Vandervorst commented 1 year ago

I think the first type can be best thought of as "symbols with an algebra on top". I.e., normal symbols don't support any operations, and these do, but other than that they should be treated the same.

The "conceptual hash" can be achieved in a variety of ways, like flag bits of integers (like Haskell) (or a temporary less messy approach), prefix trees with integer leaves for strings, but the real issue is user-defined types. Users are bad at writing hash functions, and even worse at writing data structures that eliminate the need for them. CZ2 uses namespaces, but this assumes that all values of a given type are unique unless explicitly unified. That is, two classes (even if they share the same value) are considered different (for user types). This is different from built-in types in CZ2 like ints and strings, which are content-addressed.

So even though we may like two types to be distinct, it may not be easy to rely upon for arbitrary grounded atoms.

For networked systems, we may also want to look at longer identifiers, like OpenCog Classic's UUIDs.

There's also the option of being explicit: have a MeTTa function value that returns the values associated with an id, in the case we don't do content-addressing by default.

luketpeterson commented 1 year ago

I think the first type can be best thought of as "symbols with an algebra on top".

That's a very good point! Rather than trying to shoe-horn two conceptually different types of object into Grounded Atom, it might be better to extend Symbol Atom so certain subclasses of symbol atoms can have custom parsing, stringification for display, and native operations that can interact with them. I like this idea a lot.

Adam-Vandervorst commented 1 year ago

Yes, though the "functional programming interpretation" would lay the responsibility on the functions. I.e. instead of symbols overriding (OOP term) it'd be functions implementing support for symbols. Concretely, say you have integer addition add and HDC bind xor: it'd be add's responsibility to load the integers from the symbol identifiers (which would just be a cast or a bitwise op), compute the result, and wrap it back in an identifier, and xor's responsibility to load hypervectors from a store using an index parsed from the identifier, do the calculation, store the result, and put the new index (along with its type identifier) back into a symbol.
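The add and xor examples above can be sketched as follows. The bit layout (8 tag bits, 56 payload bits), the tag values, and the `HypervectorStore` type are all assumptions made for illustration, not anything taken from hyperon or CZ2.

```rust
/// Hypothetical layout: the top 8 bits name the type, the low 56 bits carry the payload.
const TAG_SHIFT: u32 = 56;
const PAYLOAD_MASK: u64 = (1 << TAG_SHIFT) - 1;
const TAG_INT: u64 = 1;
const TAG_HYPERVECTOR: u64 = 2;

fn pack(tag: u64, payload: u64) -> u64 {
    (tag << TAG_SHIFT) | (payload & PAYLOAD_MASK)
}
fn tag_of(id: u64) -> u64 {
    id >> TAG_SHIFT
}
fn payload_of(id: u64) -> u64 {
    id & PAYLOAD_MASK
}

/// Integers are content-addressed: the value itself is the payload
/// (only values that fit in 56 signed bits survive the round trip).
fn int_id(value: i64) -> u64 {
    pack(TAG_INT, value as u64)
}

fn int_value(id: u64) -> Option<i64> {
    if tag_of(id) != TAG_INT {
        return None;
    }
    // Shift left, then arithmetic shift right, to sign-extend the 56-bit payload.
    Some(((payload_of(id) << 8) as i64) >> 8)
}

/// `add` decodes both operands, computes, and wraps the result in an identifier again.
fn add(a: u64, b: u64) -> Option<u64> {
    Some(int_id(int_value(a)?.wrapping_add(int_value(b)?)))
}

/// Hypervectors are too large for an identifier, so the payload is an index into a store.
struct HypervectorStore {
    vectors: Vec<Vec<u64>>,
}

impl HypervectorStore {
    fn insert(&mut self, v: Vec<u64>) -> u64 {
        self.vectors.push(v);
        pack(TAG_HYPERVECTOR, (self.vectors.len() - 1) as u64)
    }

    fn get(&self, id: u64) -> Option<&Vec<u64>> {
        if tag_of(id) != TAG_HYPERVECTOR {
            return None;
        }
        self.vectors.get(payload_of(id) as usize)
    }

    /// `xor` loads both vectors, binds them, stores the result, and returns the new identifier.
    fn xor(&mut self, a: u64, b: u64) -> Option<u64> {
        let bound: Vec<u64> = self.get(a)?.iter().zip(self.get(b)?).map(|(x, y)| x ^ y).collect();
        Some(self.insert(bound))
    }
}
```

Note that in this sketch the integer identifiers are stable across machines, while the hypervector identifiers are only meaningful relative to one particular store.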

Ultimately the two ways of looking at the problem are equivalent, "extending functions to support" is the MeTTa way, but I wouldn't mind a performant backend diverging from it and just dispatching the right methods.

It's worth it to ponder how both options scale to pre-compiled libraries, network stuff, JIT, and more future directions.

Necr0x0Der commented 1 year ago

I don't like the idea of mixing up symbols and grounded atoms in a hard-coded way. Symbols should be fully interpretable in MeTTa. If you want an algebra over symbols, don't implement it in an opaque way in some imperative language. Implement it in MeTTa. Of course, there are always intermediate cases, when we want both symbol-like behavior and efficient computations. But this can be done in other ways without introducing substructures to symbols. Once again, we already have expressions for describing symbolic structures.

While there is indeed some variety of grounded atoms, it's not a dichotomy. For example, numbers as grounded atoms are not mutable, but it does not make much sense to search for a very particular number in an efficient way (if there is such a need, turn these numbers into symbols). Conversely, states are intentionally mutable, but we may want them to be efficiently searchable. A whole neural network wrapped in a grounded atom can have a hash, not only for searching but also for fast comparison (e.g. we've loaded a network to the GPU and want to check whether this is the same network when we want to switch to another checkpoint). We may want to search for images in a huge collection. But that is a different use case from retrieving a MeTTa expression containing a grounded atom wrapping an image.

Hashing, indexing, and retrieval for grounded atoms cannot be done in a universal way, and we cannot demand a hand-crafted hashing function for each type of grounded data, as Adam mentioned. We do want to make searching for grounded atoms fast in some cases, but there are different options. First of all, different spaces may do indexing in different ways (e.g., DAS), and we cannot require them to do this in one way. It is not a bug, but a feature, because different spaces are optimized for different use cases. If one wants to have a space for images, then not just a special hash, but special retrieval mechanisms will be implemented for them (which may or may not retrieve not identical but just similar images, for example).

As a consequence, hashes are optional for grounded atoms, and if they have IDs, these IDs live inside the corresponding spaces. If we want grounded atoms with global UUIDs, the answer will be not to introduce this as a requirement for every grounded atom, but to introduce a special grounded atom, which wraps any data with the requirement for it to have a hash function. Thus, the right question is what use cases we have for the in-RAM space, and wouldn't it be better to solve these cases by other approaches?..

In general, I imagined that each type of grounded atom may (optionally) have its own retrieval mechanism. If it has one, it is responsible for creating an index and searching within this index (it is an extension of custom pattern matching from individual grounded atoms to their types). Relying on the retrieval mechanisms designed for symbolic expressions by providing hashes for individual atoms is a weak approach. I'm not sure, though, to what extent custom indexes for grounded atom types are easy to introduce into the current implementation.
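The "special grounded atom which wraps any data with the requirement for it to have a hash function" could look roughly like the sketch below. The `Hashed` wrapper and its method names are hypothetical; the only real API used is the standard library's `Hash`, `Hasher` and `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical opt-in wrapper: only data that implements `Hash` can be wrapped,
/// so ordinary grounded atoms are never forced to provide a hash.
struct Hashed<T: Hash> {
    data: T,
    hash: u64, // computed once at construction
}

impl<T: Hash> Hashed<T> {
    fn new(data: T) -> Self {
        // `DefaultHasher::new()` gives repeatable results within one build, but its
        // algorithm is unspecified, so these values are not usable as global UUIDs.
        let mut hasher = DefaultHasher::new();
        data.hash(&mut hasher);
        let hash = hasher.finish();
        Hashed { data, hash }
    }

    /// Fast comparison, e.g. "is this the checkpoint that is already loaded?".
    /// Equal hashes do not strictly prove equal data; a careful caller would
    /// confirm with a full comparison.
    fn same_content(&self, other: &Self) -> bool {
        self.hash == other.hash
    }
}
```

For the neural-network case mentioned above, `T` could be the serialized weights (for instance a `Vec<u8>`), so that switching checkpoints only compares two 64-bit values.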

luketpeterson commented 1 year ago

Based on the above, I go down the following chain of logic. Please correct me if I've made some leaps that don't logically follow.

1.) A Space is fundamentally responsible for keeping track of its atoms

2.) Grounded atom interfaces are broadly customizable and it's not realistic to shoehorn all grounded atoms into something like a 64bit GUID.

3.) Linearly calling Grounded::match_ on all atoms in a space for each query is a non-starter.

Therefore:

4.) Grounded atoms can only be matched in a Space if either

A.) The space is aware of the particular structure of the GroundedAtom type, and is thus able to index it, or

B.) The GroundedAtom is embedded in an expression that otherwise matches a query, and narrows down the possible atoms to a very small set.

But a big class of GroundedAtoms should be efficiently indexable in the default Space implementation. So we're back to an optional interface, as was proposed initially.
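A toy illustration of that conclusion, under loud assumptions: the `Atom` enum, the `Space` struct and their methods below are hypothetical and much simpler than the real hyperon types. Atoms that offer an identifier go into a hash index; the rest can only be found by scanning, which is the cost point 3 objects to.

```rust
use std::collections::HashMap;

/// Toy atoms: `Number` is value-like and offers an id, `State` is mutable and does not.
#[derive(PartialEq)]
enum Atom {
    Number(i64),
    State(String),
}

impl Atom {
    fn stable_id(&self) -> Option<u64> {
        match self {
            Atom::Number(n) => Some(*n as u64), // the value is its own identifier
            Atom::State(_) => None,
        }
    }
}

#[derive(Default)]
struct Space {
    atoms: Vec<Atom>,
    by_id: HashMap<u64, Vec<usize>>, // positions of atoms that offered an id
    unindexed: Vec<usize>,           // positions of atoms that returned None
}

impl Space {
    fn add(&mut self, atom: Atom) {
        let pos = self.atoms.len();
        match atom.stable_id() {
            Some(id) => self.by_id.entry(id).or_default().push(pos),
            None => self.unindexed.push(pos),
        }
        self.atoms.push(atom);
    }

    /// Returns the matching positions and how many atoms had to be compared.
    fn query(&self, wanted: &Atom) -> (Vec<usize>, usize) {
        let candidates: &[usize] = match wanted.stable_id() {
            Some(id) => self.by_id.get(&id).map(|v| v.as_slice()).unwrap_or(&[]),
            None => self.unindexed.as_slice(), // no id: fall back to a linear scan
        };
        let hits: Vec<usize> = candidates
            .iter()
            .copied()
            .filter(|&i| &self.atoms[i] == wanted)
            .collect();
        (hits, candidates.len())
    }
}
```

With a thousand numbers and two states in the space, looking up a number compares a single candidate, while looking up a state compares every unindexed atom.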

Necr0x0Der commented 1 year ago

1.) A Space is fundamentally responsible for keeping track of its atoms

I'd say, yes. I'd say, definitely yes, if we consider specific grounded atom types as a part of some space (atm, it is not precisely technically true, but still conceptually true).

2.) Grounded atom interfaces are broadly customizable and it's not realistic to shoehorn all grounded atoms into something like a 64bit GUID.

I'd say, yes. At best, we can have a global ID for all atoms for some concrete default or broadly used type of space (say, DAS, but not necessarily), if this is a particular feature of this space. But I'd say that this should not be the case for the in-RAM space of the interpreter.

3.) Linearly calling Grounded::match_ on all atoms in a space for each query is a non-starter.

Linearly calling it is definitely not scalable. There is a benchmark of a non-hashed State for this (https://github.com/trueagi-io/hyperon-experimental/blob/main/lib/benches/states.rs). Apparently, it is feasible only for rather small spaces.

4.) A.)

Kind of yes. The wording is a little vague, so it depends on the precise meaning of "space is aware".

4.) B.)

yes

But a big class of GroundedAtoms should be efficiently indexable in the default Space implementation. So we're back to an optional interface, as was proposed initially.

Yes, but the question is in the details. The interface should be optional. Big classes of GroundedAtoms can be responsible for indexing and retrieval. The indexing and retrieval mechanism for such big classes should not be hardcoded in the default Space implementation. It should be kept modular and open to custom extension. Whether an index built by a custom GroundedAtom type is a part of a Space is where the wording becomes a little vague.

In any case, there can be a default indexing / querying implementation, which requires only hashes from those GroundedAtom types that are OK with it. Thus, one can implement just a hash function and be happy. However, the Space querying mechanism should not rely on these hashes. Instead, it should delegate querying to GroundedAtom types. The tricky thing is that adding / removing atoms should also require calling the corresponding methods of such GroundedAtom type implementations. One may propose to go further and create another, more basic implementation of such a generic class without indexing. In this case, all grounded atom types will have the same interface (obligatory, but with two default simplified implementations). I'm not sure what the computational overheads and inconveniences of this approach are, so I'm not insisting on it.
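One way to picture this delegation is sketched below. All names (`TypeIndex`, `HashIndex`, `PrefixScan`, and this toy `Space`) are hypothetical, payloads are reduced to strings, and "retrieval of similar items" is reduced to prefix matching; none of it reflects the current hyperon implementation.

```rust
use std::collections::HashMap;

/// Hypothetical per-type index: the space notifies it of changes and forwards
/// queries to it, but never looks inside.
trait TypeIndex {
    fn on_add(&mut self, pos: usize, payload: &str);
    fn on_remove(&mut self, pos: usize, payload: &str);
    fn query(&self, pattern: &str) -> Vec<usize>;
}

/// Default implementation: exact lookup, needing nothing but a hash of the payload.
#[derive(Default)]
struct HashIndex {
    by_payload: HashMap<String, Vec<usize>>,
}

impl TypeIndex for HashIndex {
    fn on_add(&mut self, pos: usize, payload: &str) {
        self.by_payload.entry(payload.to_string()).or_default().push(pos);
    }
    fn on_remove(&mut self, pos: usize, payload: &str) {
        if let Some(positions) = self.by_payload.get_mut(payload) {
            positions.retain(|&p| p != pos);
        }
    }
    fn query(&self, pattern: &str) -> Vec<usize> {
        self.by_payload.get(pattern).cloned().unwrap_or_default()
    }
}

/// Custom retrieval with no hash at all: a linear scan that returns "similar"
/// entries, here approximated as entries sharing a prefix with the pattern.
#[derive(Default)]
struct PrefixScan {
    entries: Vec<(usize, String)>,
}

impl TypeIndex for PrefixScan {
    fn on_add(&mut self, pos: usize, payload: &str) {
        self.entries.push((pos, payload.to_string()));
    }
    fn on_remove(&mut self, pos: usize, _payload: &str) {
        self.entries.retain(|(p, _)| *p != pos);
    }
    fn query(&self, pattern: &str) -> Vec<usize> {
        self.entries.iter().filter(|(_, s)| s.starts_with(pattern)).map(|(p, _)| *p).collect()
    }
}

#[derive(Default)]
struct Space {
    indexes: HashMap<String, Box<dyn TypeIndex>>, // one opaque index per grounded type
    next_pos: usize,
}

impl Space {
    fn register(&mut self, type_name: &str, index: Box<dyn TypeIndex>) {
        self.indexes.insert(type_name.to_string(), index);
    }
    /// Adding an atom calls into the type's index; unknown types are rejected here.
    fn add(&mut self, type_name: &str, payload: &str) -> Option<usize> {
        let index = self.indexes.get_mut(type_name)?;
        let pos = self.next_pos;
        self.next_pos += 1;
        index.on_add(pos, payload);
        Some(pos)
    }
    fn remove(&mut self, type_name: &str, pos: usize, payload: &str) {
        if let Some(index) = self.indexes.get_mut(type_name) {
            index.on_remove(pos, payload);
        }
    }
    /// Querying is delegated; the space itself relies on no hash.
    fn query(&self, type_name: &str, pattern: &str) -> Vec<usize> {
        self.indexes.get(type_name).map(|index| index.query(pattern)).unwrap_or_default()
    }
}
```

The two implementations correspond loosely to the "hash only" default and the "no indexing" basic variant described above; a real design would also need to decide who owns the index data.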

vsbogd commented 1 year ago

1.) A Space is fundamentally responsible for keeping track of its atoms

Actually, this is vague to me. What does "keeping track" specifically mean here?

2.) 3.)

I agree with Alexey's answers above.

4.) A)

It is a question of to what degree the atomspace should be aware of a structure. I see a few different degrees of awareness here:

The space is aware that the atom has some custom index, and the space is responsible for passing to the atom's type all the information necessary to support this index (for instance, whether an atom of this type was added or removed). The space keeps the content of the index as part of itself but doesn't know how this index works, so the index itself is a black box for the space.
The space implements specific algorithms to index grounded atoms of specific types (for instance, a specific kind of space to keep and search images).

In the first case we have more universal space implementations and move grounded atom indexing into a separate abstraction, the "grounded atom type", but any space can work with any kind of grounded atom. In the second case we have more specific space implementations, and handling grounded atoms with different indexing logic under a single space becomes tricky and requires making the space implementation more and more complex.

Thinking about using DAS: in the first case, the DAS implementation needs to somehow keep an index with an unknown structure inside and keep it in a distributed manner, which is not possible without additional knowledge about the internal index structure. In the second case, DAS should implement many indexes for the grounded atoms in a distributed manner.

vsbogd commented 1 year ago

adding @andre-senna in the thread

luketpeterson commented 1 year ago

I am seeing a common theme connecting this thread with https://github.com/trueagi-io/hyperon-experimental/issues/408 , https://github.com/trueagi-io/hyperon-experimental/issues/409 , the part of this comment concerning unwieldy StateAtom usage: https://github.com/trueagi-io/hyperon-experimental/pull/433#issuecomment-1722134412 , @ngeiswei's issue here: https://chat.singularitynet.io/chat/pl/kzpzkioextryfnnmas9zzk5c5o and certainly other issues / points of confusion.

Basically they all come down to names for certain grounded atoms, and when a grounded atom is mixed up with a symbol.

This might be controversial, but how about letting grounded atoms have an Optional string name? Spaces resolve symbols matching the name (in the appropriate context) as references to the grounded atom.

vsbogd commented 1 year ago

This might be controversial, but how about letting grounded atoms have an Optional string name?

Assigning some name to the atom can be done by implementing #134 (and I was going to implement it). Another question is how to construct and add the (= <name> <grounded>) expression for the new grounded atom. It can easily be done from Rust or Python by calling space.add_atom(), so I think adding names to grounded atoms is unnecessary. We could also implement a variant of bind! which doesn't modify the tokenizer but instead adds such an equality to the atomspace.
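As a rough picture of that bind! variant, here is a self-contained toy model. The `Atom` enum, `Space` struct, `bind_name` and `resolve` are hypothetical and only mimic the shape of the idea; the real hyperon `Atom` and space APIs are richer and are not used here.

```rust
/// Toy atom model for illustration only.
#[derive(Clone, Debug, PartialEq)]
enum Atom {
    Symbol(String),
    Grounded(i64), // stands in for any grounded value
    Expr(Vec<Atom>),
}

#[derive(Default)]
struct Space {
    atoms: Vec<Atom>,
}

impl Space {
    fn add_atom(&mut self, atom: Atom) {
        self.atoms.push(atom);
    }

    /// Adds `(= <name> <grounded>)` to the space; no tokenizer is involved.
    fn bind_name(&mut self, name: &str, grounded: Atom) {
        self.add_atom(Atom::Expr(vec![
            Atom::Symbol("=".to_string()),
            Atom::Symbol(name.to_string()),
            grounded,
        ]));
    }

    /// Finds the atom that `name` was equated with, if any.
    fn resolve(&self, name: &str) -> Option<&Atom> {
        self.atoms.iter().find_map(|atom| match atom {
            Atom::Expr(items) => match items.as_slice() {
                [Atom::Symbol(eq), Atom::Symbol(n), value] if eq == "=" && n == name => Some(value),
                _ => None,
            },
            _ => None,
        })
    }
}
```

Because the name lives in an ordinary expression, it is scoped to the space that holds it and can be queried or removed like any other atom.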

Necr0x0Der commented 1 year ago

I am seeing a common theme connecting this thread with...

Yes

This might be controversial, but how about letting grounded atoms have an Optional string name?

Please, see my comment. https://github.com/trueagi-io/hyperon-experimental/issues/409#issuecomment-1742055716

While grounded atoms with optional names are a possible partial solution, I believe that we need to take other aspects into consideration. From the AGI-ish / cognitive architecture perspective, symbols can be built on top of subsymbolic patterns, and subsymbolic data / functions can originate from "compiling" symbolic declarative knowledge. Thus, I'd prefer to have more flexible and richer mechanisms (possibly partly describable in MeTTa itself) for associating symbols and grounded atoms, rather than having string names or hash-like identifiers optionally attached to grounded atoms.