trueagi-io / hyperon-experimental

MeTTa programming language implementation
https://metta-lang.dev
MIT License

Discussion: Name conflict of the tokens which define grounded atoms on atomspace import #408

Open luketpeterson opened 1 year ago

luketpeterson commented 1 year ago

Background example:

The Rust stdlib contains the following definitions:

    (= (if True $then $else) $then)
    (= (if False $then $else) $else)

The Python stdlib contains this:

        r"True|False": lambda token: ValueAtom(token == 'True', 'Bool')

Loading the Rust stdlib followed by the Python stdlib means that the following fails to resolve:

!(if True 1 0)

Obviously in this case we can just load the modules in the reverse order. But in general, how can we handle user-supplied modules and accidental namespace collisions that result in unintended redeclaration?

Some ideas:

1.) Error / warning messages when an atom is replaced, to avoid unintended namespace collisions.
2.) A syntactic difference between replacing an existing atom and declaring a brand new one.
3.) Automatic re-processing of dependent expressions, so that replacing an atom propagates through the whole space.
*.) Implementing https://github.com/trueagi-io/hyperon-experimental/issues/351 is the right solution for True|False but doesn't generalize.

Each of these has semantic implications which might not be desired, so I want to hear the opinions of others. However, silently orphaning the expressions containing the old atom seems like bad behavior.
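
For concreteness, here is a standalone sketch (hypothetical types, not hyperon's actual code) of why the expression stops resolving: grounded atoms produced by different tokenizers are distinct values that do not match, even when they represent the same boolean.

    // Hypothetical stand-ins for the two grounded booleans; hyperon's real atom
    // types are not used here.
    use std::any::Any;

    #[derive(Debug, PartialEq)]
    struct RustBool(bool); // what the Rust stdlib tokenizer produces for "True"

    #[derive(Debug, PartialEq)]
    struct PyBool(bool);   // what the Python stdlib tokenizer produces for "True"

    // Naive grounded-atom matching: atoms match only if concrete type and value agree.
    fn grounded_match(a: &dyn Any, b: &dyn Any) -> bool {
        if let (Some(x), Some(y)) = (a.downcast_ref::<RustBool>(), b.downcast_ref::<RustBool>()) {
            return x == y;
        }
        if let (Some(x), Some(y)) = (a.downcast_ref::<PyBool>(), b.downcast_ref::<PyBool>()) {
            return x == y;
        }
        false
    }

    fn main() {
        let in_rule = RustBool(true); // captured when the Rust stdlib code was parsed
        let in_query = PyBool(true);  // produced when !(if True 1 0) is parsed later
        // The rule's True and the query's True no longer match, so the rule is orphaned.
        assert!(!grounded_match(&in_rule, &in_query));
    }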

Adam-Vandervorst commented 1 year ago

I think there should be a "Meta space" which contains information about the interpreter, the spaces, and grounding. In this way, a repl (in Python, Rust, or something else) can check (by just querying) for things like this occurring, and warn the user. I.e., imho this question resolves to "where do we make meta information available", not "how do we tackle redeclarations in particular".

As for namespaces, in theory these could be regular spaces in the "root" meta space. To do this, we'll have to put work into describing the grounded operations (names, types, docs, maybe fallback/transparency functionality, definition site) in MeTTa itself.

Necr0x0Der commented 1 year ago

@luketpeterson, yes, this is an old problem. We want to have Python booleans and numbers within the Python MeTTa runner for better interoperability. At the same time, we want to have general definitions for some constructions in stdlib independent of the host language. There are several possible solutions to this problem. For example, we could have pure symbols for booleans, and explicit conversion functions to grounded Rust / Python atoms for when someone wants them. We've also been discussing with @vsbogd the possibility of implementing automatic conversion of grounded types between Rust and Python spaces. Yet another solution would be to keep everything as symbols and resolve them not at parse time, but at runtime...
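
As a rough illustration of the "pure symbols plus explicit conversion" option (the function name below is hypothetical, not an existing API): True/False stay plain symbols in the space and are converted to a host value only when someone asks for one.

    // Hypothetical helper: booleans live in the atomspace as the symbols "True"/"False";
    // host code converts them explicitly when a grounded value is actually needed.
    fn symbol_to_bool(sym: &str) -> Option<bool> {
        match sym {
            "True" => Some(true),
            "False" => Some(false),
            _ => None, // not a boolean symbol; no implicit conversion happens
        }
    }

    fn main() {
        assert_eq!(symbol_to_bool("True"), Some(true));
        assert_eq!(symbol_to_bool("foo"), None);
    }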

Adam-Vandervorst commented 1 year ago

Automatic conversion just sounds like yet another footgun, @Necr0x0Der. Even the simple case of integers is messy: Python has arbitrary-width integers while Rust has no native equivalent; floating point is taken to be IEEE 754 in most new programming languages, while the hardware we want to interop with is leaving it behind; and that is not even touching the mess that String interop is. I think we should do much more work during runtime, and stop thinking of it in terms of fixed phases. We can introduce phases as we want, but MeTTa is fundamentally long-lived, and we had better exploit that in situations like this. What I was arguing for is not just to solve this problem by doing it at runtime, but to make the information available to MeTTa programs, where we can handle it in whatever way we like.

vsbogd commented 1 year ago

My understanding is that this issue is more about name conflict handling. I agree the overall problem is wider, but Luke raised the separate issue #351 about Python/Rust/whatever grounded data interoperability. For name conflict resolution we should already have all the information inside the MeTTa runner.

I don't think 2) makes much sense, especially as it requires developing additional syntax and so on. I would expect the new atom to shadow the old one, and the interpreter either warns me about this or just silently replaces it. As we have the unusual situation where the new atom is shadowed by the old one, I would expect a warning/error, and thus I would implement 1) before we resolve it properly by working on #351.
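
As a minimal sketch of what 1) could look like at the token-registration level (the structure below is illustrative only, not the actual Tokenizer):

    // Illustrative only: warn when a token pattern that is already registered gets
    // registered again, instead of silently shadowing the old definition.
    struct TokenRegistry {
        patterns: Vec<String>,
    }

    impl TokenRegistry {
        fn register(&mut self, pattern: &str) {
            if self.patterns.iter().any(|p| p.as_str() == pattern) {
                eprintln!("warning: token `{}` is already defined and will be shadowed", pattern);
            }
            self.patterns.push(pattern.to_string());
        }
    }

    fn main() {
        let mut reg = TokenRegistry { patterns: Vec::new() };
        reg.register(r"True|False"); // first definition: fine
        reg.register(r"True|False"); // second definition: prints a warning
    }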

3) is more complex than 1). I am not even sure we can do this at the moment in the general case, because we have grounded atoms, not tokens, in the atomspace, and it is difficult to find all grounded atoms which correspond to the old token definition. On the other hand, it implements the behaviour which is probably expected by most users: replacing the old token definition with the new one. Replacing atoms doesn't sound easy, and I would think in another direction; for example, maybe we should keep grounded data differently. We could keep a reference to the definition in the atom and replace this definition when the token definition is replaced. Then the actual data is constructed only when it is required to pass it into a grounded function, and it should probably be immutable. Maybe we need to separate immutable and mutable data, and tokens can return only the immutable kind.
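
A standalone sketch of that indirection (all names hypothetical): the atom keeps a shared reference to the token definition plus the original text, and the grounded data is constructed only when it is needed, so replacing the definition also affects atoms that were already created.

    use std::cell::RefCell;
    use std::rc::Rc;

    // The replaceable part: how to turn token text into (immutable) grounded data.
    struct TokenDef {
        constructor: Box<dyn Fn(&str) -> bool>,
    }

    // The atom stores the original text and a shared, replaceable definition
    // instead of an already-constructed grounded value.
    struct LazyGroundedAtom {
        text: String,
        def: Rc<RefCell<TokenDef>>,
    }

    impl LazyGroundedAtom {
        // Construct the grounded data only when it is actually required.
        fn value(&self) -> bool {
            (self.def.borrow().constructor)(&self.text)
        }
    }

    fn main() {
        let def = Rc::new(RefCell::new(TokenDef {
            constructor: Box::new(|t| t == "True"),
        }));
        let atom = LazyGroundedAtom { text: "True".into(), def: Rc::clone(&def) };
        assert!(atom.value());

        // Another module replaces the token definition; the existing atom picks it up.
        def.borrow_mut().constructor = Box::new(|t| t.eq_ignore_ascii_case("true"));
        assert!(atom.value());
    }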

That last idea is vague and sounds more like a question for #351. Thus I would suggest implementing 1) as a tactical solution and thinking about the proper grounded data representation in #351.

andre-senna commented 1 year ago

I also think 2) doesn't make sense. Or it does, but it would add unnecessary complexity to the syntax. It's quite possible that you don't know whether an atom is already defined in a given space but want to define it anyway.

And I think 3) adds unnecessary complexity to the semantics of atom redefinition. I don't see a clear use-case for this. So I believe 1) (perhaps with the possibility for the user to define the level of warning/error severity) is the way to go here.

patham9 commented 1 year ago

I agree, 1) seems simple and good. As an additional related point, a very convenient feature of MeTTa is that functions can be defined over multiple lines. For instance

(= (test) 1)
(= (test) 2)

could be intentional and not necessarily meant as a re-definition, as both lines together are the same as (= (test) (superpose (1 2))). Hence, maybe the solution is to raise a warning only if the same thing is defined across multiple lines with other stuff in between, e.g. the above would not give a warning, but the following would:

(= (test) 1)
(= (other) 3)
(= (test) 2)

Adam-Vandervorst commented 1 year ago

The superpose and two-line variant are equivalent, and imo this equivalence should be enforced whenever superpose is not in a closure.

You run into interspersed definitions quickly when you're automatically generating code, say type classes for ADTs:

> !(derive Functor Pair)
(= (map $f (Pair $x $y)) (Pair ($f $x) ($f $y)))
(= (void (Pair $_ $_)) (Pair () ()))

> !(derive Functor Nat)
(= (map $f (S $x)) (S ($f $x)))
(= (map $f Z) Z)
(= (void (S $_)) (S ()))
(= (void Z) Z)

I propose tackling MeTTa's long-running nature more holistically.

vsbogd commented 1 year ago

The issue is about redefining tokens which generate grounded atoms, so it doesn't affect multi-line function declarations.

luketpeterson commented 10 months ago

I think we need to decide on a good general solution, or this will continue to bite module authors. With the module system refactor, I tried to convert the core stdlib into a self-contained MeTTa module and the Python stdlib extensions into another MeTTa module. But this issue makes that impossible.

A hack-around solution for stdlib is to load all tokens for both the Python stdlib and the core stdlib, and then to re-load the core stdlib atoms from MeTTa code using the new tokenizer. We can do this for stdlib because we have total control over it, but this approach totally defeats the point of modularity for third-party modules.

One idea is to update the import behavior to re-resolve the atoms upon import, but I don't like this for a number of reasons:

1.) Mainly, the atoms would get updated and re-resolved at import time, but then they could be invalidated by additional tokens, etc. added after the import.
2.) Currently we don't save the pre-tokenized representation of atoms, and we can't guarantee round-trip fidelity between stringified representations and atom representations.
3.) It means we need to scan the whole space being imported, as opposed to the current behavior of just creating an embedded Space atom.

So I think trying to fix this with import is a step in the wrong direction.

@vsbogd's suggestion here would fix this particular issue: https://github.com/trueagi-io/hyperon-experimental/issues/409#issuecomment-1695120518 . But given the conversation around it here: https://github.com/trueagi-io/hyperon-experimental/issues/409#issuecomment-1742055716 , https://github.com/trueagi-io/hyperon-experimental/issues/427#issuecomment-1742059186 , and elsewhere, I think there may be more we need to consider.

Also related to https://github.com/trueagi-io/hyperon-experimental/issues/510

luketpeterson commented 10 months ago

Working through this, the implementation details are not entirely clear to me. Specifically, considering the "if" example from this issue:

If we choose to replace the pure MeTTa declarations from stdlib (currently)

    (= (if True $then $else) $then)
    (= (if False $then $else) $else)

with a grounded CondOp to represent "if", then I could see how we can postpone the tokenization / parsing of the True / False atom. This would probably be fine for "if", but might be problematic because it effectively creates a "1-way street" where any kind of atom can be an argument to a grounded atom, but only exact grounded atoms can be arguments to pure MeTTa.

But is it possible to (somehow??) keep the pure MeTTa declaration for "if"? For example, something like:

    (= (if %(match-if-can-become True)% $then $else) $then)
    (= (if %(match-if-can-become False)% $then $else) $else)

That would cause the interpreter to match the second atom with any symbol or grounded atom that could be converted into the provided symbol. I don't know whether a conversion API or an equivalence-checking API is the better way to go, or whether we'd want to extend this to be more general. I haven't considered all of the possibilities yet.
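
As a very rough sketch of what such an equivalence-checking API might look like (the name can_become is hypothetical, mirroring the pseudo-syntax above): a grounded value reports which plain symbols it is willing to match, so a pure-MeTTa pattern written with the symbol True could still apply to it.

    // Illustrative only: a grounded boolean that can report the symbols it "can become".
    struct GroundedBool(bool);

    impl GroundedBool {
        fn can_become(&self, symbol: &str) -> bool {
            matches!((self.0, symbol), (true, "True") | (false, "False"))
        }
    }

    fn main() {
        assert!(GroundedBool(true).can_become("True"));
        assert!(!GroundedBool(true).can_become("False"));
    }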

Maybe there is already a way to do this elegantly with minimal MeTTa. What do you think @vsbogd ?

vsbogd commented 10 months ago

I think the simplest solution is to divide the Rust standard library into three modules:

We already have a separate Python-specific tokens module. My understanding is that such a division should be possible even at the level of MeTTa modules, according to the last specification from @Necr0x0Der.

Then we could load a different set of libraries for Python and Rust:

It means we would have different libraries for Python and Rust and would load them separately, but on the other hand they will contain mostly the same set of modules; only one module is different. @luketpeterson what do you think?

vsbogd commented 10 months ago

This question is related to the design of MeTTa modules (#470), but it is raised in this issue and is related to my comment above, so I will put it here.

Before explaining the idea, let's state the task. Say we have three modules A, B and C. A imports B, and B imports C. Each module has its own unique tokens; let's name them TA, TB and TC. What are the scopes of the tokens? Logically, module A uses TA and TB but not TC, because TC is not re-exported from B. Module B uses TB and TC. Module C uses TC only. For the sake of clarity I am not mentioning standard library tokens here, because we can consider the standard library as yet another imported module.

This task is not easy to implement using a single instance of Tokenizer we have now; each of the modules should have its own instance of the Tokenizer. Moreover, there are really two tokenizers per module: the Tokenizer which is exported by the module (for instance, module B exports TB) and the Tokenizer which is used to parse the code of the module (module B uses the union of TB and TC). Module B could also re-export tokenizer TC; then its exported tokenizer is the union of TB and TC as well.

In order to represent both instances of the Tokenizer in code without duplicating the information, we could introduce several Tokenizer implementations. Thus the Tokenizer may become a trait with a single method, find_token. We could introduce two implementations of the trait. The first one would be similar to the current Tokenizer structure: it registers tokens and implements the Tokenizer trait (SingleTokenizer). The second one would be a union of other Tokenizer implementations (UnionTokenizer). Each module would contain two instances of the Tokenizer trait: the first one to export and the second one to parse the module's MeTTa code.

Making a union of Tokenizer instances may lead to token conflicts. One way to resolve this is to turn such conflicts into runtime import errors. But then we need to implement an algorithm which finds conflicts between different regexes, which is not an easy task, at least without diving into regex internals. The other way to resolve a conflict is to assign priorities to the underlying tokenizers, for instance making the first imported tokenizer have lower priority than the last one. The module's own tokenizer should have the highest priority.

vsbogd commented 10 months ago

This idea can be applied to the problem above in the following way. We could split the standard library tokenizer into three parts: TCommon, TRust and TPython. Let's say we have modules Common, Rust and Python, and each module exports the corresponding tokenizer. Then we could form two standard library modules, StdLibRs and StdLibPy, each of which uses the union tokenizer of Common plus Rust or Python, together with the same MeTTa code of the standard library.
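
To make the shape of this concrete, here is a dependency-free sketch of the proposal (tokens are matched by exact text instead of regexes, and none of this is hyperon's existing code): a Tokenizer trait with find_token, a SingleTokenizer per module, and a UnionTokenizer where later parts take priority, composed into StdLibRs and StdLibPy.

    // Illustrative sketch only; the real Tokenizer works with regexes and atom constructors.
    trait Tokenizer {
        // Return a label for the constructor of the first matching token, if any.
        fn find_token(&self, text: &str) -> Option<&str>;
    }

    // Roughly the current structure: an ordered list of registered tokens.
    struct SingleTokenizer {
        tokens: Vec<(String, String)>, // (token text, constructor label)
    }

    impl Tokenizer for SingleTokenizer {
        fn find_token(&self, text: &str) -> Option<&str> {
            self.tokens
                .iter()
                .find(|(pat, _)| pat.as_str() == text)
                .map(|(_, ctor)| ctor.as_str())
        }
    }

    // A union of other tokenizers; later parts take priority, so a module's own
    // tokenizer is pushed last and wins over anything it imports.
    struct UnionTokenizer<'a> {
        parts: Vec<&'a dyn Tokenizer>,
    }

    impl<'a> Tokenizer for UnionTokenizer<'a> {
        fn find_token(&self, text: &str) -> Option<&str> {
            self.parts.iter().rev().find_map(|t| t.find_token(text))
        }
    }

    fn main() {
        let t_common = SingleTokenizer { tokens: vec![("if".into(), "common-if".into())] };
        let t_rust = SingleTokenizer { tokens: vec![("True".into(), "rust-bool".into())] };
        let t_python = SingleTokenizer { tokens: vec![("True".into(), "py-bool".into())] };

        // The same MeTTa source of the standard library would be parsed with either union.
        let stdlib_rs = UnionTokenizer { parts: vec![&t_common, &t_rust] };
        let stdlib_py = UnionTokenizer { parts: vec![&t_common, &t_python] };

        assert_eq!(stdlib_rs.find_token("True"), Some("rust-bool"));
        assert_eq!(stdlib_py.find_token("True"), Some("py-bool"));
        assert_eq!(stdlib_py.find_token("if"), Some("common-if"));
    }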

luketpeterson commented 10 months ago

I like your idea about scoping the Tokenizer a lot. I am in total agreement that a "layered" approach to the Tokenizer is probably going to be needed, especially to stop tokens from "leaking" where they shouldn't. This is connected to https://github.com/trueagi-io/hyperon-experimental/issues/511

However, the problem in this GitHub issue isn't solved by the Tokenizer alone. The problem is caused when the MeTTa code of the Rust stdlib is turned into atoms. The Rust stdlib's space ends up including an expression that contains a grounded True atom, created by the Rust stdlib's tokenizer. When we link the sub-space using a nested space atom, the original grounded True atom doesn't match the atom that True will parse into using the current top-level Tokenizer.

So fixing it means we need to do one of the following:

A.) Store the True atom in the space such that it gets reinterpreted using the current tokenizer.
B.) Re-parse the stdlib MeTTa code (and all imported modules' code) in light of the current tokenizer, instead of importing the atoms via embedded-space linking.
C.) Make the True atom from the Rust stdlib match the True atom from the Python stdlib using the grounded atom match function.

In this specific case, C.) is definitely the best answer, IMO (generalizing the solution by implementing https://github.com/trueagi-io/hyperon-experimental/issues/351). This works nicely for grounded values.
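
A tiny sketch of what C.) amounts to (types hypothetical, in the spirit of the #351 generalization): the grounded match compares the boolean value each atom represents rather than which stdlib created it.

    // Illustrative only: two grounded boolean flavours that agree to match by value.
    trait BoolValue {
        fn as_bool(&self) -> bool;
    }

    struct RustBool(bool);
    struct PyBool(bool);

    impl BoolValue for RustBool {
        fn as_bool(&self) -> bool { self.0 }
    }
    impl BoolValue for PyBool {
        fn as_bool(&self) -> bool { self.0 }
    }

    // The custom match function: equal values match, whichever stdlib produced them.
    fn bools_match(a: &dyn BoolValue, b: &dyn BoolValue) -> bool {
        a.as_bool() == b.as_bool()
    }

    fn main() {
        assert!(bools_match(&RustBool(true), &PyBool(true)));  // (= (if True ...) ...) applies again
        assert!(!bools_match(&RustBool(true), &PyBool(false)));
    }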

But is that enough to catch all the practical problems (the ones that will trip people up) related to conflicts between atoms and tokens?

Necr0x0Der commented 10 months ago

I'd add my 5 cents:

This task is not easy to implement using a single instance of Tokenizer we have now

Yes. That's why I was inclined to consider modules not as spaces but as runners. There was a simple prototype https://github.com/trueagi-io/hyperon-experimental/tree/main/python/sandbox/resolve exploring this idea. In general, this issue should be considered together with the design of namespaces. Importing into &self may indeed cause conflicts, but outside of stdlib it should not be considered good practice to import a whole module into &self (like from lib import * in Python); instead we may either import particular tokens (from lib import Name) or import a module as a namespace and access its tokens via that namespace.

vsbogd commented 10 months ago

When we link the sub-space using a nested space atom, the original grounded True atom doesn't match the atom that True will parse into using the current top-level Tokenizer.

I probably was not specific enough in my comment above. What I mean is that we would have two standard libraries which have different tokenizers (built from the common and the language-specific parts) and the same MeTTa code parsed by those tokenizers. Thus True in the Rust stdlib is a Rust boolean value, and True in the Python stdlib is a Python boolean value. When some new MeTTa module is imported, the language-specific tokenizer is used, and the boolean value corresponding to the platform is used.

vsbogd commented 10 months ago

That's why I was inclined to consider modules not as spaces but as runners.

I believe considering modules as runners may not allow us to resolve issues like #354. We need some runner which is able to execute functions from every imported space at the same time. I would suggest instead making atoms unique on import (resolving conflicts by raising a runtime error or according to priority) and then using a global runner which is not affected by conflicts because the atoms are unique.

Necr0x0Der commented 10 months ago

One doesn't exclude the other. If we consider namespaces, we'll still see the need to have separate tokenizers for modules, because module A can import tokens from module B while module C may not import them, so the tokens from module B should not be in the global runner.