trueagi-io / hyperon-experimental

MeTTa programming language implementation
https://metta-lang.dev
MIT License

Performance + Partitioning of Spaces Question #774

Open nworb999 opened 1 month ago

nworb999 commented 1 month ago

Hello,

I'm considering using MeTTa for a conversational AI application and have some questions about its performance with large datasets.

In the OpenCog Atomspace Metagraphs paper, it's mentioned that:

We’ve worked with natural language, genomics and robotics datasets.

This was in the context of discussing how "the arity of hyperedges is typically small." However, I couldn't find much information about performance considerations in the MeTTa documentation.

  1. Can MeTTa promise similar in-RAM performance as described in the original Atomspace paper?
  2. Are there any benchmarks available for MeTTa, particularly for large datasets or game AI applications?
  3. I was planning to use Redis as a key-value store to partition some of my spaces thematically (e.g., storing the contents of people_kb.metta as a value). Would this approach be beneficial, or is it only necessary for exceptionally large datasets?
  4. Are there any best practices or recommendations for optimizing MeTTa performance with large datasets?

Any insights or resources you can provide would be greatly appreciated before I start implementing my project.

Thank you for your time and assistance!

vsbogd commented 1 month ago

Hi @nworb999 ,

  1. Can MeTTa promise similar in-RAM performance as described in the original Atomspace paper?

The Hyperon-experimental codebase's performance was significantly improved recently, but there is still room for improvement, of course. At the moment I am working on a better implementation of the in-memory atomspace, which uses a more compact representation and should work faster for large knowledge bases.

It is also worth mentioning two other implementations that have focused on performance from the very beginning: https://github.com/trueagi-io/metta-wam and https://github.com/trueagi-io/MORK. @TeamSPoon and @Adam-Vandervorst can say more about them.

  2. Are there any benchmarks available for MeTTa, particularly for large datasets or game AI applications?

Hyperon-experimental has a couple of performance tests (microbenchmarks, really); you can find them at https://github.com/trueagi-io/hyperon-experimental/tree/main/lib/benches. I think I will be able to provide something more solid after finishing the new atomspace implementation.

  3. I was planning to use Redis as a key-value store to partition some of my spaces thematically (e.g., storing the contents of people_kb.metta as a value). Would this approach be beneficial, or is it only necessary for exceptionally large datasets?

Do you mean that Redis would keep different atomspaces under different keys, and that you want to load/unload them into memory before processing? I'm not sure what goal you are pursuing here. Hyperon-experimental is not able to work with Redis out of the box. Do you mean that loading a knowledge base into memory from Redis should be faster than loading it from disk?

  4. Are there any best practices or recommendations for optimizing MeTTa performance with large datasets?

There is the DAS (Distributed AtomSpace) project, https://github.com/singnet/das, which is an implementation of distributed atomspace storage. It is integrated with hyperon-experimental to some extent. @andre-senna, who leads the DAS project, and @CICS-Oleg, who integrates it with hyperon-experimental, can say more about this.

Adam-Vandervorst commented 1 month ago

Hi @nworb999 , do you have some example MeTTa files and queries for your application? I would love to work through them with MORK.

nworb999 commented 1 month ago

Thank you so much for your responses!

Do you mean Redis will keep different atomspaces under different keys and you want to load/unload them into memory before processing?

Yes, the idea is to add a service that grabs the relevant atomspaces at run-time for different users when a conversation starts -- different users will have different "memories" from past conversations, as well as a shared pool of general information that will be loaded on start-up (main and memory in the example below).

I know that loading from Redis won't necessarily be faster than loading from disk; I was more curious whether there was a more conventional way of partitioning atomspaces by user in a "database" for an application with multiple users. I've only worked with SQL/NoSQL databases in the past for production applications, and haven't used something like atomspaces before. If we had 100,000 users, I wouldn't normally keep 100,000 growing flat files locally at all times.

import redis
from hyperon import MeTTa

metta = MeTTa()

# decode_responses=True makes Redis return str instead of bytes,
# which is what metta.run() expects
redis_client = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

def get_metta_script(script_name):
    """Retrieve a MeTTa script from Redis, or an empty program if the key is missing."""
    script_content = redis_client.get(f"metta_script:{script_name}")
    return script_content or ""

# shared pool of general information plus the user's "memories"
main_script = get_metta_script("main")
memory_script = get_metta_script("memory")

metta.run(memory_script)
metta.run(main_script)

However, it seems like the level of complexity + scale of our data will be small by MeTTa's standards (compared to robotics datasets, for example).

Here is some stubbed-out data; I haven't finished the queries for it yet, but they will be pretty basic CRUD operations.

(GameType Tag)

(GameRule Tag (Tagged) (switch))

(GameRole Tag Predator)
(GameRole Tag Prey)

(GameInstance Tag1)
(GameRoleInstance Tag1 Predator Tony)
(GameRoleInstance Tag1 Prey Polly)

(GameAction Tag (switch) ( ( (Predator $x) (Prey $y)) ( (Predator $y) (Prey $x))))

(GameEvent Tag1 (Tagged))

; (GameRoleInstance Tag1 Predator Polly) overwrite with new values
; (GameRoleInstance Tag1 Prey Tony)

An example of something we would do is manage facts about objects, create a game, create game rules, game roles, a game instance plus role instances, game actions, and game events, as well as fetch each of these types of data and update GameRoleInstances. The functionality is very basic, but scaled to a hundred thousand "rows", I was wondering if there was an argument for thematically sharding atomspaces to only be pulled into memory (metta.run()) by the application when needed.

That was what we had in mind for Redis as well, on top of storing users' specific knowledge bases separately.
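To make the update path concrete, here is a rough sketch of those CRUD operations through the Python bindings. It assumes only the MeTTa().run() API used above plus the stdlib match / add-atom / remove-atom operations; the remove-then-add pattern is one way the "overwrite" in the commented lines could be expressed, since atoms are not updated in place.

from hyperon import MeTTa

metta = MeTTa()

# Load (a shortened version of) the stubbed-out game data above into &self.
metta.run('''
  (GameInstance Tag1)
  (GameRoleInstance Tag1 Predator Tony)
  (GameRoleInstance Tag1 Prey Polly)
''')

# Read: list the role assignments for game Tag1.
roles = metta.run('!(match &self (GameRoleInstance Tag1 $role $player) ($role $player))')
print(roles)  # expected to print something like [[(Predator Tony), (Prey Polly)]]

# Update ("overwrite"): atoms are not modified in place, so the role swap
# from the commented lines above becomes remove-atom + add-atom.
metta.run('''
  !(remove-atom &self (GameRoleInstance Tag1 Predator Tony))
  !(remove-atom &self (GameRoleInstance Tag1 Prey Polly))
  !(add-atom &self (GameRoleInstance Tag1 Predator Polly))
  !(add-atom &self (GameRoleInstance Tag1 Prey Tony))
''')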

vsbogd commented 1 month ago

I was more curious whether there was a more conventional way of partitioning atomspaces by user in a "database" for an application with multiple users.

One way is to keep each user's database in a separate atomspace. These atomspaces can then be inserted into a main atomspace as atoms. Here is a small example written in MeTTa to demonstrate the idea:

; add new instance of the knowledge base for the user by name
(= (add-user $name)
  (let $space (new-space)
    (add-atom &self (User $name $space))) )

; add passed data into user's knowledge base
(= (write-user $name $data)
   (let $space (match &self (User $name $s) $s)
     (add-atom $space $data)) )

; gets data from the user's knowledge base
(= (read-user $name $query $template)
   (let $space (match &self (User $name $s) $s)
     (match $space $query $template)) )

; add Alice's KB
!(add-user Alice)
; add Bob's KB
!(add-user Bob)

; add Alice's age
!(write-user Alice (Age 20))
; add Bob's age
!(write-user Bob (Age 22))

; read Alice's age
!(read-user Alice (Age $age) (Age Alice $age))
; read Bob's age
!(read-user Bob (Age $age) (Age Bob $age))
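If run as written, the last two evaluations should yield (Age Alice 20) and (Age Bob 22), respectively.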

If the root atomspace is kept in DAS (which I mentioned above), it should provide persistence out of the box and you should not need to worry about the atomspace size. I think @andre-senna, or someone from the DAS team, or maybe @CICS-Oleg could explain whether this example works with the current DAS version.

If the root atomspace is kept in an in-memory space, then one needs to persist it manually. Unfortunately, I think persisting a nested atomspace doesn't work out of the box, but I believe it should be relatively easy to write in Python.

I was wondering if there was an argument for thematically sharding atomspaces to only be pulled into memory (metta.run()) by the application when needed.

Yes, that is a possible way to manage it if the size of the data is too big. One can write code similar to the example above in Python. It should load the atomspace with the user's data and add it as an atom into the root atomspace, for instance when the user opens a session, and delete the corresponding atomspace atom from the root atomspace when the user closes the session. I believe this should work more correctly than metta.run().
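A rough Python sketch of that session flow, building on the MeTTa helpers above; the user_kb:<name> Redis key scheme, the remove-user helper, the line-by-line replay, and the str(atom)-based serialization are illustrative assumptions rather than an existing API.

import redis
from hyperon import MeTTa

metta = MeTTa()
redis_client = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

# Helpers from the MeTTa example above, plus a hypothetical remove-user
# that detaches a user's space from the root atomspace.
metta.run('''
  (= (add-user $name)
     (let $space (new-space)
       (add-atom &self (User $name $space))))
  (= (write-user $name $data)
     (let $space (match &self (User $name $s) $s)
       (add-atom $space $data)))
  (= (read-user $name $query $template)
     (let $space (match &self (User $name $s) $s)
       (match $space $query $template)))
  (= (remove-user $name)
     (let $space (match &self (User $name $s) $s)
       (remove-atom &self (User $name $space))))
''')

def open_session(user):
    """Attach a fresh space for the user and replay their persisted facts into it."""
    metta.run(f'!(add-user {user})')
    kb_text = redis_client.get(f'user_kb:{user}') or ''
    for line in kb_text.splitlines():
        line = line.strip()
        if line and not line.startswith(';'):
            metta.run(f'!(write-user {user} {line})')

def close_session(user):
    """Persist the user's facts back to Redis and detach their space."""
    results = metta.run(f'!(read-user {user} $fact $fact)')
    facts = results[0] if results else []
    redis_client.set(f'user_kb:{user}', '\n'.join(str(atom) for atom in facts))
    metta.run(f'!(remove-user {user})')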

CICS-Oleg commented 1 month ago

I think @andre-senna, or someone from the DAS team, or maybe @CICS-Oleg could explain whether this example works with the current DAS version.

We are testing the consistency of the current versions of MeTTa and DAS at the moment, but it seems like those examples should work.

andre-senna commented 1 month ago

@nworb999 @CICS-Oleg @vsbogd

We are testing the consistency of the current versions of MeTTa and DAS at the moment, but it seems like those examples should work.

Yes, I also think so. However, I should remark that DAS is still experimental code (even compared to the MeTTa interpreter), so its public API is very unstable. The main side effect of this is that the integration with the MeTTa interpreter is also unstable. We expect to have a reasonably stable version of the API by the end of the year. You can read more about the ideas behind DAS here: https://github.com/singnet/das and our current planned development roadmap here: https://github.com/singnet/das/discussions/41

nworb999 commented 1 month ago

Thank you all!