nasa-jpl / rosa

ROSA 🤖 is an AI Agent designed to interact with ROS1- and ROS2-based robotics systems using natural language queries. ROSA helps robot developers inspect, diagnose, understand, and operate robots.
https://github.com/nasa-jpl/rosa/wiki
Apache License 2.0

[New Feature]: Support for Ollama and local models #18

Open Cavalletta98 opened 2 weeks ago

Cavalletta98 commented 2 weeks ago

Have you checked for duplicate issue tickets?

Yes - I've already checked

Have you considered alternative solutions to your feature request?

Yes - and alternatives don't suffice

Is your feature request related to any problems? Please help us understand if so, including linking to any other issue tickets.

No

A clear and concise description of your request.

Hi, I'm trying to use Ollama with local LLMs in ROSA. I've already tested ROSA with OpenAI and it works very well. With Ollama, it seems that it does not call the tools. Do you have plans to integrate ROSA with Ollama and local LLMs?

RobRoyce commented 2 weeks ago

Thanks for the comment. Local models are high on the list of priorities right now.

Looking at the documentation for ChatOllama, it does support tool calling. The method for binding tools also looks identical to what we're doing in ROSA.
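
For reference, a minimal sketch (not ROSA's actual code) of the binding pattern in question, which is identical for ChatOllama and other LangChain chat models:

from langchain_ollama import ChatOllama
from langchain_core.tools import tool

@tool
def echo(text: str) -> str:
    """Echo the input text back."""
    return text

# ChatOllama exposes the same bind_tools interface that ROSA already
# relies on, so tool binding itself should not require agent changes.
llm = ChatOllama(model="llama3.1", temperature=0.0)
llm_with_tools = llm.bind_tools([echo])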

Would you be willing to share your code, logs, output, or error messages so we can help diagnose?

RobRoyce commented 2 weeks ago

Note: I tried Ollama with Llama 3.1 8B, with very poor results (see below). I'm going to test the 70B model soon, hopefully with better results.

[screenshot: Llama 3.1 8B run with very poor results]
RobRoyce commented 2 weeks ago

Interestingly, Llama 3.1 8B does get close to working (see below). This might indicate that, while the model does choose the correct tool and parameters, it isn't responding with output in the format required by the LangChain AgentExecutor.

[screenshot: Llama 3.1 8B run that nearly works, with correct tool and parameters]
Cavalletta98 commented 2 weeks ago

Thanks for the test. I was testing Llama 3.1 8b, but with custom tools different from the turtle example. I just noticed that, while with OpenAI the tools are correctly called, with Llama 3.1 it answers with an explanation without calling the tools. I think the problem is the model and not ChatOllama, which should support tools.
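
One quick way to isolate this (a rough sketch, reusing the move_forward tool from the script below) is to bind a single tool directly to ChatOllama and inspect the raw response:

from langchain_ollama import ChatOllama
from langchain_core.tools import tool

@tool
def move_forward(distance: float) -> str:
    """Move the robot forward by the specified distance."""
    return f"Moving forward by {distance} units."

llm = ChatOllama(model="llama3.1", temperature=0.0).bind_tools([move_forward])
response = llm.invoke("Move forward 2 units.")

# If tool_calls is empty and content is a prose explanation, the model
# itself is skipping the tool call, pointing at the model rather than
# at ChatOllama or ROSA.
print(response.tool_calls)
print(response.content)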

Cavalletta98 commented 2 weeks ago
from rosa import ROSA, RobotSystemPrompts
from langchain_ollama import ChatOllama
from langchain.agents import tool
from rich.console import Console
from rich.markdown import Markdown
from rich.prompt import Prompt
from rich.text import Text

llm = ChatOllama(model="llama3.1", temperature=0.0)

@tool
def move_forward(distance: float) -> str:
    """
    Move the robot forward by the specified distance.

    :param distance: The distance to move the robot forward.
    """
    # Your code here ...
    print("I'm moving of ",distance)
    return f"Moving forward by {distance} units."

prompts = RobotSystemPrompts(
    embodiment_and_persona="You are a cool robot that can move forward."
)

rosa = ROSA(ros_version=2, llm=llm, tools=[move_forward], prompts=prompts)

console = Console()
greeting = Text("\nHi! I'm the ROSA agent 🤖. How can I help you today?\n")
greeting.stylize("frame bold blue")

console.print(greeting)
while True:
    user_input = Prompt.ask("Agent Chat")
    if user_input == "exit":
        console.print("Bye Bye")
        break
    output = rosa.invoke(user_input)
    console.print(Markdown(output))

This is the piece of code that I'm currently testing, and here is the output for one request: [screenshot: the model replies with an explanation instead of calling the tool]

RobRoyce commented 2 weeks ago

Thanks for the feedback.

I had a feeling that the number of base tools was too much for an 8b model, and I was correct. If you remove all but 4-5 tools, it actually works with Llama 3.1 8b (I tried with ros2 node list, ros2 topic list, and a custom move_forward tool).

However, 4 or 5 tools is clearly not enough for a general purpose agent. I am currently testing Llama3.1 70b and will report back soon.

Either way, we'll provide a new, more intuitive interface for model selection, which will include Ollama.

RobRoyce commented 1 day ago

Update: ROSA is working with Llama 3.1 8b!

Not sure why I didn't catch this the first time around, but the default context size for Ollama with llama3.1:8b is 2K tokens, while the base ROSA prompt comes closer to 4K tokens.

It turns out that you can set the context size like so:

llm = ChatOllama(
    model="llama3.1",
    num_ctx=8192,  # default is 2048, too small for the ~4K-token base ROSA prompt
    temperature=0.0,
)

Doing it this way, we do not need to remove any of the tools or make any modifications to the agent whatsoever. In addition, I tested Llama 3.1 70b, and it also works for ROSA. That said, inference time is significantly higher.
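
If you want to sanity-check whether a given prompt fits in the context window, LangChain can produce a rough token count (its default counter uses a GPT-2 tokenizer rather than Llama's, so treat the result as a ballpark figure):

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", num_ctx=8192, temperature=0.0)

# Approximate token count for a rendered prompt; requires the
# `transformers` package and is only an estimate for Llama models.
prompt_text = "..."  # substitute the full system prompt + tool schemas
print(llm.get_num_tokens(prompt_text))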


Benchmarks

I wanted to compare performance between the two models, both for latency and quality of results. I ran the tests on a relatively capable machine with an RTX 4060. The discrepancy is very likely due to the fact that the GPU only has 8GB of memory, so the 70b model suffers extreme memory-transfer bottlenecks. All tests were performed using the TurtleSim demo (caveat: the TurtleSim demo has a very small list of topics, nodes, etc., so YMMV).
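
For anyone who wants to reproduce the final-response timings, here is a rough sketch (not the exact harness used for the numbers below; timing the first tool call would additionally require a LangChain callback handler):

import time

from langchain_ollama import ChatOllama
from rosa import ROSA

llm = ChatOllama(model="llama3.1", num_ctx=8192, temperature=0.0)
rosa = ROSA(ros_version=2, llm=llm)

for query in ["Reset the sim", "Give me a list of nodes"]:
    start = time.perf_counter()
    rosa.invoke(query)
    # Wall-clock time from query submission to final response.
    print(f"{query!r}: {time.perf_counter() - start:.2f}s")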

Query: Reset the sim
Difficulty: Very Low

| Model | 1st Tool Call | Final Response | Result |
| --- | --- | --- | --- |
| llama3.1:8b | 10s | 11.34s | Success |
| llama3.1:70b | 1m55s | 2m52s | Success |

Query: Give me a list of nodes
Difficulty: Low

| Model | 1st Tool Call | Final Response | Result |
| --- | --- | --- | --- |
| llama3.1:8b | 5.8s | 7s | Success |
| llama3.1:70b | 2m16s | 2m52s | Success |

Query: Give me a list of nodes, topics, services, params, and log files
Difficulty: Medium

| Model | 1st Tool Call | Final Response | Result |
| --- | --- | --- | --- |
| llama3.1:8b | 12.3s | 18.2s | Success |
| llama3.1:70b | 1m56s | 11m32s | Success |

Query: Move forward 1 unit, then turn left 45 degrees, then move forward 2 units
Difficulty: High

| Model | 1st Tool Call | Final Response | Result | Reason |
| --- | --- | --- | --- | --- |
| llama3.1:8b | 8.24s | 10.37s | Fail | Failed to convert degrees to radians before turning |
| llama3.1:70b | 5m18s | 6m24s | Fail | Incorrect sequence of tool calls |
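
For context on that 8b failure: turtlesim's rotation commands expect radians, so the 45-degree turn needed a conversion first, e.g.:

import math

# 45 degrees in radians, the unit turtlesim's angular commands expect.
print(math.radians(45))  # ~0.7854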

Query: Draw a 5-point star
Difficulty: Very High

| Model | 1st Tool Call | Final Response | Result | Reason |
| --- | --- | --- | --- | --- |
| llama3.1:8b | 6.8s | 18.8s | Fail | Incorrect parameters used |
| llama3.1:70b | 2m46s | 7m48s | Fail | Incorrect sequence of tool calls |

Note: This particular query results in several intermediate steps, each of which must happen in the correct order.


Conclusion

For most applications (especially those using only the core ROSA class without custom tools), the 8b model is likely to be sufficient. If you need higher accuracy, you can use the 70b model, but be prepared for significantly higher latency. If you do need the 70b model, consider a more powerful GPU (A6000 or higher) or a dedicated device with unified memory (e.g. Jetson AGX Orin).

> [!IMPORTANT]
> Make sure you set `temperature=0.0` and `num_ctx >= 8192` for both models.