Good question! In Trace, you can have multiple optimizers, each attending to its own set of parameters. The paths where those parameters are used can be shared. After calling `backward`, each optimizer will receive its respective minimal subgraph.
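For instance, here is a rough, untested sketch of two optimizers that each own one parameter but share a computation path (the values and feedback strings are placeholders; each `step()` call would query the configured LLM):

```python
from opto.trace import node
from opto.optimizers import OptoPrime

x = node(1.0, trainable=True)
y = node(2.0, trainable=True)

opt_x = OptoPrime([x])  # attends only to x
opt_y = OptoPrime([y])  # attends only to y

output = x * y + 1      # shared path that uses both parameters

opt_x.zero_feedback()
opt_x.backward(output, "Output should be larger.", retain_graph=True)
opt_x.step()            # sees only the minimal subgraph relevant to x

opt_y.zero_feedback()
opt_y.backward(output, "Output should be larger.")
opt_y.step()            # sees only the minimal subgraph relevant to y
```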
Is this what you meant? It would be helpful if you could clarify your scenario. E.g., do you mean using different optimizers for different parameters? Do you mean optimizing them at different frequencies? Do different traces touch shared parameters?
I think it will be clearest if you can write down a toy example of such a case, so I can better help :)
Hi, thanks a lot. Here is a toy example:
```python
from opto.trace import bundle, node
from opto.optimizers import OptoPrime

# Define a shared function with optimization capability
@bundle(trainable=True)
def shared_function(data, param):
    """Shared function processes data with parameter adjustments."""
    return data * param + 1

# Define the AgentMonitoring class with its own optimizer
class AgentMonitoring:
    def __init__(self, name, param1, param2):
        self.name = name
        self.param1 = node(param1, trainable=True)
        self.param2 = node(param2, trainable=True)
        self.optimizer = OptoPrime([self.param1, self.param2])

    @bundle()
    def process_step1(self, input_data):
        """First step of agent processing."""
        return shared_function(input_data, self.param1)

    @bundle()
    def process_step2(self, intermediate_data):
        """Second step of agent processing."""
        return shared_function(intermediate_data, self.param2)

    @bundle()
    def process(self, input_data):
        """Complete processing pipeline for the agent."""
        intermediate = self.process_step1(input_data)
        return self.process_step2(intermediate)

    def optimize(self, feedback):
        """Optimize agent parameters based on feedback."""
        self.optimizer.backward(self.param1, feedback)
        self.optimizer.backward(self.param2, feedback)
        self.optimizer.step()

# Define inter-agent workflow optimization with a specific optimizer
@bundle()
def agent_interaction_loop(agent1, agent2, data, iterations, optimizer, optimize_every):
    """Defines workflow between two agents with iterative refinement."""
    result = data
    for i in range(iterations):
        step1 = agent1.process(result)
        result = agent2.process(step1)
        if (i + 1) % optimize_every == 0:
            print(f"Inter-agent optimization step at iteration {i + 1} for {agent1.name}-{agent2.name}")
            optimizer.backward(result, "Output should be larger.")
            optimizer.step()
    return result

# Define multi-agent workflow optimization with specific inter-agent optimizers
@bundle()
def multi_agent_workflow_loop(agents, data, iterations, workflow_optimizer, optimize_every):
    """Multi-agent collaborative optimization with diverse actions."""
    result = data
    interaction_optimizers = [
        OptoPrime([agents[j].param1, agents[j].param2, agents[j + 1].param1, agents[j + 1].param2])
        for j in range(len(agents) - 1)
    ]
    for i in range(iterations):
        # Inter-agent interactions
        for j, optimizer in enumerate(interaction_optimizers):
            result = agent_interaction_loop(
                agents[j], agents[j + 1], result, iterations=1, optimizer=optimizer, optimize_every=1
            )
        # Independent agent optimizations
        for agent in agents:
            result = agent.process(result)
            agent.optimize("Individual agent optimization feedback.")
        # Periodic optimization for the entire workflow
        if (i + 1) % optimize_every == 0:
            print(f"Multi-agent optimization step at iteration {i + 1}")
            workflow_optimizer.backward(result, "Workflow optimization feedback.")
            workflow_optimizer.step()
    return result

# Set up agents with distinct parameters
agent1 = AgentMonitoring("Agent1", param1=1.0, param2=1.5)
agent2 = AgentMonitoring("Agent2", param1=2.0, param2=2.5)
agent3 = AgentMonitoring("Agent3", param1=3.0, param2=3.5)

# Initialize data and workflow optimizer
data = node(10, trainable=False)
workflow_optimizer = OptoPrime([
    agent1.param1, agent1.param2,
    agent2.param1, agent2.param2,
    agent3.param1, agent3.param2
])

# Run multi-agent workflow with specific inter-agent optimizers
result = multi_agent_workflow_loop(
    [agent1, agent2, agent3],
    data,
    iterations=10,
    workflow_optimizer=workflow_optimizer,
    optimize_every=5
)

print(f"Final result of multi-agent workflow: {result.data}")

# Print optimized parameters
print(f"Optimized Parameters:")
print(f"Agent1: param1={agent1.param1.data}, param2={agent1.param2.data}")
print(f"Agent2: param1={agent2.param1.data}, param2={agent2.param2.data}")
print(f"Agent3: param1={agent3.param1.data}, param2={agent3.param2.data}")
```
This is really helpful. I haven't tried running the code, but basically you want the same parameters to be optimized by multiple optimizers with different feedback, right? For example, by `workflow_optimizer`, `interaction_optimizers`, and the optimizer in each agent. That usage shouldn't cause issues in Trace. But one thing that seems to be missing in the example is calling `zero_feedback` at the right place, e.g., before each `backward` (I'm not sure if that is intentional). Also, I think you need to set `retain_graph=True` in `backward` (it's set to `False` by default) in order to call `backward` multiple times on the same graph. So you likely need something like the following, similar to how you would use PyTorch:
```python
output = compute(input)  # the optimizers attend to parameters in this process

optimizer1.zero_feedback()
optimizer1.backward(output, feedback1, retain_graph=True)
optimizer1.step()

optimizer2.zero_feedback()
optimizer2.backward(output, feedback2, retain_graph=True)
optimizer2.step()

...

optimizerN.zero_feedback()
optimizerN.backward(output, feedbackN, retain_graph=False)
optimizerN.step()
```
Another thing that's off in the example is that `shared_function` is not really optimized, as its parameter is not given to any optimizer; a rough sketch of how that could be wired up is below. Hope this helps.
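Here is a rough, untested sketch of registering that code parameter with its own optimizer. The `parameter` attribute name is my assumption about how the trainable bundle exposes its code node, so please double-check it against the API:

```python
# Assumption: a @bundle(trainable=True) function exposes its trainable
# code node as `shared_function.parameter` -- verify this attribute name.
code_optimizer = OptoPrime([shared_function.parameter])

out = shared_function(node(10), node(1.0, trainable=True))
code_optimizer.zero_feedback()
code_optimizer.backward(out, "The function should also square the data.")
code_optimizer.step()  # may rewrite shared_function's code
```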
Thanks so much for your detailed response and for pointing out those issues in this toy code. You're right: I missed calling `zero_feedback` before each `backward` call, and I wasn't aware that I needed to set `retain_graph=True` when performing multiple backward passes on the same graph.
I understand that `retain_graph=True` allows calling `backward` more than once on the same node. Is there any unexpected behaviour after it has been called several times by, for example, optimizer2? Should that be prevented by calling `zero_feedback()` at the right time?
Regarding `shared_function`, you're correct that its parameter is not being optimized. This is toy code and I do not yet need to optimize the code of a shared function, but it will happen, so I am wondering if you have already had such a case. Would calling `step()` from optimizer2 update the function's code based on the latest version of the code updated by any optimizer that has called `step()`, or only based on the latest `step()` call from optimizer2?
I have to admit that I am still not clear about the status of shared nodes when they are trained by, or just used by, several optimizers.
> I understand that `retain_graph=True` allows calling `backward` more than once on the same node. Is there any unexpected behaviour after it has been called several times by, for example, optimizer2? Should that be prevented by calling `zero_feedback()` at the right time?
Each call to `backward` adds a new propagated graph to the parameter's `feedback` dict. (`feedback` is a dict of lists, where the keys are child nodes and each list holds the propagated feedback received from that child.) So when `backward` is called multiple times with `retain_graph=True`, a list may contain more than one entry.
Whether this causes issues depends on how the optimizer aggregates the feedback dict. Currently, the optimizer implementation performs a full aggregation: it combines the feedback (graphs) within each child's list, and then merges the aggregated graphs from the different children into a single graph. We currently assume graphs can only be combined when their output feedback is identical (otherwise, an error is thrown).
So "is there any unexpected behaviour after being called several times by for example optimizer2 ?" Likely in your scenario you will see a runtime error if the feedback provided in each backward
is different and you don't call zero_feedback
. If you have zero_feedback
called before the next backward, the previous logged propagated feedback (graph) in the feedback dict will be cleared, so you can provide different output feedback in calling backward
.
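To make the timing concrete, here is a rough, untested sketch of both situations (placeholder values and feedback strings; each `step()` call queries the configured LLM):

```python
from opto.trace import node
from opto.optimizers import OptoPrime

w = node(1.0, trainable=True)
opt = OptoPrime([w])
output = w * 3 + 1

# Identical feedback in both backward calls: the propagated graphs
# accumulate in w's feedback dict and can still be aggregated.
opt.zero_feedback()
opt.backward(output, "Output should be larger.", retain_graph=True)
opt.backward(output, "Output should be larger.", retain_graph=True)
opt.step()

# Different feedback on the same graph: clear the previously logged
# feedback first, otherwise the aggregation may raise an error.
opt.zero_feedback()
opt.backward(output, "Output should be an even number.")
opt.step()
```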
> Would calling `step()` from optimizer2 update the function's code based on the latest version of the code updated by any optimizer that has called `step()`, or only based on the latest `step()` call from optimizer2?
If, say, `shared_function` is linked to optimizer 1 and optimizer 2, and you have already performed `optimizer1.step`, then optimizer 2 will see the latest version of `shared_function` as updated by optimizer 1. When a node is shared across optimizers, it is still a single object; multiple optimizers simply have access to that object and can each update it. If you want to decouple them, you need to create different node instances.
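As a rough illustration (untested sketch, placeholder values):

```python
from opto.trace import node
from opto.optimizers import OptoPrime

# Shared: one node object registered with two optimizers. Whichever
# optimizer steps first, the other sees the updated value afterwards.
shared_w = node(1.0, trainable=True)
opt_a = OptoPrime([shared_w])
opt_b = OptoPrime([shared_w])

# Decoupled: separate node instances with the same initial value.
# Each optimizer now updates its own copy independently.
w_a = node(1.0, trainable=True)
w_b = node(1.0, trainable=True)
opt_a2 = OptoPrime([w_a])
opt_b2 = OptoPrime([w_b])
```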
Hope this answers your questions.
Thank you
Is it possible to support different traces & optimization on shared functions/nodes using decorators? What would be the best practice?
In my case, I need to do different traces & optimization on shared code files/functions:
Thanks a lot