projectmesa / mesa

Mesa is an open-source Python library for agent-based modeling, ideal for simulating complex systems and exploring emergent behaviors.
https://mesa.readthedocs.io
Apache License 2.0

Multi-Agent data collection #348

Closed Corvince closed 1 month ago

Corvince commented 7 years ago

Dear all,

This is actually how I came across this issue: I wanted to activate different agent types sequentially (but both randomly), so I used two different schedulers, but this broke the data collection. Currently, agent reporters get their agents hard-coded from model.scheduler.agents, assuming it exists and failing if your scheduler is named differently. One way to fix this would be to (optionally?) supply the scheduler to the DataCollector.
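For illustration, a minimal sketch of the failure mode (class names hypothetical; the point is that neither scheduler is stored under the single attribute name the collector looks for):

from mesa import Model
from mesa.time import RandomActivation
from mesa.datacollection import DataCollector

class TwoScheduleModel(Model):
    def __init__(self):
        super().__init__()
        # two schedulers, so neither is stored as the one scheduler attribute
        self.wolf_schedule = RandomActivation(self)
        self.sheep_schedule = RandomActivation(self)
        self.datacollector = DataCollector(agent_reporters={"pos": lambda a: a.pos})

    def step(self):
        self.wolf_schedule.step()
        self.sheep_schedule.step()
        # fails: the collector looks for the model's scheduler attribute,
        # which doesn't exist under that name here
        self.datacollector.collect(self)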

The downside is that if you have agents that are not part of any schedule you still can't collect data for them. That's already a problem right now, so it wouldn't worsen the situation, but maybe someone has a better long-term solution to this?

Also, even with a single scheduler, there seems to be no way to collect data from different agent types. If, for example, you want to collect the wealth of some of your agents but not all agents have a wealth attribute, the collection fails.
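Again a minimal sketch (hypothetical classes): only Trader has wealth, so a plain lambda reporter breaks as soon as it reaches a Grass agent:

from mesa import Agent
from mesa.datacollection import DataCollector

class Trader(Agent):
    def __init__(self, unique_id, model):
        super().__init__(unique_id, model)
        self.wealth = 10

class Grass(Agent):
    pass  # no wealth attribute

# collect() raises AttributeError: 'Grass' object has no attribute 'wealth'
datacollector = DataCollector(agent_reporters={"wealth": lambda a: a.wealth})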

dmasad commented 7 years ago

That's an interesting question.

My inclination is to say that since the scheduler keeps track of time in a model, models should have just one scheduler. Check out the custom scheduler in the Wolf-Sheep example (if you haven't already) to see how you can activate different types of agents in different orders.

That doesn't solve the problem of how to handle data collection for heterogeneous attributes, though. The easy but ugly way is to give all agents the attribute, and just set some of them to 0, or None, or some similar 'N/A' value.

I think the better way would be to give the DataCollector a default 'collect attribute' behavior, which would also let us get away from some of the ugliness with passing lambdas, etc. As part of that, the DataCollector would ensure that agents had the appropriate attribute, and only collect it if they did.

Corvince commented 7 years ago

Aha! I did not look into the example, because I thought activation order was not important for the wolf-sheep example. And I did not think about implementing a custom scheduler. But after reconsidering, I strongly agree that there should be exactly one scheduler per model.

Regarding the DataCollector, a "collect_attribute" would indeed be nice and more intuitive than the lambda functions. But I would still lean towards additionally defining the agent type. Staying with the Wolf-Sheep example, one might be interested only in the position of the wolves, but querying the pos attribute would still query all agents.

Corvince commented 7 years ago

To advance on this, in the datacollection module I modified the _new_agent_reporter function and added the _collect_attribute function as follows:

def _new_agent_reporter(self, reporter_name, reporter_function=None):
    """ Add a new agent-level reporter to collect.

    Args:
        reporter_name: Name of the agent-level variable to collect.
        reporter_function: Function object that returns the variable when
                           given an agent object.

    """
    if isinstance(reporter_function, str):
        reporter_function = self._collect_attribute(reporter_function)
    self.agent_reporters[reporter_name] = reporter_function
    self.agent_vars[reporter_name] = []

def _collect_attribute(self, attribute):
    """ Return a reporter function that gets the named attribute from an
    agent, if the agent has that attribute.
    """
    def reporter_function(agent):
        if hasattr(agent, attribute):
            return getattr(agent, attribute)
        # implicitly returns None for agents without the attribute
    return reporter_function

So instead of calling the agent reporter with something like {"position": lambda a: a.pos}, we can use {"position": "pos"}. Maybe a bit less ugly, and with the added benefit that it only collects the attribute if it is available.

What do you think?

Corvince commented 7 years ago

I just stumbled upon this issue again in my current model and I still think my last comment offers a nice solution. If you think this is a good way I will create a test for this and submit a PR.

ihopethiswillfi commented 6 years ago

I ran into this as well and used another approach.

Example:

class King(Agent):
    def __init__(self, uid, model):
        super().__init__(uid, model)
        self.uid = uid
        self.wealth = 0  # illustrative initial value

class Bird(Agent):
    def __init__(self, uid, model):
        super().__init__(uid, model)
        self.uid = uid
        self.color = "white"  # illustrative initial value

What I did was simply alter the uids. E.g. all kings would have uids ['k0', 'k1', ...] and birds would have ['b0', 'b1', ...]. You get the idea.

The agent reporter would then look like: "wealth": lambda x: x.wealth if x.uid[:1] == 'k' else None

I didn't really explore Corvince's solution above. But I just wanted to add mine here, which seems to work well for me and is stupidly simple.

An alternative to changing uids would be to simply add a property to the class, something like Agent.type = 'king', and then checking this when you collect data.
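A short sketch of that variant (marker attribute and values are illustrative):

from mesa import Agent

class King(Agent):
    def __init__(self, uid, model):
        super().__init__(uid, model)
        self.type = "king"
        self.wealth = 0

# collect wealth only from agents whose type marker is 'king'
agent_reporters = {
    "wealth": lambda a: a.wealth if getattr(a, "type", None) == "king" else None
}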

Corvince commented 5 years ago

Nowadays the data collector simply returns None if an attribute doesn't exist.

philip928lin commented 1 year ago

@Corvince I encounter an AttributeError because some of my agents do not have certain attributes: "AttributeError: 'Aquifer' object has no attribute 'satisfaction'". Based on your previous response, I expected the model to return None instead of raising an error. Could you help me inspect this issue?

rht commented 1 year ago

There is https://github.com/projectmesa/mesa/pull/1702/files, which adds an optional argument exclude_none_values.

exclude_none_values: Boolean of whether to drop records which values
            are None, in the final result.

This is the only documentation of the new feature so far; a proper guide hasn't been written yet.

If you

  1. enable that option
  2. replace your function with lambda a: getattr(a, "satisfaction", None)

It should automatically ignore agents that don't have satisfaction as an attribute. A pull request to document this feature in https://github.com/projectmesa/mesa/blob/main/docs/useful-snippets/snippets.rst would be appreciated.

philip928lin commented 1 year ago

Thank you for your instructions. They let the code run without an error message, but the output of the agent dataframe is not correct. In the figure below, agent IDs starting with "w" are one agent type and "agt" is another agent type. None should appear in the "Sa" column, which is instead filled with "w1" and "w2", values that belong in the "ID" column.

Would you mind guiding me again?

[update] It turns out I should disable "exclude_none_values" and only implement the lambda function you suggested. But this also means that mesa does not automatically return None for an attribute that does not exist; we still need to do this manually via the lambda function.

[screenshot: agent dataframe with values shifted into the wrong columns]

rht commented 1 year ago

That sounds like a serious bug. Do you have a minimal code reproducer, so I can tinker with it and fix it?

philip928lin commented 1 year ago

Hi @rht, definitely! Please see the following. However, I would hope that this return-None behavior could become the default in Mesa, without needing a lambda.

import mesa

class MyAgent1(mesa.Agent):
    def __init__(self, name, model):
        super().__init__(name, model)
        self.name = name
        self.agt_type = "agt_type1"
        self.satisfication = 1

    def step(self, run=True):
        pass

class MyAgent2(mesa.Agent):
    def __init__(self, name, model):
        super().__init__(name, model)
        self.name = name
        self.agt_type = 'agt_type2'

    def step(self, run=True):
        pass

class MyModel(mesa.Model):
    def __init__(self, n_agents):
        super().__init__()
        self.schedule = mesa.time.BaseScheduler(self)
        for i in range(n_agents):
            self.schedule.add(MyAgent1(f"A{i}", self))
            self.schedule.add(MyAgent2(f"B{i}", self))

        self.datacollector = mesa.DataCollector(
            model_reporters={},
            agent_reporters={"satisfication": lambda a: getattr(a, "satisfication", None),
                             "unique_id": lambda a: getattr(a, "unique_id", None)},                 
            exclude_none_values=True
        )
    def step(self):
        self.schedule.step()
        self.datacollector.collect(self)

m = MyModel(3)
m.step()
m.step()
m.step()

df_agts = m.datacollector.get_agent_vars_dataframe()

Corvince commented 1 year ago

@Corvince I encounter an AttributeError because some of my agents do not have certain attributes: "AttributeError: 'Aquifer' object has no attribute 'satisfaction'". Based on your previous response, I expected the model to return None instead of raising an error. Could you help me inspect this issue?

Mesa only automatically returns None if your data collector consists solely of string reporters, i.e. for your model:

agent_reporters={"satisfication": "satisfication", 
                             "unique_id": "unique_id"}                

At some point this was planned to be the main way to collect attributes (because it is also the fastest), but custom functions are still around, so I kind of closed this issue too early.

Corvince commented 1 year ago

Thank you for your instructions. They let the code run without an error message, but the output of the agent dataframe is not correct. In the figure below, agent IDs starting with "w" are one agent type and "agt" is another agent type. None should appear in the "Sa" column, which is instead filled with "w1" and "w2", values that belong in the "ID" column.

Would you mind guiding me again?

[update] It turns out I should disable "exclude_none_values" and only implement the lambda function you suggested. But this also means that mesa does not automatically return None for an attribute that does not exist; we still need to do this manually via the lambda function.

[screenshot: agent dataframe with values shifted into the wrong columns]

Yes, this is indeed a bug. Looking at the code of #1702 now, we can see that it completely removes None values. But of course that leaves blank spaces in the DataFrame, so the values move to the left (there are too few values to fill the DataFrame and no indication of where values should go). Not sure how that option was supposed to work. @rht?

You can also see in the discussion of #1702 that the feature alone wasn't meant to remove the need for handling AttributeErrors on the user side. Sorry for that.

rht commented 1 year ago

It looks like we have to make a choice:

We could use dict instead of tuple for individual agent records, so the order and size are not important, but a dict consumes more RAM than a tuple.

I think one solution would be to transpose the data collection:

agent_records = {}
for k, func in self.agent_reporters.items():
    # note: the walrus assignment needs parentheses; without them,
    # `r` would be bound to the boolean `func(agent) is not None`
    record = tuple(
        (agent.unique_id, r)
        for agent in model.schedule.agents
        if (r := func(agent)) is not None
    )
    agent_records[k] = record

The records can be merged into 1 DF via the unique_id as an index.
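For example, a rough sketch of that merge with pandas, assuming agent_records as built above:

import pandas as pd

# one DataFrame per reporter, each indexed by unique_id ...
dfs = [
    pd.DataFrame(list(record), columns=["unique_id", name]).set_index("unique_id")
    for name, record in agent_records.items()
]

# ... then aligned on that index; agents missing a value get NaN
df = pd.concat(dfs, axis=1)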

Corvince commented 1 year ago

Maybe I am being dumb right now, but what was the purpose of removing None values in the first place? It doesn't facilitate multi-agent data collection. So I think the obvious choice is not to exclude None values and to receive the correct DF. What would be the advantage of receiving a wrong dataframe, just to save oneself the slight inconvenience (?) of having lots of None values? Am I missing something here?

rht commented 1 year ago

It wasn't raised as a GH issue, but @tpike3 encountered OOM when running Sugarscape G1MT on Colab (see https://github.com/projectmesa/mesa/pull/1702#issuecomment-1560121272). I suppose the storage problem here is due to Python's dict of list growing with list size even though the constituents are None's. Another option: maybe a DF for self._agent_records consumes less RAM, where a row is added to the DF for every data collection step.

Corvince commented 1 year ago

It wasn't raised as a GH issue, but @tpike3 encountered OOM when running Sugarscape G1MT on Colab (see https://github.com/projectmesa/mesa/pull/1702#issuecomment-1560121272). I suppose the storage problem here is due to Python's dict of list growing with list size even though the constituents are None's. Another option: maybe a DF for self._agent_records consumes less RAM, where a row is added to the DF for every data collection step.

But then we really need to clear things up, because until now I thought the remove-None-values feature was somehow related to multi-agent data collection, since it is discussed here and in #1702, which "fixed" #1419, which was also related to multi-agent data collection.

If this is just about resolving that memory issue, then it needs further investigation, because it sounds very strange that removing some None values solves it. None values themselves take up nearly no memory. And I don't know which "dict of list" you are referring to, but yes, something like that must be going on. It still sounds fishy, since Colab has 13 GB of RAM, more than most consumer hardware, so I wonder why this hasn't been encountered previously.

But right now we should focus on resolving the bug found by @philip928lin, because that might really mess up some people's research.

tpike3 commented 1 year ago

@Corvince I had a very long explanation, but as I am digging in I am finding inconsistencies in my understanding, so I will need to dig into this some more. Regardless, when updating the sugarscape with traders, the memory issue was that the code was collecting ~2500 None values each step for the sugar and spice agents, which started to break Colab's memory. The Sugarscape examples are here: https://github.com/SFIComplexityExplorer/Mesa-ABM-Tutorial/tree/main. I still need to update them for Mesa 2.0, but I think I will need to work through this issue first.

Short version: even appreciating that None values take up very little memory, when you have agents at each grid cell (like plant life) and collect against them, it still becomes problematic.

Corvince commented 1 year ago

It's still hard to imagine; it would be great if you could look into this. For reference (and for the fun of it), a simple list of 2500 None values consumes only 20 kB; even if you collect that 1000 times, it's still only 20 MB. More appropriate would be a list that also holds unique_id and step, which can be approximated by

x = [[i, 1, None] for i in range(2500)]

Using

from pympler import asizeof
asizeof.asizeof(x)

we find that this list of lists consumes 300 kB. So after 1000 steps we are at 300 MB. That's still quite far from Colab's 13 GB of RAM.

rht commented 1 year ago

This is the original agent_records

((1, 'A0', 1, 'A0'), (1, 'B0', None, 'B0'), (1, 'A1', 1, 'A1'), (1, 'B1', None, 'B1'), (1, 'A2', 1, 'A2'), (1, 'B2', None, 'B2'))

exclude_none_values only works if the agent_records is organized this way instead

{'satisfication': (('A0', 1), ('B0', None), ('A1', 1), ('B1', None), ('A2', 1), ('B2', None)), 'unique_id': (('A0', 'A0'), ('B0', 'B0'), ('A1', 'A1'), ('B1', 'B1'), ('A2', 'A2'), ('B2', 'B2'))}

where the tuple element can be dropped while safely retaining which agents have which values.

rht commented 1 year ago

You should measure/debug on the actual agent records object at https://colab.research.google.com/github/SFIComplexityExplorer/Mesa-ABM-Tutorial/blob/main/Session_19_Data_Collector_Agent.ipynb.

Corvince commented 1 year ago

You should measure/debug on the actual agent records object at https://colab.research.google.com/github/SFIComplexityExplorer/Mesa-ABM-Tutorial/blob/main/Session_19_Data_Collector_Agent.ipynb.

Thank you for the link, I couldn't find the right version. In your link I only had to change for _, x, y in self.grid.coord_iter(): to for _, (x, y) in self.grid.coord_iter(): to make it work.

Analyzing the actual agent_records object gave me 310 MB of memory usage, and 9 MB for the None-removed version. It was very nice to see my approximation of 300 MB confirmed almost exactly.

But this also shows that while removing None values can save a lot of space compared to the full dataset, it doesn't prevent the model from being run on Colab. I could easily store 10 model runs in Colab's memory. @tpike3 I realized that multiple open Colab tabs share the same session memory. So maybe you were simply doing too much Colab work at the same time? We should also keep in mind that we are collecting more than 4 million data points here. I think 300 MB isn't bad for that, given that most models collect far fewer data points.

This is the original agent_records

((1, 'A0', 1, 'A0'), (1, 'B0', None, 'B0'), (1, 'A1', 1, 'A1'), (1, 'B1', None, 'B1'), (1, 'A2', 1, 'A2'), (1, 'B2', None, 'B2'))

exclude_none_values only works if the agent_records is organized this way instead

{'satisfication': (('A0', 1), ('B0', None), ('A1', 1), ('B1', None), ('A2', 1), ('B2', None)), 'unique_id': (('A0', 'A0'), ('B0', 'B0'), ('A1', 'A1'), ('B1', 'B1'), ('A2', 'A2'), ('B2', 'B2'))}

where the tuple element can be dropped while safely retaining which agents have which values.

At first this looks nice and I like the semantics here of retaining what value is being collected. But I am afraid this won't scale very well. For this small example your version has a larger memory footprint (with None being removed of course), due to the dictionary overhead. That probably goes away with larger size, but it doesn't scale with collecting more attributes, because you always have to store the unique_id with each data value. For example:

('A0', 'a', 'b', 'c', 'd') 

would become

{'A': ('A0', 'a'), 'B': ('A0', 'b'), 'C': ('A0', 'c'), 'D': ('A0', 'd')}

Which can easily take up more memory. So it will really depend on how many None values you have.

Also, I am worried that we need additional code to put the dataframe back together, which will further complicate the code. And the reason to favor #1702 over #1701 was to have simpler code. That advantage goes away for something that could also be done after the fact by simply calling df.dropna(). So I think this really depends on whether we run out of memory or not. But we would need a reproducer for that.

rht commented 1 year ago

That makes a lot of sense now. Maybe it was a coincidence that @tpike3's memory usage was relieved by freeing ~300 MB in each session (as such, it could be gigabytes)?

Regarding multi-agent data collection, @philip928lin already got the correct DF by using getattr(agent, "attr", None) without exclude_none_values, and without any additional feature needed in the library. I vote to remove exclude_none_values since it is not usable, but at the same time I am not inclined to merge #1701, because it's optional at this point.

tpike3 commented 1 year ago

I just went back through and found another change in Mesa 2.0 that broke the tutorial, which I need to go back in and fix. So I will try to get to that this weekend.

However, if you run session 20 (batch_run) and comment out line 204 in the Model cell

#agent_trades = [agent for agent in agent_trades if agent[2] is not None]

This results in GBs of memory usage with one colab open.

You also need to change the instantiation of the sugar and spice landscape (lines 92 to 105) to ...

for _, pos in self.grid.coord_iter():
    max_sugar = sugar_distribution[pos[0], pos[1]]
    if max_sugar > 0:
        sugar = Sugar(agent_id, self, pos, max_sugar)
        self.schedule.add(sugar)
        self.grid.place_agent(sugar, pos)
        agent_id += 1

    max_spice = spice_distribution[pos[0], pos[1]]
    if max_spice > 0:
        spice = Spice(agent_id, self, pos, max_spice)
        self.schedule.add(spice)
        self.grid.place_agent(spice, pos)
        agent_id += 1

@Corvince, @rht, @philip928lin let me know what you think: either something I am messing up, or the best way to move forward.

Corvince commented 1 year ago

@tpike3 I can confirm that batch run leads to excessive memory usage, although it doesn't actually start that many model runs. I need to investigate this further but my first impression is that something is off with batch_run.

tpike3 commented 1 year ago

@tpike3 I can confirm that batch run leads to excessive memory usage, although it doesn't actually start that many model runs. I need to investigate this further but my first impression is that something is off with batch_run.

Thanks @Corvince, I am wondering that too; maybe it wasn't the datacollector but batch_run. I am always behind, but I will dabble with it as well.

EwoutH commented 1 year ago

I have another proposal.

Objective:
Modify Mesa's DataCollector to allow data collection from multiple agent types while retaining backward compatibility for users relying on existing behavior.

High level overview

By using agent_specific_reporters, you can, for example, collect data on the "hunger" attribute for agents of the "Predator" class while gathering data on the "fear" attribute for agents of the "Prey" class, ensuring that each attribute is relevant to the corresponding agent type.

Proposed Changes:

  1. Add an additional parameter, agent_specific_reporters, to the DataCollector constructor.
  2. Modify the data collection mechanism to store agent data in a nested dictionary, with agent type as the key for the outer dictionary, if agent_specific_reporters are defined.
  3. Retain existing behavior if agent_specific_reporters is not defined.

Details:

  1. Agent-specific Reporters Parameter
    The agent_specific_reporters parameter would be a dictionary where:

    • Keys are agent types (classes).
    • Values are dictionaries specifying the agent variables to collect for that specific agent type.
    def __init__(
        self,
        model_reporters=None,
        agent_reporters=None,
        tables=None,
        agent_specific_reporters=None
    ):
        ...
        self.agent_specific_reporters = agent_specific_reporters or {}
  2. Nested Data Collection Mechanism
    If agent_specific_reporters are defined, modify _record_agents to store agent data in a nested dictionary format, with the outer layer being the agent type. This will allow users to retrieve data for specific agent types more easily.

    The structure would look something like this:

    {
        AgentType1: {
            AgentID1: {var1: value, var2: value},
            AgentID2: {var1: value, var2: value},
            ...
        },
        AgentType2: {
            AgentID3: {var1: value, var2: value},
            ...
        },
        ...
    }
  3. Retain Existing Behavior
    When agent_specific_reporters is not specified, the data collection behavior remains identical to the existing implementation to ensure backward compatibility. Data will not be nested by agent type in this case.

Usage Examples:

  1. Using the New agent_specific_reporters Parameter
    Suppose we have two agent types, Sheep and Wolf. We want to collect the energy data for both, but only wool data for Sheep.

    model_reporters = {"total_energy": lambda m: sum(a.energy for a in m.schedule.agents)}
    
    agent_reporters = {"energy": "energy"}
    
    agent_specific_reporters = {
       Sheep: {"wool": "wool_amount"}
    }
    
    data_collector = DataCollector(
       model_reporters=model_reporters,
       agent_reporters=agent_reporters,
       agent_specific_reporters=agent_specific_reporters
    )

    After running the model, the agent data retrieved would look something like:

    {
       Sheep: {
           1: {"energy": 10, "wool": 5},
           2: {"energy": 12, "wool": 6},
           ...
       },
       Wolf: {
           3: {"energy": 15},
           ...
       }
    }
  2. Existing Behavior (No agent_specific_reporters Specified)
    If a user doesn't use the new parameter, behavior remains as is.

    agent_reporters = {"energy": "energy"}
    
    data_collector = DataCollector(agent_reporters=agent_reporters)

    After running the model, the agent data would look like:

    {
       1: {"energy": 10},
       2: {"energy": 12},
       ...
    }

Updated get_agent_vars_dataframe!

Finally, we can update get_agent_vars_dataframe to return either a single DataFrame (current behaviour) or a dictionary with a DataFrame per agent type.

  1. One DataFrame: All agent data in a single table. If some variable is missing for an agent type, it will be None / N/A.
  2. Dict of DataFrames: A dictionary where the key is the agent type and the value is its DataFrame.

Usage:
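(A sketch of both retrieval modes; the split_by_type flag is a hypothetical name.)

# Option 1: one DataFrame, with None / N/A where a variable doesn't apply
df_all = data_collector.get_agent_vars_dataframe()

# Option 2: a dict with one DataFrame per agent type (hypothetical flag)
dfs = data_collector.get_agent_vars_dataframe(split_by_type=True)
sheep_df = dfs[Sheep]
wolf_df = dfs[Wolf]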

So that's it. @jackiekazil @tpike3 @rht @Corvince and others, curious what you think.

Implementation details can be discussed later, once we have agreed on parameter and output syntax / data structure.

Corvince commented 1 year ago

On a first impression this looks very promising. Thanks for the detailed proposal. I think the API would be clear and unambiguous.

I also think that we should use this opportunity to decide how the collected data should be structured before it is turned into a DataFrame, so it could potentially be retrieved in different forms as well. But this should not get in the way of finally establishing a solution for multi-agent data collection.

I will have to think a bit more about the proposal but I hope it will be well received by others as well.

tpike3 commented 1 year ago

Thanks @EwoutH, this is a great proposal.

I am good with this; @EwoutH covered my concerns, which are backwards compatibility and ease of understanding for users. I know we have tried a few implementations before that aim to be more elegant (i.e. they reuse the existing agent_reporter setup), but I think we need this feature, and we should execute and, if needed, iterate on it further.

rht commented 1 year ago

What about when you want to filter agents based on conditions that are not intrinsic to the agent class, as in the Epstein civil violence model? And possibly there are several different classes that are put into one group (multilevel Mesa)? Adding a custom filter to the data collector, as in https://github.com/projectmesa/mesa/pull/1813#issue-1904577128, at least covers this.

EwoutH commented 1 year ago

I also think that we should use this opportunity to decide how the collected data should be structured before it is turned into a DataFrame, so it could potentially be retrieved in different forms as well.

Agreed. What do you think of the two options I suggested in _Updated get_agent_vars_dataframe_ (last part of the proposal)?

@tpike3 Thanks. I think the important thing is to keep it moving; that can also mean further discussing and analyzing the problem and solution. It doesn't need to be finished this week, as long as it doesn't come to a standstill.

What about when you want to filter agents based on conditions that are not intrinsic to the agent class, as in the Epstein civil violence model?

That's an interesting case! Could you work out a bit more detailed proposal with some usage examples? Then we can compare both and either choose one or possibly take the best parts of both of them.

rht commented 1 year ago

Could you work out a bit more detailed proposal with some usage examples?

It's rather brief.

# Filter based on class, as in Wolf-Sheep
dc_wolf = mesa.DataCollector(agent_reporters=..., agent_filter=lambda a: isinstance(a, Wolf))
dc_sheep = mesa.DataCollector(agent_reporters=..., agent_filter=lambda a: isinstance(a, Sheep))
# Filter based on arbitrary condition, as in Epstein civil violence
# An individual agent may change their "type" over time, and this filter supports that.
dc_active = mesa.DataCollector(agent_reporters=..., agent_filter=lambda a: a.condition == "Active")

@tpike3 should have more concrete examples for multilevel Mesa with the Sugarscape model. You then export each of them to a DF, and if one knows how the current Mesa data collector works, no further learning of the API behavior is needed. While it is very general, the downside is that you need a separate data collector object for each group. But this downside, too, can be solved by implementing a convenient wrapper class/function.
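For illustration, a minimal sketch of such a wrapper (name and methods hypothetical):

class DataCollectorGroup:
    """Convenience wrapper that fans collect() out to several collectors."""

    def __init__(self, **collectors):
        self.collectors = collectors  # e.g. wolves=dc_wolf, active=dc_active

    def collect(self, model):
        for dc in self.collectors.values():
            dc.collect(model)

    def get_agent_vars_dataframes(self):
        return {
            name: dc.get_agent_vars_dataframe()
            for name, dc in self.collectors.items()
        }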

On another note, I see https://github.com/projectmesa/mesa/issues/348#issuecomment-1777533710 as an upgrade to #1701. The self.agent_vars object should probably be implemented as separate, swappable, modular classes (AgentDataSimple for the current data collector, and AgentDataByClass for your proposal). The two solutions (https://github.com/projectmesa/mesa/issues/348#issuecomment-1777533710 and agent_filter) should have similar performance, because both technically filter the agents from scratch every time the data collection happens.

EwoutH commented 1 year ago

Right, so you have multiple DataCollector instances. I really like that you can filter on properties and states of agents; that sounds very useful. It would also make it easy to collect data from specific agents, since you can just say a.unique_id == 0, for example.

You have to call each datacollector each step right? What does that do in terms of performance?

EwoutH commented 1 year ago

It also raises the question of what we want a datacollector to be. Because right now it's this one "do it all" object, collecting multiple variables from both models and agents.

Aside from the current two layers (model and agents), we kind of want to add a third and maybe a fourth layer.

So in the most extended form, there is one datacollector object that can handle all those layers.

In the smallest form, a datacollector collects one variable from one object.

Currently it's a bit in between. So I think the real question is: what is the optimal size of the datacollector object?

Another option is to allow multiple variables, but only from one object type:

dc1 = mesa.DataCollector(reporters={"number_of_sheep": "number_of_sheep"}, target=WolfSheepModel, filter=None)
dc2 = mesa.DataCollector(reporters={"energy": "energy"}, target=Sheep, filter=None)
dc3 = mesa.DataCollector(reporters={"energy": "energy", "prey_eaten": "prey_eaten"}, target=Wolf, filter=lambda a: a.prey_eaten > 5)

If you assume a datacollector always collects from only one class (so either one Model or one Agent), it gives some kind of consistency and modularity, instead of sometimes having multiple classes it collects data from and sometimes one.

Because let's be honest, currently you already do:

model_vars = data_collector.get_model_vars_dataframe()
agent_vars = data_collector.get_agent_vars_dataframe()

So it already feels like different things.

Then we could write some functions and methods to conveniently run and merge these, etc.

EwoutH commented 1 year ago

@jackiekazil @tpike3 @rht @Corvince what do you think?

Corvince commented 1 year ago

I think the discussion about your recent proposal has derailed, again 😅

Personally I would prefer to keep a single datacollector. I think it would be annoying to handle multiple collectors and in terms of performance we probably end up iterating over all agents multiple times, which isn't ideal.

I think a filter function can be useful for memory-constrained environments (as we have witnessed before), but I don't think this is in contrast to your previous proposal; it is rather additive.

Also both ideas revolve around syntactic sugar for our function reporter. Both collect-by-class and filter could be used as a function:

def reporter(agent):
    if isinstance(agent, SomeClass) and agent.value > 5:
        return agent.something

But I think the last proposal has a very explicit and intuitive interface for this.

EwoutH commented 1 year ago

I think the discussion about your recent proposal has derailed, again 😅

I understand. But given the many troubles we had with datacollector configuration and scalability, I think it's very important to have a good discussion about scope and scale. If we can make some architectural choices that simplify future scalability and extensibility, that can well be worth it.

So still very curious about everyone's thoughts on that part (see my last comment).

rht commented 1 year ago

How would you specify data collection for all agents in the format of dc2 = mesa.DataCollector(reporters={"energy": "energy"}, target=Sheep, filter=None)? Would the target be WolfSheepModel, but you collect agent data instead?

And you consider

model_vars = data_collector.get_model_vars_dataframe()
agent_vars = data_collector.get_agent_vars_dataframe()

to be hardcoded instances of a more general data collector collection, dc_collection.get_data_collector("all_agents"). I think at this point it'd be worthwhile to look around at other computational libraries with a huge data footprint that use DFs for storing collected data. Agents.jl currently has agent-level and model-level DFs.

But aside from discussion on scalability, having the filter function now would solve a lot of the use cases that run on a single laptop/Colab.

Corvince commented 1 year ago

I finally have some time to respond properly!

Another option is to allow multiple variables, but only from one object type:

dc1 = mesa.DataCollector(reporters={"number_of_sheep": "number_of_sheep"}, target=WolfSheepModel, filter=None)
dc2 = mesa.DataCollector(reporters={"energy": "energy"}, target=Sheep, filter=None)
dc3 = mesa.DataCollector(reporters={"energy": "energy", "prey_eaten": "prey_eaten"}, target=Wolf, filter=lambda a: a.prey_eaten > 5)

If you assume a datacollector always collects from only one class (so either one Model or one Agent), it gives some kind of consistency and modularity, instead of sometimes having multiple classes it collects data from and sometimes one.

What I don't like about this approach is that it creates a lot of unnecessary code. A model's step function would probably look like this:

def step():
    ...
    dc1.collect()
    dc2.collect()
    dc3.collect()

And the final collection will look like this:

model_vars = dc1.get_data()
sheep_data = dc2.get_data()
wolf_data = dc3.get_data()

This is not how I would expect datacollection to happen. For me the name datacollector already implies that this is a single instance that does all the collection for me.

Then we could write some functions and methods to conveniently run and merge these, etc.

True, but then what is the benefit of splitting it up when we finally put it back together?

But I still like the structure of your API. Maybe we can find a way to keep the explicitness of your idea within a single class. What about something like:

dc = DataCollector(
    model,
    {
        model_vars: collect(
            target=WolfSheepModel,
            attribute_names=["sheep_count", "wolf_count",],
            reporter_functions={
                "grass_count": lambda m: m.schedule.get_breed_count(Grass),
            },
        ),
        wolf_vars: collect(
            target=Wolf,
            attribute_names=["energy"],
            sort_key=lambda agent: agent.unique_id,
        ),
        sheep_vars: collect(
            target=Sheep,
            attribute_names=["energy"],
            filter=lambda agent: agent.energy > 10,
        ),
    }
)

This frees us from our static model_reporters / agent_reporters defaults, and users can collect and report whatever they want. Packed in here are some other ideas worth considering:

  1. Using a dedicated collect function to describe the data that should be collected
  2. Using a simple list of attributes as a default way to collect class attributes (instead of "energy": "energy")
  3. Using a dict of reporter functions for more complex data collection
  4. Optional filter function (same as discussed before)
  5. Optional sort_key (idea supplied by GitHub Copilot)

In the end we can collect the data via

model_vars, wolf_vars, sheep_vars = dc.get_data()

Which returns the data in a format that can easily be turned into a DataFrame. Because that's another thing I would rethink: currently the data can only be received as a pandas DataFrame, and we only have our pandas dependency because we convert our data into a dataframe at the moment a user requests it. While this is convenient for most use cases, it should still be optional (and the dependency not strictly required).

EwoutH commented 1 year ago

@Corvince This is exactly what I hoped for in this discussion, thanks a lot. I really like your proposal. If you have a single agent type, it's as simple as it is now, plus the added flexibility of being able to conditionally collect data (basically the filters). The dedicated collect function and simple lists of attributes are really nice, both flexible and easy for beginners.

A few questions:

  1. The model_vars, wolf_vars and sheep_vars are the dictionary keys and should be strings right? ( "model_vars", etc.)
  2. Does model need to be passed explicitly? In practice that will just be self right, since it's in a model?
  3. target should be a mesa Model or Agent (sub)class, and we detect at runtime which one it is. If a Model, a regular dict is returned (step as key); if an Agent, a nested dict (outer key step, inner key agent.unique_id).
    • If you pass Agent, does it collect from all Agent subclasses?
    • Do we allow Model?
  4. Are there only two possibilities for collect, attribute_names and reporter_functions, or do we also want to keep the current options of methods of a class/instance and functions with parameters placed in a list?
  5. dc.get_data() returns by default all data dictionaries collected. Can we allow dc.get_data("wolf_vars") to only get that dictionary, for example?

Maybe we can also add something time-based. I'm thinking of start and stop steps/conditionals (start=20 would start collecting at step 20) and maybe a collection interval (interval=5 would collect the data every 5 steps). Maybe we can merge this elegantly into a single variable, maybe just an actual Python range (range(start, stop, step))?
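A sketch of the range idea (collect_when is a hypothetical parameter name). Since Python range membership checks are O(1), the gate would be cheap:

dc = DataCollector(
    agent_reporters={"energy": "energy"},
    collect_when=range(20, 1000, 5),  # start at step 20, stop at 1000, every 5 steps
)

# and inside DataCollector.collect(), before any reporters run:
# if self.collect_when is not None and model.schedule.steps not in self.collect_when:
#     return  # outside the collection window, skip this step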

Corvince commented 1 year ago

@Corvince This is exactly what I hoped for in this discussion, thanks a lot. I really like your proposal. If you have a single agent type, it's as simple as it is now, plus the added flexibility of being able to conditionally collect data (basically the filters). The dedicated collect function and simple lists of attributes are really nice, both flexible and easy for beginners.

Thank you! I will try to answer these questions, though of course my answers are not authoritative; everything is up for discussion.

A few questions:

  1. The model_vars, wolf_vars and sheep_vars are the dictionary keys and should be strings right? ( "model_vars", etc.)

Yes, I'll update my comment to reflect this.

  1. Does model need to be passed explicitly? In practice that will just be self right, since it's in a model?

I think we have to, otherwise we would have no access to the attributes or the list of agents.

  1. target should be a mesa Model or Agent (sub)class, and we detect at runtime which one it is. If a Model, a regular dict is returned (step as key); if an Agent, a nested dict (outer key step, inner key agent.unique_id).

That's exactly how I envision it.

  • If you pass Agent, does it collect from all Agent subclasses?

Good idea, I actually haven't thought about how to collect from all agents, but simply using Agent sounds like a nice solution.

  • Do we allow Model?

This would be in line with the previous response. I see no reason to forbid this. In practice we will probably use isinstance, and Model should pass this check as well (not 100% sure on this).

  1. Are there only two possibilities for collect, attribute_names and reporter_functions, or do we also want to keep the current options of methods of a class/instance and functions with parameters placed in a list?

I would restrict this to attribute_names and function-likes. I think class/instance methods should automatically work as well.

  1. dc.get_data() returns by default all data dictionaries collected. Can we allow dc.get_data("wolf_vars") to only get that dictionary, for example?

That's a good idea. It could also take other parameters, for example .get_data(format='dataframe').

Maybe we can also add something time-based. I'm thinking of start and stop steps/conditionals (start=20 would start collecting at step 20) and maybe a collection interval (interval=5 would collect the data every 5 steps). Maybe we can merge this elegantly into a single variable, maybe just an actual Python range (range(start, stop, step))?

Excellent idea! This would allow doing data collection much more declaratively.

Corvince commented 1 year ago

Another idea:

What if we separate the model from the data collection more strictly? That is, we define the model strictly in terms of its functionality, and we define the data collector as in the previous proposal. Then we start the model by calling

run_model(model, datacollector)

This way we would only pay the data collection overhead when we actually need it, for example not for visualizations or for things like mesa-replay.

//edit (hit post too soon)

This would also be more in line with batch_run, which could also take a (possibly different) datacollector
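A minimal sketch of what that could look like (run_model and its signature are hypothetical):

def run_model(model, datacollector=None, steps=100):
    """Run the model, collecting data only if a collector is supplied."""
    for _ in range(steps):
        model.step()
        if datacollector is not None:
            datacollector.collect(model)
    return datacollector.get_data() if datacollector is not None else None

# no collector, no collection overhead, e.g. for visualization or mesa-replay:
run_model(model, steps=500)

# batch_run could accept a (possibly different) collector the same way:
data = run_model(model, datacollector=dc, steps=500)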

EwoutH commented 1 year ago

I'm a big fan of Corvince's latest proposal. I think it's both elegant and adds a huge amount of capability and flexibility!

@jackiekazil @tpike3 @rht I'm really curious what you think!

EwoutH commented 11 months ago

I would like to give implementing @Corvince's proposal a go. But before doing that, I need to know if there is broader support, or if we need to go in a different direction, like my previous proposal or otherwise.

I would also really hope we can move this forward. "No", "I disagree", "I don't have time", "this shouldn't be a priority" are all legit answers. But please just communicate anything; then everyone knows how the deck is stacked.

EwoutH commented 10 months ago

Now that #1894 is merged, we can take a further look at datacollection. #1911 might also help, maybe we can use a data collection as input for the datacollector (or use a similar API).

EwoutH commented 7 months ago

Note that this discussion is largely continued here:

EwoutH commented 2 months ago

I'm posting back here instead of #1944, because it directly follows a proposal here:

I'm inclined to say that @Corvince was closest with his API:

dc = DataCollector(
    items={
        "wolf_vars": collect(
            target=model.get_agents_of_type(Wolf),
            attributes={
                "energy": "energy",
                "healthy": lambda a: a.energy > 5,
            },
            aggregates={
                "mean_energy": ("energy", np.mean),
                "number_healthy": ("healthy", sum),
            },
        ),
    }
)

This would return the following dictionary:

{
    "wolf_vars": {
        "attributes": {
            "agent_id": [1, 2, 3],       # List of agent IDs
            "energy": [3, 7, 10],        # Energy levels of each wolf
            "healthy": [False, True, True],  # Whether each wolf is healthy (energy > 5)
        },
        "aggregates": {
            "mean_energy": 6.67,         # Mean energy of all wolves
            "number_healthy": 2          # Number of healthy wolves
        }
    }
}

Implementation wise, this could roughly look like:

class DataCollector:
    def __init__(self, items):
        self.items = items
        self.data = {
            key: {
                "attributes": {},
                "aggregates": {}
            }
            for key in items
        }

    def collect(self, model):
        for item_name, item_details in self.items.items():
            agents = item_details["target"]  # an AgentSet, per the collect() sketch
            attributes = item_details["attributes"]
            aggregates = item_details["aggregates"]

            # Collect agent IDs
            self.data[item_name]["attributes"]["agent_id"] = agents.get("unique_id")

            # Collect attributes for each agent
            for attr_name, attr_func in attributes.items():
                if isinstance(attr_func, str):
                    # Plain attribute name: use AgentSet.get()
                    values = agents.get(attr_func)
                else:
                    # Reporter function: use AgentSet.apply()
                    values = agents.apply(attr_func)
                self.data[item_name]["attributes"][attr_name] = values

            # Collect aggregates
            for agg_name, (attr_name, agg_func) in aggregates.items():
                values = self.data[item_name]["attributes"][attr_name]
                self.data[item_name]["aggregates"][agg_name] = agg_func(values)

I think this gives a huge amount of flexibility while offering a logical code path: first collect the raw agent data, then aggregate if needed.

A nice benefit is that AgentSet.get() or AgentSet.apply() only needs to be applied once per variable.

EwoutH commented 2 months ago

One thing which could be considered is not running aggregates per collect() function, but once per DataCollector object. This way you could in theory combine aggregates from different collect() targets and the model.
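A sketch of such DataCollector-level aggregates (an aggregates keyword at this level is hypothetical); they receive the whole collected data dict and can therefore span targets:

dc = DataCollector(
    items=items,  # the wolf_vars / sheep_vars items dict from the earlier sketch
    aggregates={
        # combines attribute lists from two different collect() targets
        "total_energy": lambda data: sum(data["wolf_vars"]["attributes"]["energy"])
        + sum(data["sheep_vars"]["attributes"]["energy"]),
    },
)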

rht commented 2 months ago

The new ideas that I can incorporate into #2199:

I still find the API too hectic and too verbose for casual users to intuitively remember, unless there is a key feature that the simple API in #2199 can't cover. That is why I implemented #2199 the way I did and ditched the fancy measure classes. Reminder that there is not much time left on the drawing board: only ~2 weeks left.

Corvince commented 2 months ago

One challenge I keep encountering is with the terminology we're using. I believe we're conflating data collection and data analysis too often, which muddies the distinction between the two and distracts us from what we want to achieve.

In my current job, I've had to re-evaluate various libraries, focusing on what makes some more user-friendly than others. I've found that the deciding factor in terms of ease of use is having sensible defaults combined with the ability to fully customize under the hood. An intuitive API for data collection, in my view, would look something like this:

data = run_model(model)

However, this kind of simplicity is missing from the current API. The reason I advocate for this approach is that I typically prefer to collect as much data as possible during the model run and perform the analysis afterward, either through custom functions or with built-in Mesa functions. By default, I believe we should automatically collect all agent and model attributes (and possibly every property) at every step. Aggregates, by their nature, can be calculated post-run. Expressions like "healthy": lambda a: a.energy > 5 either belong in the analysis phase, meaning they don't need to be calculated during runtime, or they are intrinsic to the model and should therefore be treated as their own attribute or property.
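A sketch of what that collect-everything default could look like (helper name hypothetical; plain instance attributes only, and assuming the model exposes an agents collection):

def snapshot(agent):
    # every public instance attribute; the analysis phase decides what to keep
    return {k: v for k, v in vars(agent).items() if not k.startswith("_")}

# one dict per agent per step, no reporter configuration needed
raw_step_data = [snapshot(a) for a in model.agents]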

I anticipate concerns about the potential performance impact of this approach. However, I don't think this will be a significant issue for most models. Data collection should be implemented at a low level, with more "expensive" convenience functions like .todf() being applied afterward. Of course, users should still have the ability to fully customize data collection by providing:

data = run_model(model, data_collector=DataCollector(...))

This way, the API can focus on being flexible rather than overly concise. It doesn't need to be memorized for every model but can be something that users opt into when they have specific requirements.