mimiframework / Mimi.jl

Integrated Assessment Modeling Framework
https://www.mimiframework.org
Other
68 stars 34 forks source link

Indexing into a sim result should return an array #558

Open davidanthoff opened 5 years ago

davidanthoff commented 5 years ago

Right now we have an asymmetry in the API, in that model[:foo, :bar] returns an array, but simresult[:foo, :bar] returns a DataFrame.

I think we should align those, so that indexing always returns arrays, and getdataframe always returns DataFrames, for both the deterministic and the probabilistic case.

Problem of course is that this would be a breaking change...

rjplevin commented 5 years ago

Agreed. Do we have any idea how many people are using the built-in MCS? Changing simresult[:x, :y] might not be noticed at all...

corakingdon commented 5 years ago

Also agree, and I don't think this would affect anyone besides me and maybe Lisa.

I think that Lisa might have implemented it as a dataframe for this though because of variables that have a time and region dimension. This result becomes a 3-D array when you add in the dimension of MCS trials. A dataframe for this is more interpretable, so we would just have to be clear about the extra dimensions somehow if we are going to switch it to an array.

davidanthoff commented 5 years ago

An array, though, is also a much more efficient data structure for this data.

corakingdon commented 5 years ago

So would we want the "extra" dimension for the number of trials to be the first or last dimension of the array?

davidanthoff commented 5 years ago

I'd say first. For one, that seems most natural to me, and it should also be the most efficient way, given how arrays are stored (i.e. all the values for all the runs for a year/region combination then would be stored next to each other).

corakingdon commented 5 years ago

@rjplevin @davidanthoff I'm looking at this now, and the results values in a SimulationInstance are being stored internally as dataframes:

mutable struct SimulationInstance{T}
    trials::Int
    current_trial::Int
    current_data::Any               # holds data for current_trial when current_trial > 0
    sim_def::SimulationDef{T} where T <: AbstractSimulationData
    models::Vector{M} where M <: AbstractModel
    results::Vector{Dict{Tuple, DataFrame}}
    payload::Any

    function SimulationInstance{T}(sim_def::SimulationDef{T}) where T <: AbstractSimulationData
        self = new()
        self.trials = 0
        self.current_trial = 0
        self.current_data = nothing
        self.sim_def = deepcopy(sim_def)
        self.payload = deepcopy(self.sim_def.payload)

        # These are parallel arrays; each model has a corresponding results dict
        self.models = Vector{AbstractModel}(undef, 0)
        self.results = [Dict{Tuple, DataFrame}()]

        return self
    end
end

and the getindex function is just returning what's stored there:

function Base.getindex(sim_inst::SimulationInstance, comp_name::Symbol, datum_name::Symbol; model::Int = 1)
    return sim_inst.results[model][(comp_name, datum_name)]
end

Do we want to also change this internal storage structure to just be arrays, or should I only modify the getindex function to return an array instead of a dataframe?

davidanthoff commented 5 years ago

Storing as an array should be way more efficient, right? So changing that to array storage internally seems like a good idea to me. But @lrennels should probably also weigh in because her stuff might rely on the current behavior.

rjplevin commented 5 years ago

Before we do this, we might want to move to streaming output to avoid memory limitations. Then we might allow the user to call different API funcs to realize the data as an in-memory array or DF.

lrennels commented 5 years ago

When working on these I did add an issue #525 which has the description:

Currently in-memory results are stored in the SimulationInstance as results::Vector{Dict{Tuple, DataFrame}}. It would be more efficient to store this instead as results::Vector{Dict{Tuple, Array}} with a multi-dimensional array instead of the flattened DataFrame which has redundancy.

It's just a bit tricky in the internals to store as an array due to the potential for 3 or even more dimensions, as @ckingdon95 pointed out, so I didn't want to handle all of it at once. The issue with returning an array is that it's fairly ambiguous when it comes to those 3rd, 4th etc. dimensions, so I thought a DataFrame might be better there, but I can see why consistency with indexing into a Model would be important too.

davidanthoff commented 5 years ago

Couldn't we just generally say that we add the run dimensions as the first dimension, and then all the other dimensions are just whatever was specified in the @defcomp?

The point with the streaming output is good, but I don't think we can sort that out before 1.0, right?

So not sure what to do... Maybe just remove the indexing via [] for now, so that we can "reserve" that syntax for our super dupa solution, once we have one, and for now only offer function calls to get results?

rjplevin commented 5 years ago

We might need a new type that stores data as a N-dim array, but also carries meta-info that describes the dimensions. Perhaps it already exists... seems generally useful.

lrennels commented 5 years ago

The dimensions here are not specified in the @defcomp, they are specified in @defsim because they are things like extra scenarios you want to run your simulation for.

corakingdon commented 5 years ago

Ah I forgot about different scenarios being treated as another dimension... maybe the scenarios should be treated like how multiple models are treated? Like a separate dictionary of results (Dict{Tuple, Array}) per model-scenario combo?

rjplevin commented 5 years ago

Might be useful: https://juliaobserver.com/packages/MetadataArrays

davidanthoff commented 5 years ago

There is also AxisArrays.jl and NamedArrays.jl, or something like that...

The more I think about this, the more I think the most important thing is that for 1.0 we pick something that we can augmented/change later on without breaking things.

corakingdon commented 5 years ago

For now for 1.0.0: take away square bracket indexing into a sim instance, only allow users to call getdataframe (need to add). Later, figure out how to change the internal storage to be arrays instead of dataframes, and possibly combine with streaming.