Open davidanthoff opened 5 years ago
Agreed. Do we have any idea how many people are using the built-in MCS? Changing simresult[:x, :y]
might not be noticed at all...
Also agree, and I don't think this would affect anyone besides me and maybe Lisa.
I think that Lisa might have implemented it as a dataframe for this though because of variables that have a time and region dimension. This result becomes a 3-D array when you add in the dimension of MCS trials. A dataframe for this is more interpretable, so we would just have to be clear about the extra dimensions somehow if we are going to switch it to an array.
An array, though, is also a much more efficient data structure for this data.
So would we want the "extra" dimension for the number of trials to be the first or last dimension of the array?
I'd say first. For one, that seems most natural to me, and it should also be the most efficient way, given how arrays are stored (i.e. all the values for all the runs for a year/region combination then would be stored next to each other).
@rjplevin @davidanthoff I'm looking at this now, and the results values in a SimulationInstance
are being stored internally as dataframes:
mutable struct SimulationInstance{T}
trials::Int
current_trial::Int
current_data::Any # holds data for current_trial when current_trial > 0
sim_def::SimulationDef{T} where T <: AbstractSimulationData
models::Vector{M} where M <: AbstractModel
results::Vector{Dict{Tuple, DataFrame}}
payload::Any
function SimulationInstance{T}(sim_def::SimulationDef{T}) where T <: AbstractSimulationData
self = new()
self.trials = 0
self.current_trial = 0
self.current_data = nothing
self.sim_def = deepcopy(sim_def)
self.payload = deepcopy(self.sim_def.payload)
# These are parallel arrays; each model has a corresponding results dict
self.models = Vector{AbstractModel}(undef, 0)
self.results = [Dict{Tuple, DataFrame}()]
return self
end
end
and the getindex
function is just returning what's stored there:
function Base.getindex(sim_inst::SimulationInstance, comp_name::Symbol, datum_name::Symbol; model::Int = 1)
return sim_inst.results[model][(comp_name, datum_name)]
end
Do we want to also change this internal storage structure to just be arrays, or should I only modify the getindex
function to return an array instead of a dataframe?
Storing as an array should be way more efficient, right? So changing that to array storage internally seems like a good idea to me. But @lrennels should probably also weigh in because her stuff might rely on the current behavior.
Before we do this, we might want to move to streaming output to avoid memory limitations. Then we might allow the user to call different API funcs to realize the data as an in-memory array or DF.
When working on these I did add an issue #525 which has the description:
Currently in-memory results are stored in the SimulationInstance
as results::Vector{Dict{Tuple, DataFrame}}
. It would be more efficient to store this instead as results::Vector{Dict{Tuple, Array}}
with a multi-dimensional array instead of the flattened DataFrame which has redundancy.
It's just a bit tricky in the internals to store as an array due to the potential for 3 or even more dimensions, as @ckingdon95 pointed out, so I didn't want to handle all of it at once. The issue with returning an array is that it's fairly ambiguous when it comes to those 3rd, 4th etc. dimensions, so I thought a DataFrame
might be better there, but I can see why consistency with indexing into a Model
would be important too.
Couldn't we just generally say that we add the run dimensions as the first dimension, and then all the other dimensions are just whatever was specified in the @defcomp
?
The point with the streaming output is good, but I don't think we can sort that out before 1.0, right?
So not sure what to do... Maybe just remove the indexing via []
for now, so that we can "reserve" that syntax for our super dupa solution, once we have one, and for now only offer function calls to get results?
We might need a new type that stores data as a N-dim array, but also carries meta-info that describes the dimensions. Perhaps it already exists... seems generally useful.
The dimensions here are not specified in the @defcomp
, they are specified in @defsim
because they are things like extra scenarios you want to run your simulation for.
Ah I forgot about different scenarios being treated as another dimension... maybe the scenarios should be treated like how multiple models are treated? Like a separate dictionary of results (Dict{Tuple, Array}
) per model-scenario combo?
Might be useful: https://juliaobserver.com/packages/MetadataArrays
There is also AxisArrays.jl and NamedArrays.jl, or something like that...
The more I think about this, the more I think the most important thing is that for 1.0 we pick something that we can augmented/change later on without breaking things.
For now for 1.0.0: take away square bracket indexing into a sim instance, only allow users to call getdataframe
(need to add). Later, figure out how to change the internal storage to be arrays instead of dataframes, and possibly combine with streaming.
Right now we have an asymmetry in the API, in that
model[:foo, :bar]
returns an array, butsimresult[:foo, :bar]
returns aDataFrame
.I think we should align those, so that indexing always returns arrays, and
getdataframe
always returnsDataFrame
s, for both the deterministic and the probabilistic case.Problem of course is that this would be a breaking change...