Make MCDA functions generic

ConnectedSystems commented 1 year ago

Make MCDA functions generic to any number of criteria, and use auto-generated names as opposed to the current index-based approach which is prone to errors when making adjustments.

create_decision_matrix
create_seed_matrix

Could make changes to adopt JMcDM as suggested in #242 instead.

Rosejoycrocker commented 1 year ago

@ConnectedSystems Thoughts on the best way to make the MCDA functions accept any number of criteria with auto-generated names. I'm imagining this as a series of changes:

Change the mcda_vars constructor to create a table of criteria of size (nsites *ncriteria) with column names generated from the names of the variables it was constructed from, and a dict of weights of size ncriteria.
Change the decision matrix construction to just use the constructed table and weights to form a matrix, with normalisation.
The seeding and shading matrix contructors would then just select from this matrix based on some info about which columns are seeding criteria and which are shading (and later fogging, when these are separated).
Criteria which require more complicated calculations, such as the space left over for coral growth at each site, could be calculated in separate functions outside of the matrix construction and then fed into the mcda_vars constructor so that the decision matrix constructor could remain generic.
How did you imagine a restructure to allow more genericity? The MCDA techniques package could be cool to incoporate, allowing the use of more algorithms (although this would have to be restricted in sensitivity analyses).

Rosejoycrocker commented 1 year ago

Adding to this discussion here to flesh out planned changes to make MCDA more generic @ConnectedSystems

ConnectedSystems commented 1 year ago

Could the comments above be collated into a single comment per related set of changes, as opposed to a series of small comments? It's hard to track what is related to what at the moment.

I'll review once you're done writing things out.

Note: comments mentioned here have been reorganised into the three responses below.

Rosejoycrocker commented 1 year ago

Change 1 - New set of functions for creating dataframes defining weights, tolerances and criteria:

Plan to add function create_criteria_df which takes tuples ("criteria_name", [criteria vector]) to create a dataframe of crteria with "criteria_name"'s as columns.
First arguments will be mandatory criteria (I'm thinking coral cover, connectivity and heat/wave stress but we can leave out any of these)

Pseudo code:

function crteria_df(cover, space, connectivity, heat_stress,wave_stress, criteria...)

   criteria_df = DataFrame(coral_cover=cover, coral_space = space, connectivity=connectivity, heat_stress=heat_stress, 
   wave_stress=wave_stress)
   for each crit in criteria
         criteria_df[!,crit[1]] = crit[2]
  end
  return criteria_df 
end

criteria names would ideally be assigned by converting the input argument name to a string, but I couldn't find a way to do this with varargs (if you know of one?)
create_tolerance_df which takes tuples ("criteria name", [tol_val, "gt/lt"]), where "criteria_name" corresponds to any column names in criteria_df. Creates a dataframe with the same columns as criteria_df, but the first row is tolerance values and the second indicates filtering is greater than or less than with "gt" or "lt". Takes any number of inputs (and will check for name matches to criteria_df).
create_weights_df which takes tuples ("criteria_name", weight) with all "criteria_name"'s matching columns of criteria_df.
When ARIA initialises mcda variables, create_tolerance_df, create_weight_df and create_criteria_df can be called once and then just updated at each time step.
Additional parameters for distance sorting can be stored in a structure to be passed into guided_site_selection, such as the distances matrix, min_dist etc.

Rosejoycrocker commented 1 year ago

Change 2 - create_decision_matrix will have a more generic structure and will use tolerances_df to filter a matrix created from criteria df. create_seed_matrix and create_shade_matrix can be folded into a single function create_intervention_matrix which selects columns and weights for the intervention of interest.

New create_decision_matrix function simply converts criteria_df to an array and applies filters specified in the tolerances dataframe.

Psuedo code:

function create_decision_matrix(criteria_df,tolerances_df)

  A = Matrix(criteria_df)
  crit_cols = names(criteria_df)
  tol_cols = names(tolerances)

for name_ind in 1:length(tol_cols)
    find index of name in crit_cols
    if found
          if  "gt" in tolerance_df at index
              set rule <= tolerances_df value at index
          if "lt"  in tolerance_df at index
              set rule > tolerances_df value at index
          end
         A[findall(rule), index] .= NaN
     end
 end

 filtered = vec(.!any(isnan.(A), dims=2))
 # remove rows with NaNs
 A = A[filtered, :]
 return A
end

Returned matrix A can then be used in current MCDA algorithms, or can be converted to df for use with suggested mcda package.
there's probably a better way to indicate whether the tolderance is greater than or less than but that's the basic idea.
create_seed_matrix and create_shade_matrix will become a single function create_intervention_matrix, which takes A, criteria_df, weights_df and a vector of strings indicating the criteria to be used for the particular intervention. I imagine these as sim_constants maybe.
For exmaple the seeding matrix `SE = create_intervention_matrix(A,criteria_df, weights_df, ["coral_space", "connectivity","heat_stress","wave_stress"])
Psuedo code :

function create_intervention_matrix(A,criteria_df, weights_df,intervention_criteria_names)
 Find indices of cols in criteria_df corresponding to names in intervention_criteria_names
 Select these cols from A (+site_id column)
 Find indices in weights_df corresponding to names in intervention_criteria_names
 Select weights from weights_df using indices and normalise
 return intervention matrix, normalised weights
end

In ADRIA, seeding and shading matrices (and fogging if we choose to separate fogging and shading) can be created using create_intervention_matrix with different vectors of strings indicating criteria to be included in the matrix.
If/when we switch to using the julia package for mcda, the same functions can be used to make the decision matrices and weightings used.

mcda is implemented in JMcDM as

w  = [0.35, 0.15, 0.25, 0.20, 0.05];
fns = [minimum, maximum, minimum, minimum, maximum];
result = topsis(Matrix(df), w, fns);

So the matrix and weights can still be used with the package, with perhaps guided being a string instead of an integer indicating the algorithm to use.

Rosejoycrocker commented 1 year ago

Change 3- Move calculation of criteria into set of functions to be called prior to initialising and updating criteria_df.

adds a new file to sites, perhaps mcda_io.jl or similar which has functions which perform required calculations to produce criteria as vectors of length nsites to input in create_criteria_df

For example, the heat stress and wave stress criteria could be created with

function env_stress_criteria(env_stress)
return maximum(env_stress) != 0.0 ? (env_stress .- minimum(env_stress)) ./ (maximum(env_stress) - minimum(env_stress)) : zeros(Float64, size(env_stress))
end

An initialisation function initialise_mcda() would take in all necessary arguments to call these criteria functions and create input vectors for create_criteria_df. It could also initialise weights_df and tolerances_df.
An updating function update_mcda() would call these criteria functions to create updated criteri vectors to add to criteria_df.
This takes any reference to specific criteria formulations outside of the mcda functions so they can stay generic.

ConnectedSystems commented 1 year ago

Thank you, this is much nicer. Could you delete the other comments now that we have this write up?

As a general rule, in dynamic languages like Julia/Python/R/MATLAB - and especially in languages that support functional programming concepts - if you find that you're relying on a lot of loops, string matching and (nested) if/else statements, it is usually an indication that the design or envisioned approach is overly complicated and needs to be rethought. They are design smells so to speak.

Have a read through my responses to "Change 1" first and see if it might impact/change the design of the other changes.

Response to Change 1:

If the arguments for crteria_df() are used to construct a DataFrame, then are there any adverse implications from:

using splatted keyword arguments?
using a NamedDimsArray+AxisKeys instead of a DataFrame (better performance)

function criteria_df(; criteria...)
    # Add a check that the expected criteria are provided.
    # Ideally this would be agnostic to the order of inputs
    # (what if we need to add another kind of connectivity later? etc etc)
    return DataFrame(criteria)
end

If we're planning to leverage JMcDM, and it expects a matrix as the argument anyway, then we can avoid transforming between DataFrames -> matrices and just use matrices with named indices.

Note that readable code with documentation removes the need for you to specify the return type in the function name. The _df suffix, I don't find it particularly helpful, could a more informative name be thought up?

create_tolerance_df which takes tuples ("criteria name", [tol_val, "gt/lt"]), where "criteria_name" corresponds to any column names in criteria_df. Creates a dataframe with the same columns as criteria_df, but the first row is tolerance values and the second indicates filtering is greater than or less than with "gt" or "lt". Takes any number of inputs (and will check for name matches to criteria_df).

Is the plan here is to parse the string and run the associated operator? I understand why associating the threshold with the criteria is necessary, but this approach is very complicated. Strings are also computationally expensive to parse over and over (this is why Julia uses/recommends Symbols for the most part).

I would association the comparison function itself (note that operators are generic functions in Julia).

julia> <
< (generic function with 142 methods)

julia> <(5, 6)
true

I think the argument formatted as a tuple of (name, vector{value, string}) is going to cause issues - it's not straightforward. It could just be a function with splatted keyword args (as above).

function criteria_tolerances(; thresholds...)
    # return named matrix of lambdas
end

criteria_tolerances(; cover=(0.5, <), heat=(0.3, >), ... so on and so forth ...)

The return could be a named vector of lambdas/anonymous functions, of size N (where N == number of criteria).

The lambda would be something like x -> <(x, threshold_val)

Each entry has the criteria name as its key, and its entry is the comparison function.

As a for loop, it could be used like (I've not tried this of course):

for thresh in criteria
    thresh(val)
end

But you don't need a for loop in this case, you could just broadcast the values.

map.(criteria, some_values)

# Or using pipe syntax

some_values .|> criteria

create_weights_df which takes tuples ("criteria_name", weight) with all "criteria_name"'s matching columns of criteria_df.

Use named indices and you don't need this.

Additional parameters for distance sorting can be stored in a structure to be passed into guided_site_selection, such as the distances matrix, min_dist etc.

Is this a current or new struct?

Would this need to be updated each run? What performance implications are there?

For Change 2:

As mentioned above, seems your planned changes rely on a lot of string matching for loops and nested if statements. Please consider the feedback above and have a think about how you would simplify things.

So the matrix and weights can still be used with the package, with perhaps guided being a string instead of an integer indicating the algorithm to use.

A limitation of the samplers we're using is that they only work with numeric values. So it has to be a (discrete) value that can be mapped to an associated function. We can map the value directly to the function of interest, no need for a string.

See under "Examples" - function can be selected from a list of available methods. https://jbytecode.github.io/JMcDM/dev/utility/#mcdm

An implication is if new methods are added to JMcDM which changes the order of that list, messing up our expected sampled value -> function mapping. But this can be handled by only updating JMcDM between "minor" versions of ADRIA.

For Change 3:

For example, the heat stress and wave stress criteria could be created with

Is this just min-max normalization? Is it that different criteria require different normalization approaches? Do we really need a criteria-specific function to do this? Do all criteria require a normalization step?

We already have an ADRIA-context specific normalization function. Can we use that instead? https://github.com/open-AIMS/ADRIA.jl/blob/9dc45df4c53271aa83046e65bb85c2f7d2b223a6/src/analysis/analysis.jl#L43

adds a new file to sites, perhaps mcda_io.jl or similar which has functions which perform required calculations to produce criteria as vectors of length nsites to input in create_criteria_df

What are the "required calculations"? An example would be appreciated.

Generally io would indicate it is related to file interaction or networks or something along those lines. Are you thinking of reading/writing data as part of the process here?

An updating function update_mcda() would call these criteria functions to create updated criteri vectors to add to criteria_df.

Sounds like this would allocate memory? Have you thought of separating the core behaviour into a separate function so that it can be used to update matrices in place?

Additional considerations / feature requests:

We need to be able to constrain site selection to a specific spatial area (or areas). I have a very hacky approach for the recent ADRIA-informed ReefMod runs to work, but this kind of change should also be considered with the changes above
You might as well address #231 while you're at this
Other models use data not currently considered. These include CoTS connectivity, Cyclone severity where values are cyclone categories 1 - 5 (I think, will confirm). Will discrete values affect anything?
Once you have the return from MCDA - what do you do with it? Does anything change on that front? Are the outputs conceptually identical to what we're using now? What implications are there for ADRIA (both in terms of design and code)?

Rosejoycrocker commented 1 year ago

In response to your comments:

Happy to work with NamedDimsArrays andAxisKeys instead of DataFrames
I wasn't aware you could do that with julia operators so happy to do that for the tolerances
I will of course attempt to limit loops where possible
In your comment about criteria specific functions I probably used the wrong example. Many of the criteria are much more complicated to create than the example I used and so it is messy to do these inline without a function. For example creating the zones criteria is L457-L474 here : https://github.com/open-AIMS/ADRIA.jl/blob/9dc45df4c53271aa83046e65bb85c2f7d2b223a6/src/sites/dMCDA.jl#L457#L474 The creation of the priority predecessor criteria is similarly several lines. I also think maybe this is what you mean here: "Have you thought of separating the core behaviour into a separate function so that it can be used to update matrices in place?"

For the additional considerations/feature requests:

We need to be able to constrain site selection to a specific spatial area (or areas). I have a very hacky approach for the recent ADRIA-informed ReefMod runs to work, but this kind of change should also be considered with the changes above You might as well address https://github.com/open-AIMS/ADRIA.jl/issues/231 while you're at this

Could this be done using an extension to the depth_priority index, where only the indices of the set of sites in the specified area are used?

Other models use data not currently considered. These include CoTS connectivity, Cyclone severity where values are cyclone categories 1 - 5 (I think, will confirm). Will discrete values affect anything?

Discrete values won't effect anything :)

Once you have the return from MCDA - what do you do with it? Does anything change on that front?

The return is slightly different for the JMcDM functions but not too different, so just some tweaks here and there.

Are the outputs conceptually identical to what we're using now?
Yes, the JMcDM functions follow the same principles as the functions currently used.
What implications are there for ADRIA (both in terms of design and code)?

I don't think it will drastically change ADRIA proper, but depending on the format of the outputs of JMcDM the seeding and shading ranks logging may change, or there may need to be postprocessing functions to get the outputs in the same format as the seed/shade logs as they are now. Additionally, it may make sense to make the guided_site_selection generic to an intervention type - so we don't have to keep adding things if we want different interventions, we just change the criteria matrix and the rest stays the same.

ConnectedSystems commented 1 year ago

Happy to work with NamedDimsArrays andAxisKeys instead of DataFrames

Reminder to please use it in a way that is consistent with how its used elsewhere in ADRIA. e.g., Always use NamedDimsArray over KeyedArray (the two are not equivalent).

I wasn't aware you could do that with julia operators so happy to do that for the tolerances

You can do this in most languages. The syntax will be different of course.

In your comment about criteria specific functions I probably used the wrong example.

My objection wasn't about the use of a function per se, it was about the need to manually code up the function call for each criteria.

How do you know which criteria is more complicated, and are there common steps that can be done beforehand?

For example, in the example you linked above, it still normalizes the values. Is this a common step that should be done beforehand?

Could this be done using an extension to the depth_priority index, where only the indices of the set of sites in the specified area are used?

I guess so - I've again hacked something so it works with ReefMod data for now - involves an extra argument to the run_site_selection function.

On this note - we should move away from having "sites" in our function names as we're trying to be far more generic and "sites" is an overloaded term. Something like ADRIA.select_locations() is short and understandable I think. Do you have any suggestions?

depending on the format of the outputs of JMcDM the seeding and shading ranks logging may change

Okay, please have a close look at JMcDM - including trying to use it by itself - and see if the needed changes are worth it. I'd also ask that you profile its performance. No point in doing all this work to incorporate it if it means ADRIA runs will take minutes instead of milliseconds.

Additionally, it may make sense to make the guided_site_selection generic to an intervention type - so we don't have to keep adding things if we want different interventions, we just change the criteria matrix and the rest stays the same.

Yes, that's one of the aims with this rewrite.

ConnectedSystems commented 1 year ago

criteria_tolerances(; cover=(0.5, <), heat=(0.3, >), ... so on and so forth ...)

Realised this would look neater if you swap the elements:

criteria_tolerances(; cover=(<, 0.5), heat=(>, 0.3), ... so on and so forth ...)

This way the order neatly implies x < 0.5, etc

Rosejoycrocker commented 1 year ago

Some notes from looking into comparing JMcDM and the functions we have currrently for the mcda.

Comparing the average runtime for performing site rankings for 2000 randomised decision matrices with randomised weightings, 9 criteria and 200 sites, the current mcda functions were 3-4X faster when comparing Topsis and almost 8X faster when comparing Vikor.

Rankings are the same for the Topsis algorithm, as expected, but slightly different for Vikor. This is probably because there are many versions of Vikor which have further steps to detrermine trade-offs in the final set of ranks etc, so the package probably uses a variation.

Comparing a set of 1024 scenarios using the mcda topsis and JMcDM topsis (without distance sorting), there was no significant difference in absolute coral cover:

topsis_diff

Comparing a set of 1024 scenarios using the mcda vikor and JMcDM vikor (without distance sorting), the JMcDM game higher coral cover in the middle of the time period on average:

vikor_diff

From this it seems there is some benefit for more complicated techniques like vikor, but no real benefit for less complicated techniques like topsis.

(In both figures light blue lines are summed difference in absolute coral cover over sites, dark blue is the average over scenarios)

ConnectedSystems commented 1 year ago

Hi Rose, how were the timings taken? Did you use BenchmarkTools?

Rosejoycrocker commented 1 year ago

I used time(), is it preferable to use BenchmarkTools?

ConnectedSystems commented 1 year ago

For certain things yes - I was more interested in how you collected timings.

time() as in you took the start time then subtracted it from the end time? Like tic/toc in MATLAB? Or do you mean the @time macro?

Rosejoycrocker commented 1 year ago

Yeah so I did time_start = time() and then dt = time()-time_start

Rosejoycrocker commented 1 year ago

The average time difference per run varied quite a bit depending on the number of runs for JMcDM but the mcda functions were always at least a little bit faster.

ConnectedSystems commented 1 year ago

Okay, I think it's okay but note that at a minimum the @time macro is preferred for performance profiling.

Calculating the time elapsed that way will give misleading results because of the JIT compilation process. Given you're doing it for each run and taking the average it's less of an issue I think but in future use @time instead, and do it twice at a minimum.

It's a good idea to first run the code you intend to profile at least once (unless you want to profile Julia's JIT-compiler)

https://docs.julialang.org/en/v1/manual/profile/#Basic-usage

On the whole I guess I'm not surprised that JMcDM would be a little slower - likely they're doing extra checks or some other process we're not doing. Did you look at the code to see what the difference is?

EDIT: Hang on, you said our "bare" VIKOR is 8 times faster? Either they're doing a lot of extra checks or the implementation isn't as efficient as it could be.

Rosejoycrocker commented 1 year ago

It seems to be from the average timings, but I can check again using @time and running the code befroe timing to avoid compile time.

I've looked into topsis and they use a loop for calculating scores, while we use matrix operations. I think they do this because they allow the user to have different objectives for each column (whereas we transform each column so it is always maximised), but it's not entirely clear.

I just checked the vikor code and they are using the same formulation as us but they use a nested loop where they loop through each criteria and alternative and then have if statements to check whether you are maximising or minimising. So I think that's probably why it's slower, plus you have to set up a MCDM object for each run which can't be set up outside of the time loop because it includes the decision matrix.

Additionally, they have a wrapping function which allows you to specify the method you want, which I used, but I imagine it might be slightly faster to use the actual underlying method function directly (e.g. vikor(settings) instead of mcdm(setting, method=vikor())

ConnectedSystems commented 1 year ago

if statements to check whether you are maximising or minimising. So I think that's probably why it's slower

Okay, so it's slower because of the additional flexibility. They could just prepare a vector of functions and apply that for each row/column as we discussed (remember, our conversation about how operators are functions - they could do the same but create purpose-specific functions for each row/column as needed - and do that once instead of checking every time in a loop).

Given what you found, what would you prefer - stick with our own or adopt JMcDM? You could always submit a PR to JMcDM as well to address this performance issue.

Rosejoycrocker commented 1 year ago

Despite it being quite a bit slower, it could be cool to have access to their larger range of functions and be able to explore different site selection methods. What do you think? They have a lot of methods I considered adding but did not because they would be time consuming to implement myself. I suppose we could keep our functions as a more efficient option as well as having the JMcDM functions, I think it would quite easy to make a wrapper so that the site selection functions could do either, but maybe that is too clunky?

Also, just checked using @time macro and calling the function before timing, mcda is still around 7.3X faster for vikor and 3.8X faster for topsis.

Rosejoycrocker commented 1 year ago

Following yesterdays meeting, it was decided to add Criteria structures particular to each model in use to specify the criteria which can be used in site selection. For example, ReefMod will have additional values for the weights of cots for seeding and shading :

Base.@kwdef struct ReefModCriteria{P,N} <: EcoModel
wave_stress::P = Param(1.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Wave Stress", description="Importance of avoiding wave stress. Higher values places more weight on areas with low wave stress.")
    heat_stress::P = Param(1.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Heat Stress", description="Importance of avoiding heat stress. Higher values places more weight on areas with low heat stress.")
    in_connectivity_shade::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Shade Connectivity", description="Higher values give preference to locations with high connectivity for shading deployments.")
    in_connectivity_seed::P = Param(1.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Incoming Connectivity (Seed)", description="Higher values give preference to locations with high incoming connectivity (i.e., receives larvae from other sites) for enhanced coral deployments.")
    out_connectivity_seed::P = Param(1.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Outgoing Connectivity (Seed)", description="Higher values give preference to locations with high outgoing connectivity (i.e., provides larvae to other sites) for enhanced coral deployments.")
    coral_space_seed::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Low Coral Cover", description="Higher values give greater preference to sites with low coral cover for seeding deployments.")
    coral_cover_shade::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="High Coral Cover", description="Higher values give preference to sites with high coral cover for shading deployments.")
    priority_seed::P = Param(1.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Predecessor Priority (Seed)", description="Importance of seeding sites that provide larvae to priority reefs.")
    priority_shade::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Predecessor Priority (Shade)", description="Importance of shading sites that provide larvae to priority reefs.")
    zone_seed::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Zone Predecessor (Seed)", description="Importance of seeding sites that provide larvae to priority (target) zones.")
    zone_shade::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Zone Predecessor (Shade)", description="Importance of shading sites that provide larvae to priority (target) zones.")
cots_seed::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="COTs (Seed)", description="Importance of seeding sites that are COTs management sites.")
cots_shade::P = Param(0.0, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="COTs (Shade)", description="Importance of shading sites that are COTs management sites.")heat_stress_tol::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Low Area Tolerance", description="Tolerance for low proportional space for seeding deployments.")
wave_stress_tol::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Low Area Tolerance", description="Tolerance for low proportional space for seeding deployments.")
    coral_cover_tol::P = Param(1.0, ptype="real", bounds=(0.75, 1.0), dists="unif",
        name="Risk Tolerance", description="Filters out sites with heat/wave stress above threshold.")
    use_dist::N = Param(1, ptype="integer", bounds=(0.0, 1.0 + 1.0), dists="unif",
        name="Use Distance Threshold", description="Turns distance sorting on or off.")
    dist_thresh::P = Param(0.1, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Distance Threshold", description="Sites selected by MCDA must be further apart than median(dist)-dist_thresh*median(dist).")
    top_n::N = Param(10, ptype="integer", bounds=(5.0, 50.0 + 1.0), dists="unif",
        name="Top N", description="Replaces a given deployment site with a top-ranked site if it does not satisfy the minimum distance threshold.")
    depth_min::P = Param(5.0, ptype="real", bounds=(3.0, 5.0), dists="unif",
        name="Minimum Depth", description="Minimum depth for a site to be included for consideration.\nNote: This value will be replaced with the shallowest depth value found if all sites are found to be deeper than `depth_min + depth_offset`.")
    depth_offset::P = Param(10.0, ptype="real", bounds=(10.0, 25.0), dists="unif", name="Depth Offset",
        description="Offset from minimum depth, used to indicate maximum depth.")
end

In the case of site selection without running the ADRIA ecological model, the Domain will include the criteria with keys matching those in the criteria dataframe structure. For example, if using the criteria type above, the key for in-coming connectivity would be 'in_connectivity'. The appended '_seed', '_shade' or '_tol' will indicate the parameter's usage.

In site_selection, the KeyedArray holding criteria can then be created by functions similar to those below:


function create_criteria_store(site_ids::AbstractArray; criteria...)

    criteria_matrix = site_ids
    for crit_key in keys(criteria)
        criteria_matrix = hcat(criteria_matrix, criteria[crit_key][site_ids])
    end
    return KeyedArray(criteria_matrix[:, 2:end], reefs=site_ids, criteria=collect(keys(criteria)))

end

function create_criteria_store(site_ids::AbstractArray, domain::Domain, criteria_df::DataFrame)
    criteria_names = collect(criteria_df.columns)

    criteria_names = replace.(replace.(criteria_names, "_seed" => ""), "_shade" => "")
    criteria_names = Symbol.(unique(criteria_names[!occursin.("tol", criteria_names)]))
    criteria_values = Tuple(getproperty(domain, criteria_n) for criteria_n in criteria_names)
    criteria = (; zip(criteria_names, criteria_values)...)

    return create_criteria_store(site_ids; criteria)
end

create_criteria_store takes the column names of the criteria dataframe row (which is created from sampling using ReefModCriteria), and uses these to select out the criteria from the domain to create a keyed array which will be used to construct decision matrices.

Inside site_selection, the criteria_store is created and input directly into guided_site_selection, which will create the decision matrix using criteria_store, and tolerances, a KeyedArray we mentioned above which maps criteria to filtered versions based on a rule:

function create_decision_matrix(criteria_store::KeyedArray, tolerances::KeyedArray)

    for tol_key in tolerances.tols
        rule = map(tolerances(tol_key), criteria_store(tol_key))
        criteria_store = criteria_store[rule, :]
    end

    return criteria_store
end

There would be a constructor function to create tolerances from the criteria dataframe (tolerance values would be identified as criteria_name+"_tol"), but I'm not sure how best to input the > and < specifications. Maybe this can just be an input into site_selection().

Decision matrices for particular interventions and their weightings will then be created from the criteria_store, criteria_df and the decision matrix created from criteria_store (A) :

function create_intervention_matrix(A::Matrix, criteria_df::DataFrame, criteria_store_keys::Tuple, intervention::String)
    # Get intervention criteria names
    int_crit_names = names(criteria_df)[occursin.(intervention, names(criteria_df))]
   # remove intervention name to index criteria store
    int_crit_names_store = replace.(int_crit_names,string("_",intervention)=>"")
    int_ind = [findall(criteria_store_keys.==Symbol(criteria_name)) for criteria_name in int_crit_names_store]

    # get correct columns from A
    S = A[:, int_ind]
     # get correct weights from criteria_df
    ws = normalize(Array(criteria_df[int_crit_names]))
    return S, ws
end
function create_seed_matrix(A::Matrix, criteria_df::DataFrame)
    return create_intervention_matrix(A, criteria_df, "seed")
end
function create_shade_matrix(A::Matrix, criteria_df::DataFrame)
    return create_intervention_matrix(A, criteria_df, "shade")
end

The "seed" or "shade" beside each criteria name in the criteria dataframe is used to identify the indices in A to use for the intervention matrix and the weights in criteria_df to use. The rest of guided_site_selection would be the same, with the possibility of using the matrix A and weights in a JMcDM method or one of the local mcda methods. Additionally, because we are passing in the criteria_df row, this will contain dist_thresh and min_dist, so we won't need these to be in mcda_vars and can get rid of the mcda_vars structure.

The main issue I can see with this is criteria need to be in the form they will be used in in the criteria matrix when they are stored in the Domain structure (when it is created). Because of this there might need to be separate Domains for when we are using standalone site_selection with ADRIA data vs the ecological model. But maybe we are unlikely to use standalone site selection for ADRIA without the ecological model now that we have other model datasets available?

Another issue is how to deal with heat stress, wave_stress and other time dependent layers. Do we treat heat_stress and wave_stress as separate, as we do now, and store all time steps in the domain and then select out the time step we want? This will mean that heat stress and wave stress are special cases because they cannot be put directly into the criteria_store constructor through the domain. Or do we only store the timestep we want in the domain? This will again require different domain structures for if we are doing standalone site selection with ADRIA or the full ecological model.

I recognise there's a lot of string operations which isn't great, but I reckon I can simplify what I have above by maybe finding the indices corresponding to seeding and shading in the criteria_df and criteria_store initially and passing these where needed (but also let me know if you have a better alternative).

Let me know what you think and if any of that made sense :)

ConnectedSystems commented 1 year ago

Thanks Rose, it's looking clearer. My comments below.

We discussed not using DataFrames for performance reasons.

As discussed above, please always use NamedDimsArray instead of KeyedArray. The two are functionally equivalent, but are of different types as far as Julia is concerned, and given we already use NamedDimsArray (and we have functions that expect that type), use that instead so ADRIA code is consistent.

if using the criteria type above, the key for in-coming connectivity would be 'in_connectivity'. The appended '_seed', '_shade' or '_tol' will indicate the parameter's usage.

criteria_names = replace.(replace.(criteria_names, "_seed" => ""), "_shade" => "")
criteria_names = Symbol.(unique(criteria_names[!occursin.("tol", criteria_names)]))

This will not be generic enough for our purposes. You're requiring us to go back and edit this line every time a different kind of intervention is added. Please see my comment at the end.

This is already a known issue right now (see #277).

Also, replace() already handles multiple conditions.

julia> x = "egg egg2 egg3"
"egg egg2 egg3"

julia> replace(x, "2"=>"A", "3"=>"B")
"egg eggA eggB"

For the create_intervention_matrix (fourth code block)

function create_shade_matrix(A::Matrix, criteria_df::DataFrame)
    return create_intervention_matrix(A, criteria_df, "shade")
end

Same issue as above really, this could be made much more generic - currently you'd have to come back and add another function (and its use) for every intervention have or decide to include in the future.

To resolve, you could change Domain to hold MCDA related data in a separate struct that includes metadata and select the relevant data set using that instead.

An approach that relies on string parsing is to adopt a standardized naming pattern and identify any/all relevant criteria types based on that. For example (taking a leaf from ReefMod naming conventions):

"ivconnseed" "ivconnshade"

Here, "iv" indicates an intervention. Note the double underscore to distinguish from a single underscore typically used to indicate a space.

Then match on "iv__" (excluding tol as you have above).

This is not a great approach but would "work" - ideally the Domain and MCDA-related types would work seamlessly together without having to rely on string comparisons.

Rosejoycrocker commented 1 year ago

Thanks Takuya.

My main reason for using the dataframe is to attempt to replace the use of mcda_vars with just the sampled criteria weights dataframe (in the case of standalone site selection), or param_set in the case of ADRIA the ecological model. I wanted to do this because most of the construction of mcda_vars is just reassigning elements of param_set or the sampled criteria dataframe.

A way around having to refer to particular intervention methods is perhaps then to separate out seed criteria and shade criteria prior to create_intervention_matrix(), so the input "criteria_df" (which can be changed to be a NamedDimsArray) only has criteria which will be used in seeding or shading. We need something which indicates which criteria are used in either intervention and so how I was going to do this was using the "seed" or "shade" at the end of criteria column names in the sampled dataframe (or param_set). Perhaps instead we could indicate this in the Domain? Maybe have an "interventions" variable. Which contains a Tuple of symbols indicating the criteria to be used for each intervention. The name of each tuple could indicate the intervention name (so that fieldnames(Domain.interventions) gives a list of the intervention names) which could be used to find the corresponding weights in the sampled criteria dataframe or param_set. What do you think?

ConnectedSystems commented 1 year ago

I think what you want to do is separate out the Criteria type

https://github.com/open-AIMS/ADRIA.jl/blob/77d8b3c664d59b0ba424aaaf3947987b6eaae292/src/ecosystem/Ecosystem.jl#L76-L113

I wanted to do this because most of the construction of mcda_vars is just reassigning elements of param_set or the sampled criteria dataframe.

Yes, hence why I don't think the mcda_vars is necessary.

Rosejoycrocker commented 1 year ago

Do you mean have a different Criteria type for different intervention types?

ConnectedSystems commented 1 year ago

It's not just the intervention criteria is it? Or is it just the two?

In any case you would define Criteria to be an abstract type, and then rename the current "Criteria" to something else, and add a new type, such that

SomeGenericCriteria <: Criteria
InterventionCriteria <: Criteria

Does that make sense?

Rosejoycrocker commented 1 year ago

In your example above, would you have

SomeGenericCriteria <: Criteria
SeedCriteria <: Criteria
ShadeCriteria<:Criteria

Or am I misunderstanding...

ConnectedSystems commented 1 year ago

Doesn't that mean you have to specify each intervention type yourself?

Think about what needs to happen to get the MCDA methods to work generically across any/all interventions.

Rosejoycrocker commented 1 year ago

I'm just not sure how to map weightings to criteria. So for example in the current criteria we have weightings for seeding and shading for the same criteria (e.g in_connectivity_seed, in_connectivity_shade) and then we have some which are unique to seeding and shading (coral cover and space for coral, which also have their own weightings). All of the weightings are in the same structure generated by sampling Criteria(), so we need some way of identifying which intervention they sit under. If we use criteria, we have to do some text matching to work out which are for which intervention, but this does require naming the intervention type in the criteria in some way, which is not very generic...

ConnectedSystems commented 1 year ago

Okay so we could extend the metadata attached to each criteria Param.

This would require modifying the functions which collate all the parameters (model_spec() is one, there may be other places) to ignore this Criteria-specific metadata.

wave_stress_tol::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Low Area Tolerance", description="Tolerance for low proportional space for seeding deployments.",
        iv_type=:wave
)

Something like that.

Rosejoycrocker commented 1 year ago

Okay cool, so with iv_type, could we put a tuple, like iv_type=(:seed,:shade) (for example, because wave_stress is used for both seeding and shading, and then add intervention types if we need?

ConnectedSystems commented 1 year ago

hmmm, okay that's not going to work as neatly as I thought (because you'd have to go and specify it yourself) but might be the best we can do for now.

I think it should be okay because we've agreed that at a minimum we'd write model-specific Criteria sub-types anyway.

Rosejoycrocker commented 1 year ago

Ok cool, we can work with this for now and if we think of something better we can change.

Rosejoycrocker commented 1 year ago

Actually, sorry just thought of something. If we have,iv_type=(:seed,:shade), can we still make sure when sampling that two values are sampled (one for seeding and one for shading) or is that not possible? If we do it this way I might have to rewrite the site_selection sampler so it can sample for a specified set of interventions (so if you want weights for seeding and shading, it knows to sample 2 values for each parameter with iv_type = (:seed,:shade). Is that getting too complicated?

ConnectedSystems commented 1 year ago

If we have, iv_type=(:seed,:shade), can we still make sure when sampling that two values are sampled (one for seeding and one for shading) or is that not possible?

I think I've confused myself.

In the example above:

wave_stress_tol::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="Low Area Tolerance", description="Tolerance for low proportional space for seeding deployments.",
        iv_type=(:seed, :shade)
)

Are you saying that if you were not to adopt this, you'd specify this twice? Something like:

seed_wave_stress_tol::P = ...
shade_wave_stress_tol::P = ...

Rosejoycrocker commented 1 year ago

Yes, so either way you have a specification of which are seeding or shading, but in the first case it's not in the variable name...

ConnectedSystems commented 1 year ago

Is wave_stress a poor example? Because I don't see an exact equivalent in the current implementation.

Regardless, the advantage of the above approach is so you don't have to do cumbersome and error prone string matching. You'd still specify the separate criteria, but filter parameters based on iv_type.

Rosejoycrocker commented 1 year ago

Yeah I think the above approach is definitely preferrable, but just pointing out that we'll have to adjust the sampling process because I think at the moment it creates one instance of each parameter for each scenario. With the new approach, in the case of , for example,in_seed_connectivity and shade_connectivity, these are weights for the same criteria just for seeding and shading. So we would define

in_connectivity::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="In connectivity ", description="Weighting for in coming connectivity criteria.",
        iv_type=(:seed, :shade)
)

And then 2 samples will be pulled for the one parameter per scenario, one for weighting in_connectivity while seeding and one for when shading. The same would happen for zone_shade and zone_seed and seed_priority and shade_priority. Unless we want to reduce the complexity and make the weightings the same for different interventions, but this may not be desirable for some interventions.

ConnectedSystems commented 1 year ago

What I meant was you'd specify the two anyway:

# except make the names/description more informative of course
in_seed_connectivity::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="In connectivity ", description="Weighting for in coming connectivity criteria.",
        iv_type=:seed
)
in_shade_connectivity::P = Param(0.2, ptype="real", bounds=(0.0, 1.0), dists="unif",
        name="In connectivity ", description="Weighting for in coming connectivity criteria.",
        iv_type=:shade
)

EDIT: Perhaps a more straightforward approach is to match on the type fieldname (in_shade_connectivity)? You can rename the fieldname using a pre-defined convention as we discussed above: (e.g., iv__shade__in_conn or similar)

Rosejoycrocker commented 1 year ago

Ah I see, ok cool, that makes things simpler.

ConnectedSystems commented 1 year ago

Please see my edit above

Rosejoycrocker commented 1 year ago

Sorry, could you expand on what you mean by match on the type fieldname(in_shade_connectivity) ?

ConnectedSystems commented 1 year ago

in_shade_connectivity was an example of a fieldname.

struct Example
   A  # <- this is a field, hence the fieldname is ":A"
end

Suggest you look up the documentation for fieldname() and getfield() as these will give you ideas on how to write functions to help access specific fields cleanly.

Rosejoycrocker commented 1 year ago

I know that it's a fieldname and you can use getfield to access a field of a particular name from a structure, but do you mean using this in place of iv_type?

ConnectedSystems commented 1 year ago

Yes, that's what I meant by "Perhaps a more straightforward approach is to match on the type fieldname" You could use iv_type, but this way you don't have to change anything else though as I mentioned earlier.

In case you were wondering, it's different from the example you had above as you're not relying on a dataframe of names.

However, this snippet here ...

criteria_names = replace.(replace.(criteria_names, "_seed" => ""), "_shade" => "")
criteria_names = Symbol.(unique(criteria_names[!occursin.("tol", criteria_names)]))
criteria_values = Tuple(getproperty(domain, criteria_n) for criteria_n in criteria_names)
criteria = (; zip(criteria_names, criteria_values)...)

... would mean conn_seed and conn_shade both resolve to conn. Was this intended?

Rosejoycrocker commented 1 year ago

That's because there's only one vector of in_connectivity values for each site (which is used to construct decision matrices), but it has 2 weights (in_connectivity_seed and in_connectivity_shade). So the idea was to use the "seed" and "shade" identifiers later to construct the weights vectors, but for the criteria matrix you don't use the seed and shade identifiers because you have overlap.

But with getfield I still don't quite understand how you'd do it. Can you use getfield with a wildcard, like getfield(struct,*seed)? The code above is used to identify the fields in Domain which contain criteria, as these don't have the same names as the weights in Criteria(). I thought if we have a consistent naming convention we'd be able to match the two, but there's still the issue that Criteria() doesn't map one-to-one with the criteria in Domain.

ConnectedSystems commented 1 year ago

Can you use getfield with a wildcard, like getfield(struct,*seed)?

You would define your own getfield() if this was the approach you wanted to take but I'd prefer to match against known identifiers. This is because something may change later (e.g., we add seed_field_that_should_not_match) and we get erroneous matches without noticing.

The code above is used to identify the fields in Domain which contain criteria, as these don't have the same names as the weights in Criteria()

The purpose of using a consistent naming scheme is to side-step that issue. Is this related to the discussion around the MCDA_vars type?

Rosejoycrocker commented 1 year ago

Even with a consistent naming scheme though, you need something which tells you which names are being used as criteria. If we get the list of names as fields from Criteria which contain "seed" or "shade", then to access them in domain we'd have to use the name without "seed" or "shade" in it. Unless we store the same criteria vectors under different names with seed and shade in them, but that seems like a waste.

ConnectedSystems commented 1 year ago

Even with a consistent naming scheme though, you need something which tells you which names are being used as criteria.

You mean like iv__*?

Rosejoycrocker commented 1 year ago

so I would write something that can use getfield with a wildcard, like getfield(Domain,iv__*) ?

open-AIMS / ADRIA.jl

Make MCDA functions generic #252