If you need all "intervention" related fields, that's the idea, yes. It doesn't have to be a type-specific getfield either; that was just an off-the-cuff suggestion. Just a function would suffice here too.
Or you could use the iv_type
idea discussed above and find all parameters with the matching entry for that.
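Something along these lines is all I mean (just a sketch - the table and field names here are made up, not existing ADRIA structures):
using DataFrames

# Hypothetical parameter spec with an iv_type tag per entry
param_spec = DataFrame(
    fieldname=[:seed_TA, :fogging, :heat_stress_weight],
    iv_type=["seed", "shade", "mcda"],
)

# Pull out everything tagged as MCDA-related
mcda_params = param_spec[param_spec.iv_type .== "mcda", :]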
Ok cool, thanks I think I get it now :)
Hey @ConnectedSystems, for your ReefModDomain, do you create a new constructor function, or does the old constructor work? Just wondering because if I add criteria to say ReefModDomain or ADRIADomain then I'll have to add different constructors, which seems complicated...
Has to be a new constructor for each Domain subtype.
Just wondering because if I add criteria to say ReefModDomain or ADRIADomain then I'll have to add different constructors, which seems complicated...
Could you elaborate on this a little?
Right, ok. Just because there are 2 functions which work together to construct the domain, and it seems like I'll have to re-write variations of both of them to fill the intervention criteria keys in the domain structure. Just wanted to check there wasn't an easy way around this.
It sounds like what I suggested earlier would perhaps address this, but it requires the structure of Domain to change too.
In fact, I think that's preferable because the way things are at the moment is really messy and ugly.
Instead of the current constructor with every field listed out individually, you'd have something like:
Domain(name, rcp, env_layer_md, ..., some_type_name_that_indicates_relation_to_MCDA_and_also_holds_the_info_you_want, ...)
Then it's clear what's relevant for the MCDA.
Oh - it doesn't have to be a struct/type; probably a NamedTuple is what you're after.
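Rough sketch of what I mean (field names are purely illustrative, not the actual Domain definition):
# Group the MCDA-relevant info into one NamedTuple so the constructor stays readable
mcda_criteria = (
    heat_stress=rand(5),     # placeholder criterion values per site
    wave_stress=rand(5),
    iv_priority=zeros(5),    # could be left as placeholders and filled later
    iv_zones=zeros(5),
)

# Accessing a criterion later is then explicit
mcda_criteria.heat_stress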
For example though, if one of the new Domain variables is iv__priority or iv_zones, which require about 4 or more lines to be calculated, should these be calculated in the constructor or left as placeholders which can be filled later?
Wait hang on - could you give an example of what those are for?
Is there a need for them to be part of Domain
or was it just a placeholder/holdover from MATLAB?
These are criteria which are eventually part of the criteria matrix for the MCDA. iv__priority is the priority predecessors criterion and iv_zones is the priority zones criterion. I just thought they'd be stored in the Domain because we talked about storing all criteria in the domain for standalone site selection.
I just thought they'd be stored in the Domain because we talked about storing all criteria in the domain for standalone site selection.
The thinking there was that the relevant parameters have to at least be accessible via the Domain.
But specifically the ones you mention here, are they calculated just once and used throughout or are they to be calculated before selection/model runs?
What I'm trying to determine is how Domain.some_mcda_parameter_set
would work and how appropriate it is to shove everything in there if they're going to be recalculated or thrown away anyway.
Yeah they're just calculated once but the calculation is about 4 lines for iv__priority and longer for iv_zones. Should the ones which are recalculated, like coral cover, be put as placeholders in Domain and then added when needed?
Should the ones which are recalculated, like coral cover, be put as placeholders in Domain and then added when needed?
Specifically in the context of the standalone deployment location identification process?
Yes, like how we have pre-allocated arrays for use with the ODE.
But maybe not in Domain itself; rather within an MCDA-related field? That would clean up the messy constructor.
So maybe in the standalone process we include variables for each of the criteria (which won't have to be recalculated because there's no time dependence) and then in the ecological model we maybe just construct the NamedDimsArray of criteria first and then update this - does that sound reasonable?
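Something like this is what I'm picturing (just a sketch - names and sizes are made up):
using NamedDims

n_sites = 5

# Build the criteria array once from the Domain...
criteria = NamedDimsArray{(:sites, :criteria)}(zeros(n_sites, 3))
criteria[:, 1] .= rand(n_sites)   # heat stress (static, filled once)
criteria[:, 2] .= rand(n_sites)   # wave stress (static, filled once)

# ...then only overwrite the time-dependent entries in the ecological model
criteria[:, 3] .= rand(n_sites)   # coral cover for the current scenario/timestep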
Yep, reasonable
Hey, a question regarding how you want to use run_site_selection with ReefMod - do you need (or generally see the need in potential future usage) to be able to have different initial coral covers for a given number of criteria samples and domain? We currently have this but as I'm making changes I thought I'd see if there's anything we're unlikely to need...
Just tagging you @ConnectedSystems (and no rush to read because I know you're working on that presentation this afternoon), because I'm floating some better formats for the output of site_selection and run_site_selection (currently unintuitive and difficult to use when aggregating data).
I'm considering changing the outputs to (no. of scenarios, no. of sites, 2) and instead of ranks as the entries we have the site names in order from highest ranked to lowest ranked, following the same syntax as keys for the NamedDimsArrays holding results such as coral cover etc. Sites which are filtered out will be replaced with a placeholder (maybe NaN or whatever you think is best) so that the size of each scenario set is consistent. These scenario vectors of site names could then be used to directly index results arrays by discarding placeholder values. Ranks can then instead be the keys for the 2nd dimension of the output matrix, so if you want the n highest ranked sites you access site_selection_output[ranks=collect(1:n)].
do you need (or generally see the need in potential future usage) to be able to have different initial coral covers for a given number of criteria samples and domain?
Yes, potentially.
Sites which are filtered out will be replaced with a placeholder (maybe NaN or whatever you think is best) so that the size of each scenario set is consistent.
We were using $n_{sites} + 1$ to indicate locations which were not considered, but this wasn't consistent.
I'm considering changing the outputs to (no. of scenarios, no. of sites, 2) and instead of ranks as the entries we have the site names in order from highest ranked to lowest ranked, following the same syntax as keys for the NamedDimsArrays holding results such as coral cover etc.
I can't quite visualize what's meant here so bear with me. If the current structure is $\text{scenarios} \cdot \text{sites} \cdot \text{[seeding, shading]}$ (where the final dimension is to be expanded to three to explicitly indicate fogging vs SRM) then what would the new structure be? $\text{ranks} \cdot \text{scenarios} \cdot [\text{seeding, fogging, SRM}]$ ?
Ok cool, I'll keep the potential for multiple coral covers in - I'll do this by creating the criteria storage structure initially from the domain and then replacing the coral_cover which is loaded from the domain for each new site selection scenario (in the case of run_site_selection).
Also, I'm using the most current version of run_site_selection from Main (through a recent rebase), and there is an additional site filter so you end up with considered_sites
being your final vector of site_ids to be considered in site_selection. Is this something added for ReefMod? I'm currently testing with ADRIA data and had to comment it out for testing because I got empty site_id vectors. What is the correct way to use the considered_sites
variable?
I can't quite visualize what's meant here so bear with me. If the current structure is scenarios⋅sites⋅[seeding, shading] (where the final dimension is to be expanded to three to explicitly indicate fogging vs SRM) then what would the new structure be? ranks⋅scenarios⋅[seeding, fogging, SRM] ?
Yeah, so you might have an output that looks something like this for 3 scenarios (I've only shown the seeding dimension):
3-dimensional NamedDimsArray(KeyedArray(...)) with keys:
↓ scenarios ∈ 3-element UnitRange{Int64}
→ ranks ∈ 216-element Vector{String}
◪ intervention ∈ 3-element Vector{String}
And data, 3×216×3 Array{Any, 3}:
[:, :, 1] ~ (:, :, "seeding"):
(rank 1) (rank 2) (rank 3) ... (rank 216)
(1) "Briggs_BR_C_1" "Briggs_BR_OF_1" "Briggs_BR_OF_2" ... "Thetford_TR_S_7"
(2) "Briggs_BR_OF_1" "Briggs_BR_C_1" "Briggs_BR_OF_2" ... NaN
(3) "Briggs_BR_OF_1" "Briggs_BR_OF_2" "Briggs_BR_C_1" ... NaN
[:, :, 2] ~ (:, :, "fogging"):
...
I've used NaN here to indicate that some sites were filtered, so there's a smaller set of ordered sites than 216. Each row is a scenario containing site ids ordered according to how they were ranked. So the top 3 sites in the first scenario were ["Briggs_BR_C_1" "Briggs_BR_OF_1" "Briggs_BR_OF_2"]. Does that kind of make sense?
We don't have to use the full site names, we could use indices of the site_ids vector instead if that is easier in terms of using them in results aggregation.
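For example, with indices you could pull out results for the top-n ranked sites of a scenario like this (a sketch with made-up values):
n = 3
ordered_sites = [4, 1, 5, 2, 0]   # ordered site indices for one scenario; 0 = filtered-out placeholder
top_n = ordered_sites[1:n]

cover = rand(10, 5)               # placeholder result set (timesteps x sites)
top_cover = cover[:, top_n]       # coral cover trajectories at the top-n ranked sites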
Is this something added for ReefMod?
Yes - see usage below. The reason was that although ReefMod simulates the GBR, for the specific simulations we were working with we only wanted to deploy within specific regions.
From memory I ran into issues with an empty location list, but this was due to a different reason somewhere in the MCDA process (can't remember what). You should do a comparison between the "ADRIA-ReefMod" branch and "Main" because there's a catch in run_site_selection()
to consider all available locations if the list of target locations is empty.
This was a tentative implementation (read: I needed something to work quickly) and definitely needs to be cleaned up. I dislike having this potentially infinite number of optional keyword arguments (and yes, I know I used "sites" again - these should be "locations"; like I said, rush job).
# Target priority sites
# - must be in the small reef size class
# - must be in the Cairns region
# - must have > 0.01 km² of `k` area
subset_of_sites = findall((dom.site_data.size_group .== 1) .& (dom.site_data.cairns_region .== true) .& ((ADRIA.site_k_area(dom) .* 1e-6) .> 0.01))
@info "Selecting from $(length(subset_of_sites)) locations"
ranks = ADRIA.run_site_selection(dom, scens, sum_cover, area_to_seed, ts; target_seed_sites=subset_of_sites')
Ok cool, thanks. I'll check out "ADRIA-ReefMod" to understand it and integrate it with what I have. I also need to go through my code and make sure I haven't accidentally referred to sites instead of locations anywhere :'D.
3-dimensional NamedDimsArray(KeyedArray(...)) with keys: ↓ scenarios ∈ 3-element UnitRange{Int64} → ranks ∈ 216-element Vector{String} ◪ intervention ∈ 3-element Vector{String}
There's still the question of what the user is meant to do with the results here. What are your current thoughts?
My thought was that, because they are the site ids or indices in order of their rank, you can use them directly to index a result set. So if you wanted to access coral cover for the top 10 ranked sites you just take the site ids for the top 10 directly. Trying to address your comment that "Currently run_site_selection() requires the user to do a lot of gymnastics to get a usable list of deployment locations." in #324.
What would top 10 mean though? If they run 5 different scenarios with 5 different top 10s, what would the next step be?
Top 10 ranks. So you could compare the coral cover at the top 10 ranked sites for each scenario. You could also use this to calculate the frequencies of selection for different sites.
This goes back to the "gymnastics" comment though. Are you envisioning the user does all this by hand or are you thinking of a report function that provides those stats?
How would the user judge what the frequencies mean / relate to ?
I can write a function that does that; from your comment it just seemed like you wanted changes to the output format of site_selection. But I can do that.
Well keep in mind this issue and discussion therein: https://github.com/open-AIMS/ADRIA.jl/issues/324
See the last bit of my first post
Edit: the bit starting with this paragraph
Then run the selection process with 128 - 1024 criteria combinations and picking out locations that were ranked the most times for a given rank position.
I can write a function that does that
What I'm wondering is more to do with appropriateness. Is this what you want the user to do / how you want the user to use the rankings?
Well, I thought it would be better to have the site orderings as raw output so there's flexibility, and then have a series of aggregation functions which can be used if further analysis is needed, like a site frequencies one, etc.
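e.g. a site frequencies helper could look roughly like this (a hypothetical function, not something in ADRIA yet):
# Count how often each site appears within the top-n ranks across scenarios,
# given the proposed (scenarios, ranks) layout of ordered site indices
function selection_frequency(ordered_sites::AbstractMatrix{<:Integer}, n_sites::Int, n::Int)
    freq = zeros(Int, n_sites)
    for scen in eachrow(ordered_sites)
        for s in scen[1:n]
            s > 0 && (freq[s] += 1)   # skip placeholder entries (0 = filtered out)
        end
    end
    return freq
end

# 3 scenarios, 5 sites, top-2 ranks
ordered = [4 1 5 2 0;
           1 4 2 5 0;
           4 2 1 5 0]
selection_frequency(ordered, 5, 2)   # => selection counts per site id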
Hey @ConnectedSystems, just an update on using a Dirichlet distribution for the weighting parameters. I had a better look into the parameterisation of the distribution (it's been a while since I've used one). The parameter it uses, $\alpha$, is a concentration parameter, so it determines whether each variable in the multivariate distribution has a homogeneous or non-homogeneous spread. To get an effectively uniform distribution across each variable we'd use $\alpha_i = 1$ $\forall i = 1, \ldots, n_{var}$, where $n_{var}$ is the number of weights. This should be fairly simple to set up, so if you're happy I can start to implement it.
Wait sorry, how will this be implemented?
If it's simply changing the distribution of the weights, you just have to modify the Params
entry for the Criteria
type.
It's a little bit more complicated because it's a multivariate distribution, so you have an $\alpha$ for each of the weights (it will be 1 for a uniform marginal dist) but the sample is drawn as a vector which will sum to one. I'm thinking maybe we have dists="dirichlet" and then alpha=1 instead of "bounds" in Ecosystem. Then we add some extra lines in sample() for dists = "dirichlet", which gather all alpha values for variables with this type of distribution and then draw a sample vector for each realisation. Happy to leave this for a separate issue though if you'd prefer?
Yes, I say leave it for now. Sobol' is a deterministic sampling approach and mixing this in with it will destroy the sampling scheme.
I'm also unsure how this addresses the normalization issue so I'll have to read up on it.
Ok cool, I'll leave it. I realised it's also a little more complicated than what I wrote above, because we'd have to draw one vector for the seed weights and one for the shade weights, etc., which is difficult because we have overlap in these parameters (e.g. the heat stress and wave stress weights are both used for seeding and shading, which would break the independence). There's a way of drawing the sample based on Gamma distributions, so you reduce it to a univariate distribution problem and then normalise by a Gamma-distributed value whose alpha is the sum of the other gamma distributions' alpha values... but yeah, more complicated than I first thought.
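The Gamma construction I mean is roughly this (a sketch only, not how it would sit inside sample()):
using Distributions

alpha = ones(4)                       # alpha_i = 1 gives an effectively uniform spread
g = [rand(Gamma(a, 1.0)) for a in alpha]
weights = g ./ sum(g)                 # normalised weights, sum to 1

# Equivalent draw straight from the multivariate distribution
weights2 = rand(Dirichlet(alpha))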
Should I write up an issue?
Please go ahead and raise an issue.
Hey @ConnectedSystems, so I just read through our previous comments to work out what I've addressed and what I haven't and we mentioned maybe making the site selection functions generic to intervention. This is quite built in at the moment so I thought I would detail how I think it could be done before going ahead. At the moment guided_site_selection (now called guided_location_selection) takes in prefseedsites, prefshadesites, logseed and logshade. I'm thinking that as we already use logseed and logshade in scenario to decide when to seed and shade, we remove these as inputs in guided_site_selection and put calls to guided_site_selection inside these shading/seeding if statements. Instead of prefseedsites and prefshadesites inside guided_site_selection, we just have prefsites and an additional string which indicates the intervention type (which will match the ends of the weights names in Criteria{}). We then call guided_site_selection twice if we are choosing sites for seeding and shading separately in scenario and once if we are using the same sites for seeding and shading. What do you think?
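Roughly, the call pattern I'm imagining would be (a sketch only - argument names are placeholders, not a finished signature):
# A single preference vector plus an intervention tag, called once per intervention
function guided_location_selection(dom, criteria, pref_sites::Vector{Int}, iv_type::String)
    # ...rank locations using only the weights whose names end with iv_type...
    return pref_sites  # placeholder body
end

# Inside scenario(), roughly:
# if seeding at this timestep
#     pref_seed_sites = guided_location_selection(dom, criteria, pref_seed_sites, "seed")
# if shading at this timestep
#     pref_shade_sites = guided_location_selection(dom, criteria, pref_shade_sites, "shade")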
Regarding the code itself, please remember to update variable names to snake_case
as you come across older naming styles.
Ah yes, I'll look into separating fogging and shading.
What do you mean by the process to consider all interventions together? Currently the seeding and shading decisions are either evaluated separately or the same set of intervention sites is used for seeding and shading (decided using the seeding matrix). Do you mean the fact that very similar decision matrices are used for seeding and shading? If so, we are definitely keeping this.
What do you mean by the process to consider all interventions together?
Sorry, getting ahead of myself. What I was thinking was that depending on how the ships are configured, it may be that we have to consider the deployment locations for seeding/shading/fogging all together. For example, if a single ship can fog and seed then we'd want fog/seed locations to be somewhat in close proximity to reduce costs. I don't know if this is actually the case, it's just an option we'd want to keep open.
Right, I see, thanks. One way of incorporating this is allowing multiple interventions to be considered in one decision matrix, so for example criteria tagged as 'shade' and 'fog' are all used in the same matrix which is solved for ranks. In terms of 'close by' sites instead of sites that satisfy criteria for both interventions, you could probably select for one intervention first (such as seeding) and then maybe do distance sorting of the sites for the second intervention based on their distance from the sites selected for the first... but this would take some alterations to the distance_sorting function.
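e.g. something roughly like this (not the existing distance_sorting function, just an illustration of the idea):
# Order candidate sites for the second intervention by their minimum distance
# to the sites already selected for the first intervention
function order_by_proximity(candidates::Vector{Int}, selected::Vector{Int}, dist::Matrix{Float64})
    min_dist = [minimum(dist[c, selected]) for c in candidates]
    return candidates[sortperm(min_dist)]
end

# Example with a random symmetric distance matrix over 6 sites
d = rand(6, 6); d = (d + d') / 2
order_by_proximity([2, 4, 6], [1, 3], d)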
Hey @ConnectedSystems, I'm in the process of implementing shading separately from fogging. I'm wondering if we need a separate variable from n_site_int which designates how many clusters to shade. For example, the Moore set has 5 clusters, so you'd just shade everything if you select the 5 best clusters to shade. Should I add a new variable or should we start simple and assume you only shade one cluster? (so you pick the highest ranked in the MCDA)...
Start simple and just re-use n_site_int
for now please. For Moore it makes sense to shade everything because it's a small area.
I'll do that, thanks
Hey @ConnectedSystems , do we want to keep track of the shading ranks like with seeding and fogging? If so, it makes things a bit complicated because the size of ranks is the number of sites not the number of clusters. I could just store in the upper part of the matrix but then the site ids wouldn't correspond to anything. I'm trying to avoid adding a special case so it's easier adding interventions in the future...
Flagging this PR is ready for review, but it's probably best to resolve #329 first so I can correct the definition of connectivity in the MCDA criteria.
Closing this issue as I think this has become stale or otherwise superseded by other work.
Make MCDA functions generic to any number of criteria, and use auto-generated names as opposed to the current index-based approach which is prone to errors when making adjustments.
create_decision_matrix
create_seed_matrix
Could make changes to adopt JMcDM as suggested in #242 instead.