Decide on choice of fusion variables and best approach

We likely want/need to change the set of fusion variables. I wrote the script to exclude a variety of donor variables that are either variants of other variables (e.g. different units) or just not relevant to any imagined use cases. My goal was to make things manageable for testing purposes.

One particular challenge is how to include fusion variables that are components of a larger consumption "summation" variable (like total electricity). Ignore expenditures for the moment (see #).

Option 1: Model only the component variables (electricity for space heating, etc.), then sum them afterwards to create the total electricity variable. Alternatively, see if fuse() correctly detects the linear relationship and calculates the total electricity variable itself (probably won't work and isn't necessary anyway). The problem is that total electricity consumption, say, is likely a powerful predictor in general, more powerful than the components, and under this option it is never modeled directly. Moreover, there's no guarantee that a total built by summation will have an accurate distribution, compared with one that is modeled directly. And ultimately we care about total electricity consumption more than the component values. I think this is worth trying, at least. The key is to determine whether the total electricity variable summed from the components looks realistic.
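A rough way to check whether the summed total "looks realistic" is to compare its quantiles against the total observed in the donor data. The sketch below is only an illustration in plain Python, not part of the fusion code: the function names `summed_total` and `quantile_gaps` are hypothetical, the numbers are made-up toy values, and it assumes all observed quantiles are positive.

```python
from statistics import quantiles

def summed_total(component_rows):
    """Sum each household's fused component values into a total."""
    return [sum(row) for row in component_rows]

def quantile_gaps(simulated, observed, n=10):
    """Relative gap between simulated and observed quantiles.

    Returns n - 1 values; values near 0 mean the two distributions
    agree at that cut point. Assumes observed quantiles are nonzero.
    """
    sim_q = quantiles(simulated, n=n)
    obs_q = quantiles(observed, n=n)
    return [(s - o) / o for s, o in zip(sim_q, obs_q)]

# Toy values for illustration only (not real survey data).
# Each row: fused components for one household.
fused_components = [
    [3000.0, 1500.0, 2500.0],
    [1000.0, 800.0, 4200.0],
    [500.0, 1200.0, 3300.0],
    [2200.0, 900.0, 5100.0],
]
observed_totals = [6800.0, 6100.0, 5200.0, 8300.0]

sim_totals = summed_total(fused_components)
gaps = quantile_gaps(sim_totals, observed_totals, n=4)
print(gaps)
```

Large gaps in the tails would suggest that summation distorts the distribution of the total, which is the main risk with this option.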

Option 2: Do an initial fusion that ignores component variables; i.e. first do a fusion with total electricity along with all other non-component variables. Then do a second fusion of just the component variables, maybe scaled so they are proportions that sum to 1 for each household (just to make things easier). Total consumption acts as a predictor variable in the second fusion step. The fused/simulated proportions will not sum to 1 at the household level (though hopefully they aren't wildly far from 1), so you will have to rescale them post-fusion to sum to 1 and then multiply by total consumption. The advantage here is that we have more confidence in the distribution of the total variable, which we probably care about more than the component variables.
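The post-fusion rescaling step in Option 2 is simple arithmetic, sketched below for a single household. This is a minimal illustration in plain Python rather than the actual fusion code: `rescale_components` is a hypothetical helper, and the component names and values are made up.

```python
def rescale_components(fused_props, total):
    """Rescale fused proportions to sum to 1, then convert to consumption.

    fused_props: dict of component name -> fused proportion (need not sum to 1)
    total: fused total consumption for the same household
    """
    s = sum(fused_props.values())
    if s <= 0:
        # No usable information on the split; caller must decide what to do.
        raise ValueError("fused proportions must have a positive sum")
    return {name: total * p / s for name, p in fused_props.items()}

# Made-up example: fused proportions sum to 1.10 instead of 1.
household = {
    "space_heating": 0.42,
    "water_heating": 0.18,
    "cooling": 0.25,
    "other": 0.25,
}
components = rescale_components(household, total=9000.0)
print(components)
```

After rescaling, the components add up to the fused total by construction, so the total keeps whatever distribution it had in the first fusion step.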

I don't think there is any substitute for just trying both of these approaches and seeing which generates more defensible results.

Regardless of approach, try to keep the number of component variables to the minimum relevant for expected use cases. I suspect fusion will work better, in general, if there are only 5 components summing up to total electricity as opposed to 10.

