marcov64 / Lsd

Lsd repository on GitHub

Randomised processing of objects #10

Open FrederikSchaff opened 6 years ago

FrederikSchaff commented 6 years ago

In cases where up-to-date information is relevant to the process, the fixed-order updating in LSD is problematic. Currently this is circumvented by providing an explicit cycle over objects and processing them in random order. It would be good if one could instead flag objects as "randomised" (or if this were the default). A random-cycle macro would also be useful.

One way to implement this could be to change the current updating scheme as follows: first, create a queue of all the variables that shall be updated; next, randomise it; then process it as usual. Problems like coping with deleted objects would have to be taken care of. This could be done by combining it with the delete flag suggested in issue #9 "Delete Self".
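The proposed queue-based scheme might look roughly like this. This is a minimal sketch under assumptions: `Variable`, `randomized_step`, and the `updated` flag are hypothetical stand-ins for LSD's internal variable objects, not its actual API.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

// Hypothetical stand-in for an LSD variable due for update this time step.
struct Variable {
    bool updated = false;             // could also mark deleted-object variables
    std::function<void()> equation;   // the variable's update rule
};

// Proposed scheme: build the queue, shuffle it with the model RNG,
// then process it as usual, skipping anything already handled.
void randomized_step(std::vector<Variable*> &queue, std::mt19937 &rng) {
    std::shuffle(queue.begin(), queue.end(), rng);
    for (Variable *v : queue)
        if (!v->updated) {
            v->equation();
            v->updated = true;
        }
}
```

Seeding the RNG from the run's seed would keep individual runs reproducible even though the order differs between seeds.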

MCP1 commented 6 years ago

I understand your point. It can be useful in certain situations to be able to randomize the order of evaluation of variables, when more than one sequence is available. But there is a big "if" to be answered first: are the possible alternative computation sequences equivalent (in terms of representing the same model)? The general answer is no. Quite frequently, even a deterministic model (no stochastic components) produces different results when computed using different valid sequences. Currently, LSD picks the first "route" it finds and keeps it. Is this the "best" one? We don't know, but there is a simple strategy in place (see below) and one consistent result. Which are really the alternative routes? That depends heavily on the model structure itself. How could you reproduce your results or perform a Monte Carlo experiment in this situation? Imagine tracking a bug when every execution can take a different route, or trying to understand a complex emerging result.

Last, but pragmatically most important, this would require a complete redesign of the LSD scheduler. Currently, we use a recursive scheduler: we pick the first variable at the topmost object and compute it entirely, including any dependence it may have. Next, the second topmost variable is computed, if it was not already evaluated while the first one was calculated. And so on, until the last variable in the last object at the lowest level is updated, either directly (in the top-down order) or not (previously computed by an upstream update). In a typical model, more than 80% of the variables are updated by recursion, on the fly, and that order cannot be controlled programmatically. The other 20% are more likely statistics-gathering variables, for which the computation order is not relevant anyway (though they cannot be computed before their dependencies). For all those reasons, I guess such a change may be of limited value despite incurring significant coding and testing demands.
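The recursive, memoized resolution described above can be sketched as follows. This is illustrative C++, not LSD's actual internals; `Var`, `compute`, and the `last_update` bookkeeping are assumptions standing in for LSD's variable objects.

```cpp
#include <vector>

// Illustrative recursive scheduler: asking for a variable's value first
// updates its dependencies, so most variables end up computed "on the fly"
// during someone else's update and are skipped by the top-level loop.
struct Var {
    double value = 0.0;
    int last_update = -1;        // time step of the last computation
    std::vector<Var*> deps;      // variables this one reads

    double compute(int t) {
        if (last_update == t)    // already computed this step: reuse it
            return value;
        for (Var *d : deps)      // recurse into dependencies first
            d->compute(t);
        value += 1.0;            // stand-in for evaluating the equation
        last_update = t;
        return value;
    }
};

// The top-level loop walks variables in a fixed top-down order; anything
// already pulled in by recursion is a no-op thanks to the last_update check.
void step(std::vector<Var*> &all, int t) {
    for (Var *v : all)
        v->compute(t);
}
```

In this sketch, the update order of the recursively reached variables follows the dependency structure, which is why it cannot be controlled programmatically.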
On the other hand, designing the relevant (as per the model logic) multi-instance variables to be computed in random order is relatively simple to implement. If you have a particular situation in mind, let us know so we can suggest an approach. Most of the time, randomly sorting the desired object chain every time step would do. As a side note, maybe just allowing the parallel computation of the desired multi-instance variable will produce the desired results in some situations: when these variables are computed in parallel, there is no strictly identical updating sequence every time step.
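The parallel side note could look like the sketch below (hypothetical `Firm`/`parallel_update` names, not LSD code): each instance of the multi-instance variable is updated on its own thread, so the effective completion order varies between runs. The per-instance equations must not touch shared state unguarded.

```cpp
#include <thread>
#include <vector>

// Hypothetical multi-instance object whose variable we update in parallel.
struct Firm {
    double output = 0.0;
    void update() { output += 1.0; }   // stand-in for the equation
};

// Launch one worker per instance and wait for all of them; the scheduler,
// not the model, decides the interleaving, so no fixed sequence exists.
void parallel_update(std::vector<Firm> &firms) {
    std::vector<std::thread> workers;
    for (Firm &f : firms)
        workers.emplace_back([&f] { f.update(); });
    for (std::thread &w : workers)
        w.join();
}
```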

FrederikSchaff commented 6 years ago

I will try to respond to all issues raised.

The biggest points are the pragmatic issues, and I think it is possible to find a solution. A first simple help would be a random-order cycle (and a cycle-safe variant). I have implemented something along these lines in the GIS enhancement I am working on and could implement it right away. This would make manual control easier. But if you agree, I could also think about a complete randomisation of the default updating scheme (with a switch to turn it on/off).
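A random-order cycle along these lines could be built on a shuffled copy of the object chain. The `object` struct with a `next` pointer mimics LSD's linked object chains, and the `RCYCLE` macro name is a hypothetical suggestion, not an existing LSD macro.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical chain node in the style of LSD's linked object lists.
struct object {
    object *next = nullptr;
};

// Collect the chain into a vector and shuffle it with the model RNG,
// fixing a fresh random visiting order for this time step.
std::vector<object*> shuffled_chain(object *first, std::mt19937 &rng) {
    std::vector<object*> v;
    for (object *o = first; o != nullptr; o = o->next)
        v.push_back(o);
    std::shuffle(v.begin(), v.end(), rng);
    return v;
}

// A CYCLE-style macro could then wrap the helper, e.g.:
// #define RCYCLE(o, first, rng) for (object *o : shuffled_chain(first, rng))
```

A cycle-safe variant would additionally need to skip or flag objects deleted while the loop runs, since the vector holds raw pointers taken before iteration started.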

I am sure that if we want to get more people to use LSD, this is a crucial point.

MCP1 commented 6 years ago

Hi Frederik. Thanks for the detailed analysis.

I'm still not convinced and, anyway, I don't have the time to invest in this direction for now, as it is a significant effort. However, feel free to adapt LSD to your needs. If you come up with a solution that is end-user-ready, we'd be very happy to incorporate it into mainline LSD.

Just to provide more basis for my feelings, some comments on your points (no answer is necessary).

  1. Not many models control for that. I can't remember any. Mine, never.
  2. OK, but you can't eat your cake and have it too. To control the order, you have to care about the possible orders, and the current scheduler model is no longer useful. Also, we would need algorithms to find the set of all possible orders before picking one randomly in a robust way, which I guess is not a problem we can solve in polynomial time. Randomizing over an incomplete set of possible orders looks like a bad idea from a sampling/experimental perspective.
  3. This is the point when you are studying the particular subject of alternative processing orders, as you are. Most models simply don't want to deal with it, as the complexity of controlling this kind of thing grows exponentially with model size. The current recursive scheduler provides a "good-enough" evaluation sequence for most applications, as practice shows, and is flexible enough for modelers to control the order when required.
  4. If it is cumbersome when you know what you want to randomize, imagine doing it for arbitrary code, even assuming it is feasible in the first place. The approach you suggest seems limited both in terms of performance for any larger model (we would be randomizing blindly over things nobody wants randomized) and sample representativeness (as it does not exhaust all the possible orders, why would the proposed ones be the good/representative ones?). It seems more efficient to take care of the order when it is relevant and custom-design a solution for the problem at hand.
  5. True, but as soon as you turn the switch on, there is no way to find the remaining bugs not exercised by the default order. Bugs appearing only under particular execution orders of the equations are quite common.
  6. First, you control when to parallelize or not, at the object level. Second, knowing ex ante which sequences are critical in practice is very unlikely before facing real problems in larger models (and not a top question for most modelers). Third, you don't parallelize things that can't be parallelized, as LSD tells you (or crashes...). Fourth, the seed and the random number generators operate exactly the same under parallelization; you simply don't control the order in which object instances are updated (as you are proposing to do).
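On the combinatorial concern in point 2: the valid computation sequences of a model are the topological orders of its dependency graph, and even counting those is believed to be #P-complete, so enumerating them robustly is out of reach for any nontrivial model. A brute-force sketch (illustrative names, not LSD code) makes the blow-up concrete:

```cpp
#include <functional>
#include <vector>

// Count every valid computation order of a dependency DAG given as an
// adjacency list: dep[i] holds the variables that variable i reads.
// With no dependencies at all, this is n!, so the set explodes quickly.
int count_orders(const std::vector<std::vector<int>> &dep) {
    const int n = static_cast<int>(dep.size());
    int total = 0;
    std::vector<bool> done(n, false);
    std::function<void(int)> rec = [&](int placed) {
        if (placed == n) { ++total; return; }
        for (int v = 0; v < n; ++v) {
            if (done[v]) continue;
            bool ready = true;                 // all dependencies placed?
            for (int d : dep[v])
                if (!done[d]) { ready = false; break; }
            if (!ready) continue;
            done[v] = true;                    // place v next, then recurse
            rec(placed + 1);
            done[v] = false;                   // backtrack
        }
    };
    rec(0);
    return total;
}
```

Three independent variables already admit 3! = 6 orders; adding one shared dependency cuts that to 2, which is exactly why the set of valid routes depends so heavily on the model structure.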