padpadpadpad / nls.multstart

Non-linear regression in R allowing multiple start parameters

Data input, parameter bounds and runif #3

Closed · mathesong closed 6 years ago

mathesong commented 6 years ago

I spotted your package when it popped up on my Twitter feed. It looks really great, and it solves a problem I've been meaning to tackle for my pharmacokinetic modelling package (https://github.com/mathesong/kinfitr). As such, I would very much like to make use of your package in mine. There are just a few changes/enhancements I would like to make; I'd be happy to submit them as pull requests. I just wanted to check that they are OK with you first.

(by the way, I've never made a PR before, so please tell me if I should be doing anything differently here)

Data input argument

I would like to be able to pass the data as vectors rather than as columns of a data frame in some cases. I see that the minpack.lm package achieves this by using parent.frame() as the default for its data argument. This allows the data to be entered flexibly, either as vectors referenced directly in the formula or as columns of a data frame.
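Roughly, the pattern looks like this (a minimal sketch; fit_model() is a hypothetical wrapper, though data = parent.frame() really is the default in minpack.lm::nlsLM()):

```r
# Sketch of the parent.frame() pattern; fit_model() is a hypothetical wrapper.
# When `data` is not supplied, the variables in the formula are looked up in
# the caller's environment instead of a data frame.
fit_model <- function(formula, data = parent.frame(), start) {
  minpack.lm::nlsLM(formula, data = data, start = start)
}

x <- 1:10
y <- 2 * x + rnorm(10, sd = 0.1)

fit_model(y ~ a * x, start = list(a = 1))                           # vectors
fit_model(y ~ a * x, data = data.frame(x, y), start = list(a = 1))  # data frame
```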

Parameter bounds

You implemented the parameter bounds input as a single vector of the form lower1, upper1, lower2, upper2. This differs from the underlying minpack.lm input, and it makes it a little harder to reuse bounds vectors that would otherwise be passed straight to minpack.lm. Is there a particular reason for interleaving them? If not, would it be OK for me to separate the lower and upper input arguments for consistency with minpack.lm?
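To illustrate the difference (parameter names here are just examples):

```r
# Current interleaved layout: lower1, upper1, lower2, upper2, ...
bounds <- c(-10, 10, 0.1, 2)

# Separate vectors, matching what minpack.lm::nlsLM(lower = ..., upper = ...)
# takes directly:
lower <- c(lnc = -10, E = 0.1)
upper <- c(lnc = 10,  E = 2)
```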

runif shotgun

You implemented a shotgun approach that draws starting values for each parameter from a uniform distribution between its bounds. Another possibility for scanning the parameter space would be to divide the range between each parameter's lower and upper starting limits into x equally spaced values, and then fit the model from every combination of these starting points. This would ensure that the starting parameters cover the whole bounding space. I would like to include this option so that fewer iterations are needed while still covering the full space.
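The idea, in a minimal sketch with two made-up parameters:

```r
# Equally spaced starting values for each parameter; every combination
# of them becomes one set of starting parameters.
strt <- expand.grid(
  a = seq(0, 10, length.out = 5),  # 5 equally spaced starts for a
  b = seq(1, 2,  length.out = 5)   # 5 equally spaced starts for b
)
nrow(strt)  # 25 starting-parameter sets covering the whole bounding space
```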

Again, the package looks great, and I'm really excited to try to implement it in my functions.

All the best, Granville

padpadpadpad commented 6 years ago

Hi Granville

Thanks for this post and for using nls.multstart. Firstly, I have very rarely done PRs myself, so I do not really know what best practice is anyway!

All of these suggestions sound sensible to me. Regarding the parameter bounds, I am not sure why I did it that way in the first instance! It made sense in my head at the time, I imagine.

Keep me updated on how you are doing and whether there is anything I can do. I am not sure how interpretable / optimal my code is either, but hopefully it isn't too bad.

It might be easier to implement each of the changes in turn and submit a PR for each?

Cheers Dan

PS. nls2 might also be useful to you for fitting non-linear models with multiple start values. I still get more reliable and reproducible results with the methods in nls.multstart, but probably at the cost of a fair bit of speed.

mathesong commented 6 years ago

Hi Dan,

Thanks for the quick response!

Re the parameter bounds: it is a pretty intuitive way to specify them when thinking a problem through, since one considers the bounds parameter by parameter. But as soon as I was doing it programmatically, working with separate vectors of upper and lower limits for the fitting procedure, it became a lot messier to translate them into that interleaved structure.

Also, thanks for directing me to nls2 - I had a look, but I prefer your implementation. That said, there are some nice ideas to be gleaned from it! I will keep it in mind.

Really looking forward to it! I'll keep you informed. And great idea re a PR for each!

Cheers, Granville

padpadpadpad commented 6 years ago

Update

The first PR has been accepted.

I shall have a go with a new way of specifying parameter bounds at the weekend. Maybe with start_low() and start_high().

mathesong commented 6 years ago

Hey hey,

I spent a little time looking at a gridstart approach, and have implemented it. Specifying a single value for iter keeps the current shotgun/random-search approach, while specifying a vector for iter of the same length as the number of parameters creates a grid of equally spaced values through each parameter's range and combines them into a set of starting parameters.
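At the interface level, that would look something like this (illustrative calls, assuming the existing nls_multstart() interface and a model with four parameters):

```r
# Shotgun: 500 random starts drawn between the bounds
nls_multstart(..., iter = 500)

# Gridstart: 3 equally spaced starts per parameter, 3^4 = 81 combinations
nls_multstart(..., iter = c(3, 3, 3, 3))
```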

In the process, I had to take the creation of the starting parameters out of the for-loop and specify them all before fitting starts. I realised that this also means I can do the fitting with a purrr::map() call that works through the different starting parameters. That, in turn, should make it really easy to use future to parallelise the process. It should also shorten the code quite a bit.
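Something along these lines (a sketch, not the actual package code; fit_one() is a hypothetical helper that runs one nlsLM fit from a given start and returns NULL on failure):

```r
# strt is a data frame of starting values, one row per candidate start
fits <- purrr::map(split(strt, seq_len(nrow(strt))), fit_one)

# score each successful fit and keep the best one by AIC
aics <- purrr::map_dbl(fits, ~ if (inherits(.x, "nls")) AIC(.x) else Inf)
best <- fits[[which.min(aics)]]

# swapping purrr::map() for furrr::future_map() would parallelise this step
```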

However, this leads me to a few questions.

  1. There's a count, which checks whether the AIC has been beaten in the last 100 fits. This is a nice way to have a convergence-like criterion, so that fitting can stop early when the current AIC is no longer being improved upon - I guess this is the reason for it? It would not be possible when using map(), though, and it would not be possible to parallelise. It also would not make sense for a gridstart approach at all, since the main point of that approach is consistency across different fits (one might get a low value for one particular fit if one were unlucky with the runif() parameter choices), so it would be important to do all the iterations. This just means that people must think a little more carefully before deciding on the number for iter. Would it be OK for me to remove the count?

  2. When you do the fits, you save only the AIC and not the model. Is this to make sure that memory is not filled up? I was originally thinking I could use the map() call to create a nested column with all the fits, then mutate() an AIC column (or even extract the coefficients with map_df()) and select the winning fit. Alternatively, I could extract both the coefficients and the AIC for each fit as it goes along, and then re-fit the model with the winning starting parameters. I think I prefer the second implementation. It also means that we could later adopt a different criterion for the winning model, such as the model whose parameters rank closest to the median across all fits. Can I go ahead with this?

  3. As far as I can understand from the Wikipedia page (quoted below), the AICc adjusts the AIC for small sample sizes; however, for the same n and the same number of parameters, the AICc differs from the AIC only by a constant, so the delta AICc between fits is identical to the delta AIC for the same data. This means that the AIC and the AICc will always select the same winning starting parameters (see the small illustration after the quote). Please correct me if I'm wrong, but if this is the case, then we could potentially remove MuMIn from the dependencies, and it would be one less package for people to install.

Note that if all the candidate models have the same k and the same formula for AICc, then AICc and AIC will give identical (relative) valuations; hence, there will be no disadvantage in using AIC, instead of AICc.
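To spell it out (k here is the number of estimated parameters including the residual variance, as MuMIn counts it):

```r
# AICc = AIC + 2k(k + 1) / (n - k - 1)
# With k and n fixed across candidate fits, the correction term is a constant,
# so ranking fits by AICc is identical to ranking them by AIC.
aicc <- function(fit) {
  k <- length(coef(fit)) + 1  # estimated parameters plus residual variance
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}
```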

I'll get onto this as soon as I have your okay to implement these changes.

Cheers, Granville

padpadpadpad commented 6 years ago

Hi Granville

You have been busy! 🔥 😄 🔥

What code are you using to create the grid?

I suppose a few of these implementation decisions depend on how we define the "best" model fit for a given dataset. I have only ever used the AIC score for this, tbh.

  1. That is exactly what count does. Essentially, if the fit is not improved upon in terms of the AIC/AICc score for 100 iterations, the loop ends early (sketched after this list). I had a long think about how to un-for-loop this process but never came up with a good solution. There are times when I have 100s of models to fit, and not having each model finish early once it is already at the "best" set of parameter values would be very time-consuming.

I suppose that if it were easily parallelised, that would offset this. Also, if it did a grid search sequentially, that would offset some of those problems, but it would be nice to do some speed tests on each implementation to see the differences.

I have also never got unreliable parameter values when fitting these models with reasonable parameter boundaries and >500 iterations, even with the cutoff imposed by count.

  2. It is not to save memory. I am only ever interested in the "best" model fit, so I do not save every model iteration; only the current "best" model is kept.

  3. This is correct. I ranked and returned the AICc in a previous iteration of the package, nlsLoop, as it gave a better comparison when comparing different models. Now that all model fits are returned, this is not necessary, so yes, I think it could be removed.
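For reference, the count logic is roughly this (a sketch of the idea, not the package's exact code; fit_from_random_start() is a hypothetical helper):

```r
best_aic <- Inf
count <- 0
for (i in seq_len(iter)) {
  fit <- try(fit_from_random_start(), silent = TRUE)
  if (!inherits(fit, "try-error") && AIC(fit) < best_aic) {
    best_aic <- AIC(fit)  # new best fit: keep it and reset the counter
    best_fit <- fit
    count <- 0
  } else {
    count <- count + 1
  }
  if (count >= 100) break  # no improvement in the last 100 fits, stop early
}
```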

Cheers Dan

mathesong commented 6 years ago

Hi Dan,

Haha, thanks!

1. The code to create the grid and the uniform random draws is as follows:


```r
# Data
iter <- 3

params_bds <- tibble::tribble(
  ~param, ~low.bds, ~high.bds,
   "lnc",      -10,       10L,
     "E",      0.1,        2L,
    "Eh",      0.5,        5L,
    "Th",      285,      330L
)

# Code to implement
if (length(iter) == 1) {

  # Shotgun: draw iter uniform values between each parameter's bounds
  strt <- purrr::map2(params_bds$low.bds, params_bds$high.bds, ~ runif(iter, .x, .y))
  names(strt) <- params_bds$param
  strt <- dplyr::bind_rows(strt)

} else if (length(iter) == nrow(params_bds)) {

  # Gridstart: iter[i] equally spaced values across each parameter's range
  params_bds$iter <- iter

  strt <- purrr::pmap(as.list(params_bds[, -1]),
                      function(low.bds, high.bds, iter) {
                        seq(from = low.bds, to = high.bds, length.out = iter)
                      })
  names(strt) <- params_bds$param

  strt <- expand.grid(strt)

} else {
  stop('iter should be of length 1 for the shotgun approach and of the same length as the number of parameters for the gridstart approach.')
}
```

This is a really good point re time constraints. With runif, convergence makes a load of sense. Perhaps an idea is to include a convergence_count argument which, when FALSE, fits all the iterations, and which, when set to e.g. 100, imposes a 100-fit count. For some models the nls procedure can take a while (in my package, for example, there are a few that take several seconds each), so 100 is perhaps not a perfect fit for every application. Then we could implement a couple of alternatives: if a convergence_count is set, the function uses a for-loop; if it is FALSE, it can use a purrr::map() call and then optionally parallelise. In the latter case, there could be an additional argument for whether to run in parallel and, if so, on how many cores. Then it would just remain to implement a warning for invalid combinations of arguments: you can't have parallel and convergence_count, and you can't have gridstart and convergence_count. Does that sound OK? It feels like this solution accommodates everything, though it does become a little more complex.
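The argument checking could look something like this (names are illustrative, not final; fit_multstart() is hypothetical):

```r
fit_multstart <- function(..., iter = 500, convergence_count = 100, parallel = FALSE) {
  gridstart <- length(iter) > 1
  # convergence_count only makes sense for a sequential shotgun for-loop
  if (parallel && !identical(convergence_count, FALSE)) {
    stop("convergence_count must be FALSE when fitting in parallel")
  }
  if (gridstart && !identical(convergence_count, FALSE)) {
    stop("convergence_count must be FALSE for the gridstart approach")
  }
  # if convergence_count is a number: for-loop with early stopping
  # if convergence_count is FALSE:    purrr::map() over all starts
}
```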

  2. Sounds reasonable. For the map() implementation, I would then re-fit the best model at the end. Maybe something for the future (not now, though) would be another function to look at the coefficients across starts, since for a specific application it might be worthwhile to know how sensitive the fit is to the starting parameters.

  3. Glad I understood it correctly! I'll remove it, then.

Cheers, G

PS: I just noticed the contributor badge - that's awesome! Thanks!

padpadpadpad commented 6 years ago

Closing this as all of these have been implemented.