station_id argument is redundant?

seananderson commented 7 years ago

It occurred to me, while staring at the main function call, that the station_id argument may be redundant. Couldn't we just create that index internally by pasting together the latitude and longitude coordinates and turning that into a factor and then integer?

This would also eliminate the possibility for the user to mess things up by changing the coordinate station ID matching somewhere in the data set.

Unless there is some situation I'm not thinking of, I suggest we remove the argument and add that column to the data frame within the package.

ericward-noaa commented 7 years ago

Yes, I think it could be done in the function -- using the approach you suggested, with factor() nested in as.integer()

seananderson commented 7 years ago

Okay, should be done with this commit: https://github.com/seananderson/glmmfields/commit/38e536191e7e2ae5bda5f0e240c3373c78ca3abd

Everything seems to work fine. I added a check that the station argument is still not in the function call because otherwise it gets passed on to Stan in the ..., which creates a confusing error.

seananderson commented 7 years ago

I think I closed this too quickly. When I run the GLM vignette, the new version takes 4-5 times longer and it doesn't converge. The previous version converges quickly. If you step through format_data the only difference comes down to whether the station IDs look like:

 seq(1, nrow(data))
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
 [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
 [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
 [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
 [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
 [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

or

as.numeric(as.factor(data[,station]))
  [1]  27  96  62  54  45  88  39  33  19  21  52  28  23  69   3  72  38
 [18]  47  84  92  32  95  75  70   4  40  51  59  71  91  63  48  57   7
 [35]  30  43  22  85  16  82  58  66  20  64  35  73  44  98  97  74  29
 [52]  24  61  31  56  80  18  46  50  86  93  90  68  94  55  60  36  37
 [69]   2  53  87   1  10  17  79  76  99  49  11  65  78  15  42  25   6
 [86]  41   8  26   5  67  34  13   9  89  77  83 100  14  12  81

Both have 100 unique stations and a single slice of time. It's just the order that is different. I wouldn't have thought that that would have mattered.

I assume this means the station IDs need to be ordered for the Stan indexing to work although I haven't wrapped my head around why that would be. Perhaps this also means that we should be creating the station IDs in order for the first time slice and then rearranging the data within each following time slices so that the station IDs are in ascending order?

seananderson commented 7 years ago

As a quick solution I just changed as.factor for forcats::as_factor. That creates factors that are ordered by order of appearance instead of alphabetically. This make me wonder if the beetle example could be run much more efficiently now with correctly ordered station IDs. Still wish I had my head fully wrapped around this one.

ericward-noaa commented 7 years ago

I think this might be a combination of things on the R / Stan side. The stationID links responses to predicted values on the Stan side, but we also have the stations ordered in the distance matrix -- like on lines 32-50 of format_data?

Perhaps this also means that we should be creating the station IDs in order for the first time slice and then rearranging the data within each following time slices so that the station IDs are in ascending order? I just don't know how this'd work for cases where you don't have observations from the first station in the first time slice?

seananderson commented 7 years ago

I wrote some unit tests to test this. See https://github.com/seananderson/glmmfields/blob/master/tests/testthat/test-stations.R

The first one where I scrambled the stations in the second time slice seems to be fine.

The second one where I remove a couple stations from the first time slice may be causing a problem. The predictions are fairly similar for most data points, but the mismatches can be up to 0.15, which is quite a bit when the data range between -0.4 and 0.4. For multiple fits to the same data set, as a point of comparison, the maximum difference tends to be around 0.01 or less with the vast majority around 0.004 different. So there may be a problem here.

This is all using the station IDs as stationID <- as.numeric(forcats::as_factor(data[,station])), which names the station IDs by first appearance. I assume if the problem is coming from anywhere, it's from the sorted index after removing duplicates and then calculating the distance matrix in the format_data function.

seananderson commented 7 years ago

Fixed with https://github.com/seananderson/glmmfields/commit/40eca12139d102490696d2b431b1a63c49dbce5c and a couple commits after.

There was a problem with creating the knot-sample distance matrix if a new station appeared in a subsequent time slice.

seananderson / glmmfields

station_id argument is redundant? #2