Closed seananderson closed 7 years ago
Yes, I think it could be done in the function -- using the approach you suggested, with factor() nested in as.integer()
Okay, should be done with this commit: https://github.com/seananderson/glmmfields/commit/38e536191e7e2ae5bda5f0e240c3373c78ca3abd
Everything seems to work fine. I added a check that the station argument is still not in the function call because otherwise it gets passed on to Stan in the ..., which creates a confusing error.
I think I closed this too quickly. When I run the GLM vignette, the new version takes 4-5 times longer and it doesn't converge. The previous version converges quickly. If you step through format_data
the only difference comes down to whether the station IDs look like:
seq(1, nrow(data))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
[69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
[86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
or
as.numeric(as.factor(data[,station]))
[1] 27 96 62 54 45 88 39 33 19 21 52 28 23 69 3 72 38
[18] 47 84 92 32 95 75 70 4 40 51 59 71 91 63 48 57 7
[35] 30 43 22 85 16 82 58 66 20 64 35 73 44 98 97 74 29
[52] 24 61 31 56 80 18 46 50 86 93 90 68 94 55 60 36 37
[69] 2 53 87 1 10 17 79 76 99 49 11 65 78 15 42 25 6
[86] 41 8 26 5 67 34 13 9 89 77 83 100 14 12 81
Both have 100 unique stations and a single slice of time. It's just the order that is different. I wouldn't have thought that that would have mattered.
I assume this means the station IDs need to be ordered for the Stan indexing to work although I haven't wrapped my head around why that would be. Perhaps this also means that we should be creating the station IDs in order for the first time slice and then rearranging the data within each following time slices so that the station IDs are in ascending order?
As a quick solution I just changed as.factor for forcats::as_factor. That creates factors that are ordered by order of appearance instead of alphabetically. This make me wonder if the beetle example could be run much more efficiently now with correctly ordered station IDs. Still wish I had my head fully wrapped around this one.
I think this might be a combination of things on the R / Stan side. The stationID links responses to predicted values on the Stan side, but we also have the stations ordered in the distance matrix -- like on lines 32-50 of format_data?
Perhaps this also means that we should be creating the station IDs in order for the first time slice and then rearranging the data within each following time slices so that the station IDs are in ascending order? I just don't know how this'd work for cases where you don't have observations from the first station in the first time slice?
I wrote some unit tests to test this. See https://github.com/seananderson/glmmfields/blob/master/tests/testthat/test-stations.R
The first one where I scrambled the stations in the second time slice seems to be fine.
The second one where I remove a couple stations from the first time slice may be causing a problem. The predictions are fairly similar for most data points, but the mismatches can be up to 0.15, which is quite a bit when the data range between -0.4 and 0.4. For multiple fits to the same data set, as a point of comparison, the maximum difference tends to be around 0.01 or less with the vast majority around 0.004 different. So there may be a problem here.
This is all using the station IDs as stationID <- as.numeric(forcats::as_factor(data[,station]))
, which names the station IDs by first appearance. I assume if the problem is coming from anywhere, it's from the sorted index after removing duplicates and then calculating the distance matrix in the format_data
function.
Fixed with https://github.com/seananderson/glmmfields/commit/40eca12139d102490696d2b431b1a63c49dbce5c and a couple commits after.
There was a problem with creating the knot-sample distance matrix if a new station appeared in a subsequent time slice.
It occurred to me, while staring at the main function call, that the
station_id
argument may be redundant. Couldn't we just create that index internally by pasting together the latitude and longitude coordinates and turning that into a factor and then integer?This would also eliminate the possibility for the user to mess things up by changing the coordinate station ID matching somewhere in the data set.
Unless there is some situation I'm not thinking of, I suggest we remove the argument and add that column to the data frame within the package.