Open sambrilleman opened 3 years ago
Yeah I see what you mean. I'm not opposed to the solution that you propose but another option could be to switch to using the column name .id
instead of id
to indicate that it's a metadata column (this is similar to what we're doing in the posterior package with column names like .chain
instead of chain
). Or maybe the name .row_id
would be even less likely to be confused with a variable in the data.
Summary:
For
stan_surv
models theid
column in the data frame returned byposterior_survfit()
represents the row of the covariate data that the predictions "belong" to. But theid
column can be confusing, especially when the nameid
was used for one of the variables (e.g. the grouping factor) in the model itself.The more explicit thing might be to return all of the covariate data used in the predictions as part of the data returned by
posterior_survfit
; that way the user knows exactly the covariate values that the row of predictions belongs to. Rather than having to infer it based on some "id" (i.e. row identifier) column.This is essentially what tripped up one user here.
Reproducible Steps:
Here we fit a model with a grouping factor called
id
with values5:10
, then we predict survival probabilities for the groups that hadid
values5
and6
which returns:
but we see that the
id
column in the predictions has values1
through10
. This is because it is referring to rows 1 through 10 of the prediction covariate data, not the values for theid
variable in the original model! It would be clearer / safer if we just returned all the covariates, includingid
, and then we wouldn't even need a row identifier.