Add information about the inputs of the models

mandel commented 4 years ago

Thank you for your nice project!

For some posteriors, the data contains more entries than what is expected for the model. For example in kidiq-kidscore_momiq:

the model (kidscore_momiq) expects N, kid_score, and mom_iq
but the data (kidiq) also provides mom_hs, mom_hs_new, and mom_iq_new.

In order to pass automatically the correct inputs for the inference, I find useful to have information about the expected inputs of the model.

In this pull request, I have added for each Stan model in the model.info.json file an additional field with the information about the expected data. For example, on the kidscore_momiq model, now there is the following field:

  "inputs": { "N": { "type": "int", "dimensions": 0},
              "kid_score": { "type": "real", "dimensions": 1},
              "mom_iq": { "type": "real", "dimensions": 1} },

I also need the expected type to convert the data into the right format. So, I have associated this information to each input.

Thank you for considering this PR. I am happy to change the format if you want.

Best

MansMeg commented 4 years ago

I think this is excellent! I have been thinking about doing a similar thing to the documentation but also enable som descriptions of each input (in addition to the stuff you propose).

It seems like you have a script to generate these things from the stan code? Is it possible to extract this from the Stan code?

The two problems I see is that we, if we add this, need to check that the documentation is consistent with the stan code, that why I ask if there is an automated script behind to extract this.

@paul-buerkner and @avehtari do you have any thoughts on this?

mandel commented 4 years ago

Thanks @MansMeg.

Yes, this information was extracted automatically. I did a small hack in the Stan compiler to add this field to the json. I can share the code (but for now, it is really a hack).

MansMeg commented 4 years ago

Great!

I’m wondering how to include this then.

I guess the description of the variables should probably be on the dataset side. But probably with the same structure. So I guess this is probably a good way to do it.

The name input make sense. Then we can add that as a slot with default value NULL and see if we can find a way to update this automatically further down the line.

But lets see if @vehtari or @paul-Buerkner has any thoughts befor I merge.

mandel commented 4 years ago

The code which updates the *.info.json files is in the info branch of the fork of the Stan compiler which is on my github account (https://github.com/mandel/stanc3/tree/info). It makes the following assumptions:

For a model file m.stan is going to update the info file ../m.info.json.
It adds the inputs field before the model_implementations field.

It has been developed as a one time thing. I could cleanup the code if you think that is can be useful. I can also compute other information on the model if you want.

MansMeg commented 4 years ago

I think both those assumptions are reasonable and should work well.

I think in the long run we would like to be able to reuse that script. The question is just where to put it.

MansMeg commented 4 years ago

Hi again!

So now we have discussed and since this information is already available in the data block, we think it is better to have a script to access it rather than the actual results. In this way, we would not store any information that could fall out of sync (due to changes in models).

Hence it would be great if you could contribute the script, rather than the JSON information - if that would be ok for you?

mandel commented 4 years ago

HI,

I understand your point of view.

I am happy to cleanup the code such that it emits the information. But I don't know how to contribute this code (it is a fork of the Stan compiler). Do you have anything in mind?

MansMeg commented 4 years ago

Hi. I close this PR and continue this discussion as an issue instead.

I don't have any quick solution. Maybe the simplest is that we just refer to your repo if you have a README specifying how this can be done in a simple way?

stan-dev / posteriordb

Add information about the inputs of the models #192