Closed MansMeg closed 4 years ago
Now extra posteriors are added as spearate json files in hash: cb5b52f
So for example we might want eight_schools
as an alias for eight_schools-eight_schools_noncentered
?
So with this proposal posteriors/eight_schools-eight_schools_noncentered.json
would be
{
"name": "eight_schools-eight_schools_noncentered",
"keywords": ["stan_benchmark"],
"model_name": "eight_schools_noncentered",
"gold_standard_name": "eight_schools-eight_schools_noncentered",
"data_name": "eight_schools",
"dimensions": {"theta":8, "mu":1, "tau":1},
"added_by": "Mans Magnusson",
"added_date": "2019-08-12"
}
while posteriors/eight_schools.json
would be
{
"alias_for": "eight_schools-eight_schools_noncentered"
}
Did I get this right?
My first question would be that do we actually need the long names for anything? So can we rename eight_schools-eight_schools_noncentered.json
to eight_schools.json
and eight_schools-eight_schools_centered.json
to eight_schools_centered.json
Yes. You got this right.
In principle, yes, we can just rename posteriors. But we would also like to have long posterior names in the format so that people can access posterior based on data and models. eight_schools work well for that specific posterior, but not all. So we would like both, hence the alias.
Currently, it is just a duplicate. I do not want to break the python code.
What if our API had the ability to do this?
po <- posterior(data = "eight_schools", model = "eight_schools_centered", pdb = pd)
po
Would the long names still be needed?
So we would need some formal way of naming posteriors. And then I think combining model and data names is the only feasible solution.
The API could do like that and throw an error if they cannot be combined. Although the focus is the posterior objects rather than data and model. So I do not really see the point?
Why do we need a formal way of naming posteriors?
I feel that a combination of model and data name is not really a posterior name. What I mean is that posterior("eight_schools-eight_schools_centered", pdb = pd)
is not really a posterior name but rather just a data name and a model name.
We could have it so that posterior names are optional. For example prideprejustice_chapter-ldaK5
might not need an explicit name.
{
"keywords": ["multimodal"],
"model_name": "ldaK5",
"gold_standard_name": null,
"data_name": "prideprejustice_chapter",
"dimensions": {"theta":244, "phi":7470},
"added_by": "Mans Magnusson",
"added_date": "2019-08-12"
}
The posterior would be accessed with
po <- posterior(data = "prideprejustice_chapter", model = "ldaK5", pdb = pd)
We could make it possible to still use
po <- posterior("prideprejustice_chapter-ldaK5", pdb = pd)
This would internally be translated to the data and model form call.
A posterior like 8_schools
that has a name that is not just combination of data and model would look like
{
"name": "eight_schools",
"keywords": ["stan_benchmark"],
"model_name": "eight_schools_noncentered",
"gold_standard_name": "eight_schools",
"data_name": "eight_schools",
"dimensions": {"theta":8, "mu":1, "tau":1},
"added_by": "Mans Magnusson",
"added_date": "2019-08-12"
}
All of these would be fine for accessing the posterior
po <- posterior(data = "eight_schools", model = "eight_schools_centered", pdb = pd)
po <- posterior("eight_schools", pdb = pd)
po <- posterior("eight_schools-eight_schools_centered", pdb = pd)
This is just one way of addressing this use case, there might be better ways to do it.
In any case I feel like this use case is telling us that we need to change something at the design level to properly address it. To me the alias posterior approach doesn't feel like a change in the underlying design but rather a way of trying to fit the use case in the current design.
I do think that using data and model may also be a possibility. Although, a posterior is the combination of model (prior and likelihood) and data so it is not strange. The alias is just a way to solve the fact that that we may want to use multiple names for the same posterior.
One option could also be to have
{
"names": ["eight_schools", "eight_schools-eight_schools_noncentered"],
"model_name": "eight_schools_noncentered",
"gold_standard_name": "eight_schools",
"data_name": "eight_schools",
}
or
{
"name": "eight_schools",
"aliases": ["eight_schools-eight_schools_noncentered"],
"model_name": "eight_schools_noncentered",
"gold_standard_name": "eight_schools",
"data_name": "eight_schools",
}
In any case, what should the semantics for posterior_names
be?
Should the return value of
pn <- posterior_names(pd)
include just eight_schools
(or whatever the primary name is) or also the aliases, in this case eight_schools-eight_schools_noncentered
?
For discoverability it would be good to include also the aliases. For rerunning all gold standards in CI we wouldn't want the aliases. Perhaps it would be good to have posterior_names(pd, include_aliases = true)
or something similar.
Should models and datasets also have aliases?
In principle I like your idea. The reason why I suggest we go for the alias file as a first step is that we otherwise need to index this by first reading all posterior jsons. Thats a small thing if we do it locally, but GitHub only allows 60 calls per hour if you do not use a PAT. Hence, this solution is a quick fix that should probably be handled as you say later on when we get up a real database.
Won't the current solution of having eight_schools.json
and eight_schools-eight_schools_noncentered.json
that are essentially duplicates be fine as a quick fix?
Yes, that's actually true.
I now repoen this issue, since Ben Bales had a problem with this in issue #138. I think you suggestion on having an include_alias = FALSE
for posterior_names()
is very reasonable.
I think the simplest approach now is to essentially add "alias" in the file name as "8schools.alias.json". Then it is possible to sort them out based on the file name and don't return duplicated posteriors.
I guess you mean https://github.com/MansMeg/posteriordb/issues/139 instead of https://github.com/MansMeg/posteriordb/issues/138?
Lets say we have
po <- posterior("eight_schools-eight_schools_noncentered", pdb = pd)
alias <- posterior("eight_schools", pdb = pd)
Does it always need to be that
stan_code(po) == stan_code(alias)
pdb_data(po) == pdb_data(alias)
data_info(po) == data_info(alias)
model_info
etc. Does there need to be
aliases(po)
, returns list of aliases for po
alias_of(alias)
, returns the posterior that alias
is alias ofWhat about the name of the posterior? I'm not sure if the R library provides a way to get name of posterior so I'll switch to python for this
po = pd.posterior("eight_schools-eight_schools_noncentered")
alias = pd.posterior("eight_schools")
po.name
is "eight_schools-eight_schools_noncentered"
. Should alias.name
be "eight_schools"
or "eight_schools-eight_schools_noncentered"
?
Yes, the stan_code data and info should be identical.
The name of the alias should probably be eight_schools, but model info etc should be identical to the eight_schools-eight_schools_noncentered.
In a way there is a need to be able to see from the file name if it is an alias or not. So maybe the easiest would be to just call it eight_schools.alias.json in posterior.
Would that work fir the python inplementation as well?
The functionality alias() and alias_for() is good to have but not need to have.
Let me clarify I understand the .alias
idea correctly. Lets say posteriors/eight_schools-eight_schools_noncentered.json
is
{
"name": "eight_schools-eight_schools_noncentered",
"keywords": [
"stan_benchmark"
],
"model_name": "eight_schools_noncentered",
"reference_posterior_name": "eight_schools-eight_schools_noncentered",
"data_name": "eight_schools",
"dimensions": {
"theta": 8,
"mu": 1,
"tau": 1
},
"added_by": "Mans Magnusson",
"added_date": "2019-08-12"
}
then would posteriors/eight_schools.alias.json
be this?
{
"name": "eight_schools",
"keywords": [
"stan_benchmark"
],
"model_name": "eight_schools_noncentered",
"reference_posterior_name": "eight_schools-eight_schools_noncentered",
"data_name": "eight_schools",
"dimensions": {
"theta": 8,
"mu": 1,
"tau": 1
},
"added_by": "Mans Magnusson",
"added_date": "2019-08-12"
}
So essentially the same but the name is changed.
I'm a bit worried that the two files could get out of sync. Lets say a PR adds keyword "hierarchical_model"
to posteriors/eight_schools-eight_schools_noncentered.json
but forgets to add it to the alias file.
Of course we can try to just be very careful when accepting PRs, but that leaves chance for human error. It also makes reviewing PRs more tedious because you have to manually check that the aliases are kept in sync.
One way of making that situation never occur is by changing eight_schools.alias.json
to
{
"alias_for": "eight_schools-eight_schools_noncentered"
}
(it could have a name field or the name could just be obtained from the file name) Maybe this is what you had in mind already.
This solution would be fine in the python implementation. It would be a bit annoying that when looking for posterior eight_schools
you would first have to see if posteriors/eight_schools.json
exists and if not you would try posteriors/eight_schools.alias.json
and then find eight_schools-eight_schools_noncentered
from there and read posteriors/eight_schools-eight_schools_noncentered.json
, but it would be okay.
Another way would be just to have an aliases.json
file somewhere that would contain something like
{
"eight_schools": "eight_schools-eight_schools_noncentered",
"other_alias": "other_posterior",
"...": "..."
}
So this one file would contain all the aliases. When looking for posterior eight_schools
you would first check if posteriors/eight_schools.json
exists and if not you would see if eight_schools
is present in the alias file. When calling posterior_names(include_alias = FALSE)
you would do no extra work. When calling posterior_names(include_alias = TRUE)
you would append the keys from the aliases.json
file to the result.
I probably would prefer the aliases.json
implementation but having eight_schools.alias.json
that contains alias_for
key would be fine too, just more work from the implementation side.
However I feel that the current naming scheme for posteriors should be changed. 95% of the current posteriors follow a naming scheme like nes2000-nes
, sir-sir
, mesquite-logmesquite
etc. where the model and data either have the same name or one of the names has an additional prefix or suffix but is otherwise the same. Why not name these nes2000
, sir
and logmesquite
? So essentially when the model and the data have parts of the name in common, choose the longer name as the name for the posterior.
Posteriors like rstan_downloads-prophet
could remain as they are now.
This would eliminate most of the need for aliases I think? If there's something that still needs aliases we would implement them.
As a part of this change I would suggest we add
po <- posterior(data = "nes2000", model = "nes", pdb = pd)
as an alternative way of constructing a posterior object. This is to cover the use cases where you have a data name and a model name and want to see if there's a posterior that uses them.
Or it could also have a different name like
po <- posterior_from_model_and_data(data = "nes2000", model = "nes", pdb = pd)
Currently this functionality would require reading every posterior file and seeing if the model and data match, which is not good. There's a way around this. We can generate an index file to make it easy to lookup posteriors by data and model name. The file could look something like this
{
"data_name1": {
"model_name1": "posterior_name1",
"model_name2": "posterior_name2"
},
"data_name2": "..."
}
The key data_name1
would contain the model names that share a posterior with data_name1
. The model name would map to the posterior name that they share.
This would make it efficient to lookup posteriors based on model and data name. The file could be autogenerated by a script quite easily and we could easily enforce in CI that the file is kept up-to-date when adding new posteriors.
I think your alias idea with a separate alias file would be the best and quickest solution. Shortening the names, as you suggest, would be another approach. Although the alias file would make this easy for now. Also this would not break python - python would just miss this feature for now.
The other stuff you suggest I need to think of. They are additional features rather than aliases. So I suggest they get theor own issues.
Hmm, if we shortened the names would we still want some aliases? Like if eight_schools-eight_schools_noncentered
gets shortened to eight_schools_noncentered
would we still want the aliases eight_schools
and/or 8_schools
?
If that is the case then implementing the alias file now sounds good. It does break python in a sense ( posterior sir
would be accessable from R but not python), but adding support for it in python should be quite easy.
How do you feel about shortening the names? To me it feels like there's essentially no downside to it, but maybe you see something? The only negative thing about it is that it is a breaking change. Yet we are still in beta so doing breaking changes now would be better than doing them later.
If there's a reason to keep the long names I would suggest a change that achieves the following
posterior_names()
would show the long names but not the short names (I assume include_aliases = FALSE
would be the default)posterior_names()
would show the short names but not the long names, which I feel is betterThis would not be a breaking change. The actual change would be
sir-sir
to sir
but keep rstan_downloads-prophet
as is)sir-sir
as alias for sir
)By other stuff you mean having
po <- posterior(data = "nes2000", model = "nes", pdb = pd)
and making this fast with the index file? Yea sure I can make a separate issue for these.
I actually don't have any good reason for the long names right now in mind, but I have a vague feeling there was a reason for it early on. That's why I'm a little hesitant. I think the reason was that the posterior is made up by data and model so it is a quick way to see data and model. The main problem right now is to get rid of the duplicates due to the current aliases.
So if we look at the current alias, they are actually partly about shortening names and partly about different alias names. So I think we need the alias files anyway.
Given that we want alias, I think keeping the long names is the easiest way forward to solve this, and adding short names as aliases in the alias file. This would solve this issue. Then I think the discussion on the shortening of names can be set up as its own issue, but something that doesn't need to be fixed urgently.
Sounds good!
Alright. Now this is implemented.
Now we need to specify (sometimes) lengthy names for posteriors we may want to access quickly. So we would like to have a shorter name/aliases in the pdb.
I suggest that we add alias as json objects in the posterior db with the following structure:
and store it as a JSON in the posterior database under posteriors.
Will that work in Python @eerolinna ?