stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference
180 stars 36 forks source link

Aliases for posteriors #99

Closed MansMeg closed 4 years ago

MansMeg commented 4 years ago

Now we need to specify (sometimes) lengthy names for posteriors we may want to access quickly. So we would like to have a shorter name/aliases in the pdb.

I suggest that we add alias as json objects in the posterior db with the following structure:

{
alias_for: "[posterior_name]"
}

and store it as a JSON in the posterior database under posteriors.

Will that work in Python @eerolinna ?

MansMeg commented 4 years ago

Now extra posteriors are added as spearate json files in hash: cb5b52f

eerolinna commented 4 years ago

So for example we might want eight_schools as an alias for eight_schools-eight_schools_noncentered?

So with this proposal posteriors/eight_schools-eight_schools_noncentered.json would be

{
  "name": "eight_schools-eight_schools_noncentered",
  "keywords": ["stan_benchmark"],
  "model_name": "eight_schools_noncentered",
  "gold_standard_name": "eight_schools-eight_schools_noncentered",
  "data_name": "eight_schools",
  "dimensions": {"theta":8, "mu":1, "tau":1},
  "added_by": "Mans Magnusson",
  "added_date": "2019-08-12"
}

while posteriors/eight_schools.json would be

{
  "alias_for": "eight_schools-eight_schools_noncentered"
}

Did I get this right?

My first question would be that do we actually need the long names for anything? So can we rename eight_schools-eight_schools_noncentered.json to eight_schools.json and eight_schools-eight_schools_centered.json to eight_schools_centered.json

MansMeg commented 4 years ago

Yes. You got this right.

In principle, yes, we can just rename posteriors. But we would also like to have long posterior names in the format so that people can access posterior based on data and models. eight_schools work well for that specific posterior, but not all. So we would like both, hence the alias.

Currently, it is just a duplicate. I do not want to break the python code.

eerolinna commented 4 years ago

What if our API had the ability to do this?

po <- posterior(data = "eight_schools", model = "eight_schools_centered", pdb = pd)
po

Would the long names still be needed?

MansMeg commented 4 years ago

So we would need some formal way of naming posteriors. And then I think combining model and data names is the only feasible solution.

The API could do like that and throw an error if they cannot be combined. Although the focus is the posterior objects rather than data and model. So I do not really see the point?

eerolinna commented 4 years ago

Why do we need a formal way of naming posteriors?

I feel that a combination of model and data name is not really a posterior name. What I mean is that posterior("eight_schools-eight_schools_centered", pdb = pd) is not really a posterior name but rather just a data name and a model name.

We could have it so that posterior names are optional. For example prideprejustice_chapter-ldaK5 might not need an explicit name.

{
  "keywords": ["multimodal"],
  "model_name": "ldaK5",
  "gold_standard_name": null,
  "data_name": "prideprejustice_chapter",
  "dimensions": {"theta":244, "phi":7470},
  "added_by": "Mans Magnusson",
  "added_date": "2019-08-12"
}

The posterior would be accessed with

po <- posterior(data = "prideprejustice_chapter", model = "ldaK5", pdb = pd)

We could make it possible to still use

po <- posterior("prideprejustice_chapter-ldaK5", pdb = pd)

This would internally be translated to the data and model form call.

A posterior like 8_schools that has a name that is not just combination of data and model would look like

{
  "name": "eight_schools",
  "keywords": ["stan_benchmark"],
  "model_name": "eight_schools_noncentered",
  "gold_standard_name": "eight_schools",
  "data_name": "eight_schools",
  "dimensions": {"theta":8, "mu":1, "tau":1},
  "added_by": "Mans Magnusson",
  "added_date": "2019-08-12"
}

All of these would be fine for accessing the posterior

po <- posterior(data = "eight_schools", model = "eight_schools_centered", pdb = pd)
po <- posterior("eight_schools", pdb = pd)
po <- posterior("eight_schools-eight_schools_centered", pdb = pd)

This is just one way of addressing this use case, there might be better ways to do it.

In any case I feel like this use case is telling us that we need to change something at the design level to properly address it. To me the alias posterior approach doesn't feel like a change in the underlying design but rather a way of trying to fit the use case in the current design.

MansMeg commented 4 years ago

I do think that using data and model may also be a possibility. Although, a posterior is the combination of model (prior and likelihood) and data so it is not strange. The alias is just a way to solve the fact that that we may want to use multiple names for the same posterior.

eerolinna commented 4 years ago

One option could also be to have

{
  "names": ["eight_schools", "eight_schools-eight_schools_noncentered"],
  "model_name": "eight_schools_noncentered",
  "gold_standard_name": "eight_schools",
  "data_name": "eight_schools",
}

or

{
  "name": "eight_schools",
  "aliases": ["eight_schools-eight_schools_noncentered"],
  "model_name": "eight_schools_noncentered",
  "gold_standard_name": "eight_schools",
  "data_name": "eight_schools",
}

In any case, what should the semantics for posterior_names be?

Should the return value of

pn <- posterior_names(pd)

include just eight_schools (or whatever the primary name is) or also the aliases, in this case eight_schools-eight_schools_noncentered?

For discoverability it would be good to include also the aliases. For rerunning all gold standards in CI we wouldn't want the aliases. Perhaps it would be good to have posterior_names(pd, include_aliases = true) or something similar.

Should models and datasets also have aliases?

MansMeg commented 4 years ago

In principle I like your idea. The reason why I suggest we go for the alias file as a first step is that we otherwise need to index this by first reading all posterior jsons. Thats a small thing if we do it locally, but GitHub only allows 60 calls per hour if you do not use a PAT. Hence, this solution is a quick fix that should probably be handled as you say later on when we get up a real database.

eerolinna commented 4 years ago

Won't the current solution of having eight_schools.json and eight_schools-eight_schools_noncentered.json that are essentially duplicates be fine as a quick fix?

MansMeg commented 4 years ago

Yes, that's actually true.

MansMeg commented 4 years ago

I now repoen this issue, since Ben Bales had a problem with this in issue #138. I think you suggestion on having an include_alias = FALSE for posterior_names() is very reasonable.

I think the simplest approach now is to essentially add "alias" in the file name as "8schools.alias.json". Then it is possible to sort them out based on the file name and don't return duplicated posteriors.

eerolinna commented 4 years ago

I guess you mean https://github.com/MansMeg/posteriordb/issues/139 instead of https://github.com/MansMeg/posteriordb/issues/138?

Lets say we have

po <- posterior("eight_schools-eight_schools_noncentered", pdb = pd)
alias <- posterior("eight_schools", pdb = pd)

Does it always need to be that

Does there need to be

What about the name of the posterior? I'm not sure if the R library provides a way to get name of posterior so I'll switch to python for this

po = pd.posterior("eight_schools-eight_schools_noncentered")
alias = pd.posterior("eight_schools")

po.name is "eight_schools-eight_schools_noncentered". Should alias.name be "eight_schools" or "eight_schools-eight_schools_noncentered"?

MansMeg commented 4 years ago

Yes, the stan_code data and info should be identical.

The name of the alias should probably be eight_schools, but model info etc should be identical to the eight_schools-eight_schools_noncentered.

In a way there is a need to be able to see from the file name if it is an alias or not. So maybe the easiest would be to just call it eight_schools.alias.json in posterior.

Would that work fir the python inplementation as well?

The functionality alias() and alias_for() is good to have but not need to have.

eerolinna commented 4 years ago

Let me clarify I understand the .alias idea correctly. Lets say posteriors/eight_schools-eight_schools_noncentered.json is

{
  "name": "eight_schools-eight_schools_noncentered",
  "keywords": [
    "stan_benchmark"
  ],
  "model_name": "eight_schools_noncentered",
  "reference_posterior_name": "eight_schools-eight_schools_noncentered",
  "data_name": "eight_schools",
  "dimensions": {
    "theta": 8,
    "mu": 1,
    "tau": 1
  },
  "added_by": "Mans Magnusson",
  "added_date": "2019-08-12"
}

then would posteriors/eight_schools.alias.json be this?

{
  "name": "eight_schools",
  "keywords": [
    "stan_benchmark"
  ],
  "model_name": "eight_schools_noncentered",
  "reference_posterior_name": "eight_schools-eight_schools_noncentered",
  "data_name": "eight_schools",
  "dimensions": {
    "theta": 8,
    "mu": 1,
    "tau": 1
  },
  "added_by": "Mans Magnusson",
  "added_date": "2019-08-12"
}

So essentially the same but the name is changed.

I'm a bit worried that the two files could get out of sync. Lets say a PR adds keyword "hierarchical_model" to posteriors/eight_schools-eight_schools_noncentered.json but forgets to add it to the alias file.

Of course we can try to just be very careful when accepting PRs, but that leaves chance for human error. It also makes reviewing PRs more tedious because you have to manually check that the aliases are kept in sync.

One way of making that situation never occur is by changing eight_schools.alias.json to

{
"alias_for": "eight_schools-eight_schools_noncentered"
}

(it could have a name field or the name could just be obtained from the file name) Maybe this is what you had in mind already.

This solution would be fine in the python implementation. It would be a bit annoying that when looking for posterior eight_schools you would first have to see if posteriors/eight_schools.json exists and if not you would try posteriors/eight_schools.alias.json and then find eight_schools-eight_schools_noncentered from there and read posteriors/eight_schools-eight_schools_noncentered.json, but it would be okay.

Another way would be just to have an aliases.json file somewhere that would contain something like

{
  "eight_schools": "eight_schools-eight_schools_noncentered",
  "other_alias": "other_posterior",
  "...": "..."
}

So this one file would contain all the aliases. When looking for posterior eight_schools you would first check if posteriors/eight_schools.json exists and if not you would see if eight_schools is present in the alias file. When calling posterior_names(include_alias = FALSE) you would do no extra work. When calling posterior_names(include_alias = TRUE) you would append the keys from the aliases.json file to the result.

I probably would prefer the aliases.json implementation but having eight_schools.alias.json that contains alias_for key would be fine too, just more work from the implementation side.

eerolinna commented 4 years ago

However I feel that the current naming scheme for posteriors should be changed. 95% of the current posteriors follow a naming scheme like nes2000-nes, sir-sir, mesquite-logmesquite etc. where the model and data either have the same name or one of the names has an additional prefix or suffix but is otherwise the same. Why not name these nes2000, sir and logmesquite? So essentially when the model and the data have parts of the name in common, choose the longer name as the name for the posterior.

Posteriors like rstan_downloads-prophet could remain as they are now.

This would eliminate most of the need for aliases I think? If there's something that still needs aliases we would implement them.

As a part of this change I would suggest we add

po <- posterior(data = "nes2000", model = "nes", pdb = pd)

as an alternative way of constructing a posterior object. This is to cover the use cases where you have a data name and a model name and want to see if there's a posterior that uses them.

Or it could also have a different name like

po <- posterior_from_model_and_data(data = "nes2000", model = "nes", pdb = pd)

Currently this functionality would require reading every posterior file and seeing if the model and data match, which is not good. There's a way around this. We can generate an index file to make it easy to lookup posteriors by data and model name. The file could look something like this

{
  "data_name1": {
      "model_name1": "posterior_name1",
      "model_name2": "posterior_name2"
    },
  "data_name2": "..."
}

The key data_name1 would contain the model names that share a posterior with data_name1. The model name would map to the posterior name that they share.

This would make it efficient to lookup posteriors based on model and data name. The file could be autogenerated by a script quite easily and we could easily enforce in CI that the file is kept up-to-date when adding new posteriors.

MansMeg commented 4 years ago

I think your alias idea with a separate alias file would be the best and quickest solution. Shortening the names, as you suggest, would be another approach. Although the alias file would make this easy for now. Also this would not break python - python would just miss this feature for now.

The other stuff you suggest I need to think of. They are additional features rather than aliases. So I suggest they get theor own issues.

eerolinna commented 4 years ago

Hmm, if we shortened the names would we still want some aliases? Like if eight_schools-eight_schools_noncentered gets shortened to eight_schools_noncentered would we still want the aliases eight_schools and/or 8_schools?

If that is the case then implementing the alias file now sounds good. It does break python in a sense ( posterior sir would be accessable from R but not python), but adding support for it in python should be quite easy.

Shortening the names

How do you feel about shortening the names? To me it feels like there's essentially no downside to it, but maybe you see something? The only negative thing about it is that it is a breaking change. Yet we are still in beta so doing breaking changes now would be better than doing them later.

Keeping long names

If there's a reason to keep the long names I would suggest a change that achieves the following

This would not be a breaking change. The actual change would be

Other stuff

By other stuff you mean having

po <- posterior(data = "nes2000", model = "nes", pdb = pd)

and making this fast with the index file? Yea sure I can make a separate issue for these.

MansMeg commented 4 years ago

I actually don't have any good reason for the long names right now in mind, but I have a vague feeling there was a reason for it early on. That's why I'm a little hesitant. I think the reason was that the posterior is made up by data and model so it is a quick way to see data and model. The main problem right now is to get rid of the duplicates due to the current aliases.

So if we look at the current alias, they are actually partly about shortening names and partly about different alias names. So I think we need the alias files anyway.

Given that we want alias, I think keeping the long names is the easiest way forward to solve this, and adding short names as aliases in the alias file. This would solve this issue. Then I think the discussion on the shortening of names can be set up as its own issue, but something that doesn't need to be fixed urgently.

eerolinna commented 4 years ago

Sounds good!

MansMeg commented 4 years ago

Alright. Now this is implemented.