Consistency between model code and dataset paths in info files

stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

181 stars 36 forks source link

Consistency between model code and dataset paths in info files #50

Closed eerolinna closed 3 years ago

eerolinna commented 5 years ago

In the model info files we have

"model_code": {
    "stan": "content/models/stan/8_schools_centered.stan"
},

which when concatenated with the path to posterior db gives the full filepath to the stan code file.

The dataset info files have

"data_file": "content/datasets/data/8_schools.json"

which does not give the full filepath when concatenated with the path to the posterior db. This is because a ".zip" needs to be added to the end.

I propose we change the dataset info files to

"data_file": "content/datasets/data/8_schools.json.zip"

eerolinna commented 4 years ago

This is now over 1 month old, any comments? @MansMeg @paul-buerkner

MansMeg commented 4 years ago

Im currently fixing everything for prototype draft. So I will answer asap.

MansMeg commented 4 years ago

I think this is probably a good idea. Although this is the current structure so I think we now add this as an improvement but not in the prototype.

eerolinna commented 4 years ago

This won't change the public API so we can do this later too, however I think it would be good to do this sooner than later. This is a quite small change so I could go ahead and prepare a PR for it soon.

However if you want more time to think if this is a good change we can also wait.

If we want to go ahead I think these are the required changes

Add .zip to data_file in every data info file
In https://github.com/MansMeg/posteriordb/blob/a6c90bbe1dd768a4c238253a09b8a453828cb05d/python/src/posteriordb/posterior_database.py#L68 remove the .zip
Remove https://github.com/MansMeg/posteriordb/blob/62c4540e6f8ebedf9b4a457350c3fba1ab9a408e/rpackage/R/pdb.R#L245 and https://github.com/MansMeg/posteriordb/blob/62c4540e6f8ebedf9b4a457350c3fba1ab9a408e/rpackage/R/pdb.R#L253

I think there might be some other small changes required but nothing too big.

MansMeg commented 4 years ago

So I just added you to the repository. Start a new branch and change it there. Then I can fix the R stuff in that branch and when the branch works, we can merge it. You could see if you could get it to work with the R code.

It's just not the highest priority right now for me, but in the long run, we want it to be done.

eerolinna commented 4 years ago

Great! I think even if I start a new branch in my fork you'll still be able to make fixes directly to it. I will start with that as then there is 0% chance of accidently pushing to master in this repo. If we run into problems we can always merge that PR to a separate branch in this repo and continue from there.

eerolinna commented 4 years ago

I propose a further change. In addition to having filenames ending with .zip like

"data_file": "content/datasets/data/8_schools.json.zip"

it is also possible to have

"data_file": "content/datasets/data/8_schools.json"

In this case the dataset file is not zipped in the posterior database but is just a normal JSON file.

I also suggest that most small files would be stored unzipped and only large files should be zipped.

eerolinna commented 4 years ago

Any comments on this?

MansMeg commented 4 years ago

I think it is easier to have all zipped for now?

MansMeg commented 3 years ago

I close this for now. If this becomes a problem down the line we solve that then.