stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

Enable Python to access posteriors as the R package #115

Closed. MansMeg closed this issue 3 years ago.

eerolinna commented 4 years ago

What does this include? The remote database?

MansMeg commented 4 years ago

Yes. Sorry for the vague description. The R package can now access everything from GitHub through the GitHub API. So I just specify a pdb to be a pdb_github and then everything works the same way. In the longer run we would like to use a real database, but GitHub works fine for now during the beta.

eerolinna commented 4 years ago

I'm unassigning myself from this issue. I won't work on this project forever, and I think it would be good if we find someone who can work on the Python side sooner rather than later. (This is not in response to anything; I just think it would be good to find someone who can maintain the Python side, be that you or someone else.)

I'm available to help with this though.

ahartikainen commented 3 years ago

Ok, I think I know how to do this. At least the main block.

Download links

This call gets a dictionary containing file information (keys are paths in posterior_database; values are metadata dicts). Each item has a "download_url" that can be used to download the file from GitHub.

import requests

def get_content(url):
    """Recursively collect file metadata from the GitHub contents API.

    Returns a dict mapping repository paths to metadata dicts.
    """
    response = requests.get(url)
    contents = {}
    for content in response.json():
        try:
            if content["type"] == "dir":
                # Recurse into subdirectories via their own contents URL
                contents.update(get_content(content["_links"]["self"]))
            else:
                contents[content["path"]] = content
        except (KeyError, TypeError):
            # Skip unexpected entries but print them for debugging
            print(content)
            continue
    return contents

url = "https://api.github.com/repos/MansMeg/posteriordb/contents/posterior_database?ref=master"
links = get_content(url)

Then, to download a specific file or all files, one needs to go through the dictionary, filtering by key as needed (see the sketch after the metadata example below). Each entry has a sha which can be used to decide whether an existing local file should be overwritten.

e.g. key = 'posterior_database/alias/posteriors.json'

{
 'name': 'posteriors.json',
 'path': 'posterior_database/alias/posteriors.json',
 'sha': '47f01b7be6855f9451c70426230ef68db16f982e',
 'size': 518,
 'url': 'https://api.github.com/repos/MansMeg/posteriordb/contents/posterior_database/alias/posteriors.json?ref=master',
 'html_url': 'https://github.com/MansMeg/posteriordb/blob/master/posterior_database/alias/posteriors.json',
 'git_url': 'https://api.github.com/repos/MansMeg/posteriordb/git/blobs/47f01b7be6855f9451c70426230ef68db16f982e',
 'download_url': 'https://raw.githubusercontent.com/MansMeg/posteriordb/master/posterior_database/alias/posteriors.json',
 'type': 'file',
 '_links': {
    'self': 'https://api.github.com/repos/MansMeg/posteriordb/contents/posterior_database/alias/posteriors.json?ref=master',
    'git': 'https://api.github.com/repos/MansMeg/posteriordb/git/blobs/47f01b7be6855f9451c70426230ef68db16f982e',
    'html': 'https://github.com/MansMeg/posteriordb/blob/master/posterior_database/alias/posteriors.json'
  }
}
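
For example, to restrict the listing to the alias files before downloading, one could filter the dictionary by key prefix (a minimal sketch; the prefix is just an illustration):

alias_files = {
    key: item for key, item in links.items()
    if key.startswith("posterior_database/alias/")
}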

Download file

Simple example of a file download. This should use verify=True (the certificate issue needs to be solved; Windows 10 had problems).

import tempfile
import os

def download_file(url, path):
    """Download url to path, creating parent directories as needed."""
    parent_dir = os.path.dirname(path)
    os.makedirs(parent_dir, exist_ok=True)
    try:
        # TODO: switch to verify=True once the certificate issue is solved
        r = requests.get(url, verify=False)
        with open(path, mode='wb') as f:
            f.write(r.content)
    except Exception as e:
        print(e)
        return False
    return True

So, a simple example to download posteriors.json:

root_handle = tempfile.TemporaryDirectory(suffix="posterior_database")
root = root_handle.name

posteriors_key = 'posterior_database/alias/posteriors.json'
posteriors_path = os.path.join(root, posteriors_key)
posteriors_url = links[posteriors_key]["download_url"]

res = download_file(posteriors_url, posteriors_path)

Then one needs to know which files are required when the database is created (or updated); I think those files can then be downloaded on demand. E.g. the sha helps to decide whether a file needs to be updated.
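
As a sketch of that sha check: the "sha" field returned by the contents API is the git blob SHA-1, which can be recomputed locally to decide whether a cached file is stale (git_blob_sha and needs_update are hypothetical helper names):

import hashlib
import os

def git_blob_sha(path):
    # git blob hash: sha1(b"blob <size>\0" + file content)
    with open(path, "rb") as f:
        content = f.read()
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

def needs_update(path, remote_sha):
    # Re-download only if the file is missing or changed upstream
    return not os.path.exists(path) or git_blob_sha(path) != remote_sha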

ahartikainen commented 3 years ago

The default database location could be in $HOME/.posteriordb/posterior_database which then could be updated when needed.
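
A small sketch of how a repository key could map into that default location (local_path is a hypothetical helper; os.path.expanduser handles $HOME portably):

import os

ROOT = os.path.expanduser("~/.posteriordb/posterior_database")

def local_path(key, root=ROOT):
    # 'posterior_database/alias/posteriors.json'
    # -> $HOME/.posteriordb/posterior_database/alias/posteriors.json
    relative = key.split("/", 1)[1]  # drop the leading 'posterior_database/'
    return os.path.join(root, relative)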

MansMeg commented 3 years ago

This looks nice to me.

ahartikainen commented 3 years ago

We can probably even assume that the main files have "static" URLs, so there is no need to query each file through the API manually; the API would only be used as a fallback when a static URL doesn't return the correct values (e.g. if something changes on the GitHub side in the future).
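
A sketch of such a static URL, assuming the raw.githubusercontent.com pattern seen in the download_url values above stays stable (static_url is a hypothetical helper):

def static_url(key, ref="master"):
    # Mirrors the "download_url" values returned by the contents API;
    # fall back to the API call if this pattern ever stops matching.
    return f"https://raw.githubusercontent.com/MansMeg/posteriordb/{ref}/{key}"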

MansMeg commented 3 years ago

Yes indeed.

ahartikainen commented 3 years ago

I think we can close this now

eerolinna commented 3 years ago

One suggestion I have here is to key the cache by file hash instead of path.

The current approach

I'll summarise what I understood the current plan to be, to make sure I didn't misunderstand anything.

posterior_database/alias/posteriors.json would be downloaded locally to $HOME/.posteriordb/posterior_database/alias/posteriors.json

If posteriors.json gets updated the new version would overwrite the old version in the cache

Keying by the hash

posterior_database/alias/posteriors.json with sha 47f01b7be6855f9451c70426230ef68db16f982e would be downloaded to $HOME/.posteriordb/posterior_database/47f01b7be6855f9451c70426230ef68db16f982e

If posteriors.json gets updated the new version with sha new_hash would be stored in $HOME/.posteriordb/posterior_database/new_hash which does not overwrite the old version.

The updated code to download posteriors.json would be

root = os.path.expanduser('~/.posteriordb/posterior_database')
posteriors_key = 'posterior_database/alias/posteriors.json'
posteriors_hash = links[posteriors_key]["sha"]
posteriors_path = os.path.join(root, posteriors_hash)
posteriors_url = links[posteriors_key]["download_url"]

res = download_file(posteriors_url, posteriors_path)

This would allow many projects that might use different posteriordb versions to share the same cache.

ahartikainen commented 3 years ago

Sure, but how would different posteriordb versions know what hash to use?

MansMeg commented 3 years ago

I also think this might take unnecessary space. If we download a whole database for each hash we might download large (unchanged files) even for small updates of the README.md file?

eerolinna commented 3 years ago

> I also think this might take unnecessary space. If we download a whole database for each hash we might download large (unchanged files) even for small updates of the README.md file?

I mean using the file hash, not the commit hash.

Small updates of the README.md would only add the new version of README.md to the cache, everything else would remain as is.

> Sure, but how would different posteriordb versions know what hash to use?

The best way to me seems to be to persist the links to the project directory.

So you'd have something like posterior-hashes.json

{
  "posterior_database/alias/posteriors.json": {
    "hash": "47f01b7be6855f9451c70426230ef68db16f982e",
    "download_url": "permalink"
  },
  "..."
}

posteriordb is basically a package manager for posteriors. This file would be analogous to a lock file, like package-lock.json for npm.
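
As a hedged sketch, such a lock file could be generated from the links dictionary built earlier in this thread (write_posterior_hashes is a hypothetical name):

import json

def write_posterior_hashes(links, path="posterior-hashes.json"):
    # Reduce the full GitHub metadata to lock-file fields:
    # the file hash plus a download URL pinned to that version
    entries = {
        key: {"hash": item["sha"], "download_url": item["download_url"]}
        for key, item in links.items()
    }
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)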

ahartikainen commented 3 years ago

> I also think this might take unnecessary space. If we download a whole database for each hash we might download large (unchanged files) even for small updates of the README.md file?

> I mean using the file hash, not the commit hash.

> Small updates of the README.md would only add the new version of README.md to the cache, everything else would remain as is.

But we only work with files in posterior_database

> So you'd have something like posterior-hashes.json

Ok, but how would the user use this?

eerolinna commented 3 years ago

> But we only work with files in posterior_database

You mean that README.md doesn't need to be cached as it is not used by the library? Yeah, you wouldn't cache it at all; that was a bad example, I should've used posteriors.json.

Updated example: so small updates of posteriors.json would only add the new version of posteriors.json to the cache; everything else would remain as is.

> Ok, but how would the user use this?

Externally everything would work the same; the library would internally read posterior-hashes.json and use it to look up the correct version in the cache.
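
A minimal sketch of that internal lookup, assuming the hash-keyed cache layout above (cached_path is a hypothetical helper):

import json
import os

def cached_path(key, root, hashes_file="posterior-hashes.json"):
    # Resolve a logical key like 'posterior_database/alias/posteriors.json'
    # to its content-addressed file in the shared cache
    with open(hashes_file) as f:
        hashes = json.load(f)
    return os.path.join(root, hashes[key]["hash"])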

ahartikainen commented 3 years ago

> Externally everything would work the same; the library would internally read posterior-hashes.json and use it to look up the correct version in the cache.

This is ok. But how is the correct version defined?

I think we don't currently have a way to version control models?

So --> posterior-hashes --> path to hashed_filename --> read posteriors from here.

Then what about Model + Data? They are not read from any json but scraped from the directory.

eerolinna commented 3 years ago

The version is just the hash. I meant that the cache can contain many posteriors.json files at the same time, and the hash lets us know which one is the correct version for the posterior-hashes.json that was used to request posteriors.json.

posterior-hashes.json would include models and data too. So you'd have models/info/accel_gp.info.json as a key there and would use that.

eerolinna commented 3 years ago

posterior-hashes.json would also provide a path towards hosting the files outside github

{
  "posterior_database/alias/posteriors.json": {
    "hash": "47f01b7be6855f9451c70426230ef68db16f982e",
    "download_url": "this url can point to anywhere, for example to AWS S3"
  },
  "..."
}

Of course, currently you need the GitHub API to generate posterior-hashes.json, but ideally the library would just download it from somewhere. This would eliminate the need for the user to have GITHUB_PAT as an environment variable. (To be clear, I'm not suggesting we change this now, just that it is possible later.)

ahartikainen commented 3 years ago

Yeah, I'm not against this, just trying to figure out the structure.