terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Add PostgreSQL Studio application to terraref.ndslabs.org #302

Closed dlebauer closed 7 years ago

dlebauer commented 7 years ago

It would be great to have access to BETYdb using PostgreSQL Studio with either a) read-only access to bety6 or b) a local 'sandbox' copy.

For a read-only connection, this is how we have provided support for GIS users (as described in the documentation).

Users can request ssh login access, but it would be ideal if we could provide a generic, always-available tunnel.

ssh -nNf -L 5432:localhost:5432 <login>@bety6.ncsa.illinois.edu

Then the Postgres server's port is forwarded in a way that lets me treat it as a local database.
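With the tunnel up, nothing is actually mounted; the remote server's port 5432 simply appears on localhost, so any local client works unchanged. For example (the database and user names here are illustrative assumptions, not prescribed by this thread):

```shell
# Through the tunnel, connect as if the database were local
# (database/user names are assumptions for illustration):
psql -h localhost -p 5432 -U bety -d bety -c 'select count(*) from traits;'
```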

Completion Criteria

craig-willis commented 7 years ago

I've added PG studio to the catalog. Given the discussion today, it sounds like the second requirement will require me to come up with a strategy to sync the BETY data to a local instance on the Workbench server.

dlebauer commented 7 years ago

Strategy for syncing BETYdb: @robkooper is the architect of the system, but the steps should be:

  1. On bety6.ncsa.illinois.edu, set up a cron job to dump all contents of bety6.
  2. On a local Postgres server, run load.bety.sh -c -r 6.
  3. Once this is worked out, document the steps required to clone and keep current the TERRA-REF trait database (bety6).
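Steps 1 and 2 could be sketched as crontab entries like the following (the dump flags and output path come from later in this thread; the schedule and the load.bety.sh install path are illustrative assumptions):

```shell
# On bety6.ncsa.illinois.edu: nightly dump of bety6 (flags as used later in this thread)
0 2 * * * /home/bety/dump.bety.sh -q -l 2 -u YES -m 6 -o /home/bety/bety/public/dump/bety6

# On the local Postgres server: refresh the sandbox copy from machine id 6
# (path to load.bety.sh is an assumption)
30 2 * * * /usr/local/bin/load.bety.sh -r 6
```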

A draft workflow follows; it will likely require feedback from @robkooper.

Dump

Currently there are cron jobs that dump records from bety6, but none of them is sufficient.

Import

To import data from bety6 using load.bety.sh:

load.bety.sh -c -r 6 # to create the database
load.bety.sh -r 6     # to update the database in a cron job

should work, but don't. One reason is that it is missing records from remote sites 0 and 8. We do not want to run load.bety.sh -r 0 because we don't want to bloat bety6 with all of the public data and metadata from that system. The subset of records that I have copied from bety0 to bety6 (e.g. variables, pfts, priors) should be kept.
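One hedged sketch of keeping only the shared metadata from bety0 is a table-scoped copy rather than a full sync (the table list comes from the comment above; the hostname, user, and overall approach are assumptions, not the project's actual procedure):

```shell
# Copy only selected metadata tables from bety0, not the whole database
# (hostname/user are placeholders; table list from this comment):
pg_dump -h bety0.example.org -U bety -d bety \
  -t variables -t pfts -t priors | psql -d bety
```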

To import the dump (note that dumpall.sql.gz is a gzipped SQL file, so it is unpacked with gunzip rather than tar, and psql reads the uncompressed file):

cd /tmp
rsync bety6.ncsa.illinois.edu:/home/backup/postgresql/dumpall.sql.gz .
gunzip dumpall.sql.gz
psql -d bety < dumpall.sql
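Written out as a small runnable script, the unpack-and-load pattern looks like this (a sketch: a stand-in dump file is generated locally so the unpack step can be demonstrated anywhere, and the real fetch and load lines are left commented):

```shell
#!/usr/bin/env bash
set -euo pipefail
# Sketch of the fetch/unpack/load steps above (host and paths from this thread).
DUMP=dumpall.sql.gz
# rsync bety6.ncsa.illinois.edu:/home/backup/postgresql/$DUMP .   # real fetch
printf 'select 1;\n' | gzip > "$DUMP"   # stand-in for the real dump
gunzip -f "$DUMP"                       # gzipped SQL, not a tar archive
# psql -d bety < dumpall.sql            # real load into the local database
head -n 1 dumpall.sql                   # prints: select 1;
```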

The pg_dumpall script could be modified to also produce a dump with anonymized users, largely reusing code from line 194 of dump.bety.sh: https://github.com/PecanProject/pecan/blob/master/scripts/dump.bety.sh#L194
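An anonymization pass of the kind described might look like the following (a sketch only; the users column names are assumptions about the BETYdb schema, and the approach actually used in dump.bety.sh should be preferred):

```shell
# Sketch: scrub identifying fields after import (column names assumed)
psql -d bety <<'SQL'
UPDATE users
   SET login = 'user' || id,
       name  = 'anonymous',
       email = '',
       crypted_password = md5(random()::text);
SQL
```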

Related

https://github.com/terraref/reference-data/issues/111 https://github.com/PecanProject/pecan/issues/1355

craig-willis commented 7 years ago

@robkooper Looking again at how best to do this. David mentioned that he'd removed viewer access from the users table (https://github.com/terraref/computing-pipeline/issues/262), so a direct connection may be an option again. As discussed last week, we have a few options:

If we can't get access to 5432 directly from Workbench via the internal network, my inclination is to scp or rsync the dumpall output. I suppose I could set up authorized keys between Workbench and ROGER (or bety6). In this case, I know we'd need to anonymize after import.

Thoughts?

robkooper commented 7 years ago

You can use the updated load.bety.sh script from https://raw.githubusercontent.com/PecanProject/pecan/357b70e21592375827e9c55bc5e1ce8e3426bd78/scripts/load.bety.sh

This allows you to specify the URL to get the data from:

load.bety.sh -c -u -m 99 -r 0 -w https://terraref.ncsa.illinois.edu/bety/dump/bety0/bety.tar.gz
load.bety.sh -m 99 -r 6 -w https://terraref.ncsa.illinois.edu/bety/dump/bety6/bety.tar.gz

This will reset the database (-c) and create some default users (-u).

craig-willis commented 7 years ago

Thanks, Rob. I have an almost-working PostGIS container that uses the load process above. While doing some sanity checking, I noticed that the traits table has no rows. Is that expected?

dlebauer commented 7 years ago

@robkooper did you set up a cron job to dump and load the data? If so could you paste it here so it will be added to the documentation? https://github.com/terraref/documentation/issues/153

dlebauer commented 7 years ago

@robkooper PS thanks for such a quick turnaround!

dlebauer commented 7 years ago

@robkooper regarding the traits table not having any rows ... did you use -l 2 in the dump.bety.sh script to dump data with access_level = 2?

robkooper commented 7 years ago

The following runs every hour on the hour:

#!/bin/bash

wget -q -O /home/bety/dump.bety.sh https://raw.githubusercontent.com/PecanProject/pecan/master/scripts/dump.bety.sh
chmod 755 /home/bety/dump.bety.sh

/home/bety/dump.bety.sh -q -l 2 -u YES -m 0 -o /home/bety/bety/public/dump/bety0
/home/bety/dump.bety.sh -q -l 2 -u YES -m 6 -o /home/bety/bety/public/dump/bety6

robkooper commented 7 years ago

@craig-willis can you check on the import and see if there are any errors? Looking at the dump, there should be traits data:

-rw-r--r-- bety/users 159266947 2017-05-19 09:34 ./traits.csv

craig-willis commented 7 years ago

Thanks, @robkooper . My mistake, it looks like I was querying the table before the traits data had completed loading.

craig-willis commented 7 years ago

@dlebauer @robkooper

I have a first-pass at this running on terraref.ndslabs.org. The Dockerfile is on github:

https://github.com/terraref/toolserver-dockerfiles/tree/kitchensink/bety

You can connect to the DB via Postgres Studio:

IP: 10.0.0.183
Database: bety
User: bety
Password: bety

This is only accessible from within Workbench. You should also be able to connect via any Postgres client.
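For example, from a terminal inside Workbench, using the credentials listed above:

```shell
# Connect with any Postgres client from inside Workbench
psql -h 10.0.0.183 -U bety -d bety
```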

Feedback welcome. In the end, I'll provide a name instead of an IP, but for today you just have the IP.

dlebauer commented 7 years ago

@craig-willis thanks! Could you run commands to restrict the bety user's read access to tables? I think it helps to clear the clutter: https://gist.github.com/dlebauer/4aae063ac38d1dba9859da1ddfb5d2c5#file-revoke_select_from_bety_user-sql
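Judging by the gist's filename, the statements it contains are presumably of this form (illustrative only; the table names below are placeholders, and the actual list is in the linked gist):

```shell
# Illustrative REVOKE of the kind the linked gist contains
# (table names here are placeholders, not the gist's actual list):
psql -d bety -c 'REVOKE SELECT ON TABLE sessions, users FROM bety;'
```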

Also, when I load the SQL Worksheet and try to run

select count(*) from traits;

I get this error message in a popup:

Server returned TRANSPORT_ERROR with no error message


dlebauer commented 7 years ago

Actually, that only happens if I hit enter in the window and then try to run it. Somehow hitting enter logs me out, because when I reload the page I am redirected to the login screen. If I just click the green 'play' button it returns the expected result. Perhaps a bug in the interface; not a big deal.

craig-willis commented 7 years ago

I've run your SQL statements and will add them to the container.

craig-willis commented 7 years ago

@dlebauer I still need to implement the cronjob on this, but is it otherwise working as hoped?

dlebauer commented 7 years ago

Yep it is awesome. Thanks!