terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License
24 stars 13 forks source link

Upload simulated data to bety-test #107

Closed dlebauer closed 8 years ago

dlebauer commented 8 years ago

Description

Data from the simulated phenotypes (terraref/reference-data#20) are ready for upload to the bety-test database (https://terraref.ncsa.illinois.edu/bety-test).

The full upload is available as a .zip file here: https://uofi.box.com/v/observationsforbetyupload

There are about 2,700,000 rows in the file, each with 6 traits. So this will be a very large upload.

Trying to upload of a subset of 400 rows leads to 504 gateway timeout (I haven't seen this on betydb.org).

Here are two subsets for testing: subset_420_rows.txt and subset_4780_rows.txt

What would be the code for posting via the API?

gsrohde commented 8 years ago

I haven't looked at this closely (and may not have time to for a while), but here are a few observations:

dlebauer commented 8 years ago

Looks like the time before timeout can be set in ngnix.conf, but I can find it. I'll just generate insert statements in R.

dlebauer commented 8 years ago

Done. How to do it in R:


library(data.table)
library(dplyr)
library(tidyr)

# data.csv from https://terraref.ncsa.illinois.edu/bety-test/sites.csv
sites <- fread('~/Downloads/data.csv')
setnames(sites, c('Id', 'Sitename'), c('site_id', 'sitename'))

# data (1).csv from https://terraref.ncsa.illinois.edu/bety-test/cultivars.csv
cultivars <- fread("~/Downloads/data (1).csv")
setnames(cultivars, c('Id', 'Name'), c('cultivar_id', 'cultivar'))

# data (2).csv from https://terraref.ncsa.illinois.edu/bety-test/variables.csv
variables <- fread("~/Downloads/data (2).csv")
#unique(longdata3$trait) %in% variables$Name
setnames(variables, c('Id', 'Name'), c('variable_id', 'trait'))

library(lubridate)

#data <- fread("~/Downloads/subset_420_rows.txt")
data <- fread("~/Downloads/observations_for_bety_upload.csv")
data$entity_id <- rownames(data)
setnames(data, 'site','sitename')

insert <- data %>% 
  gather(key = trait, value = mean, stem_biomass, leaf_biomass, root_biomass,  LAI, NDVI, canopy_height) %>%
  mutate(specie_id = 7000000001) %>% 
  merge(sites[,list(site_id, sitename)], by = 'sitename') %>% 
  merge(cultivars[,list(cultivar_id, cultivar)], by = 'cultivar') %>% 
  merge(variables[,list(variable_id, trait)], by = 'trait') %>% 
  mutate(insert = paste('insert into traits(site_id, date, specie_id, cultivar_id, variable_id, mean, entity_id, created_at, updated_at) values (', paste(site_id, paste0("'",date,"'"), specie_id, cultivar_id, variable_id, mean, entity_id, sep = ","), ', now(), now());')) %>% 
  select(insert)

writeLines(insert[,1], con = '~/insert.sql')

# insert.sql is 3.8 GB 
# rsync insert.sql server:
# ssh to server
# nohup psql -d bety -U bety < insert.sql &
robkooper commented 8 years ago

nginx will have a timeout of 60 seconds. If no response is received after 60 seconds it thinks the remote (bety) is not responding and will return a 504.