Upload simulated data to bety-test

dlebauer commented 8 years ago

Description

Data from the simulated phenotypes (terraref/reference-data#20) are ready for upload to the bety-test database (https://terraref.ncsa.illinois.edu/bety-test).

The full upload is available as a .zip file here: https://uofi.box.com/v/observationsforbetyupload

There are about 2,700,000 rows in the file, each with 6 traits. So this will be a very large upload.

Trying to upload of a subset of 400 rows leads to 504 gateway timeout (I haven't seen this on betydb.org).

Here are two subsets for testing: subset_420_rows.txt and subset_4780_rows.txt

What would be the code for posting via the API?

gsrohde commented 8 years ago

I haven't looked at this closely (and may not have time to for a while), but here are a few observations:

At what stage do you get the 504 error? I was able to reach the validation page for the 420-row file (but not the 4780-row one) the first time I tried it (but then tried again and got the error). I was able to validate the 4780-row file on betydb.org, though it took quite a while (I didn't time it, but I'm guessing close to 10 minutes).
bety-test is running in a VM, so it probably has more limited computational resources. It's also running under nginx instead of apache, though I don't know if that has bearing on the problem.
Possibly the bottleneck has to do with database lookups: The cultivar, site, and species have to be looked up for each row in order to know what site id, species id, and cultivar id to insert in each new traits row. I would be interesting to try a subset having uniform site, species and cultivar, removing these three columns from the CSV table and specifying them interactively. I think the lookup would then be done only once instead of for each row.

The bulk upload code probably should cache the looked up values, but this doesn't now happen. It might also help to index the lookup columns (for example, sites.sitename and species.scientificname).
When I wrote the bulk upload code, we were still saving session data in cookies which meant only a limited amount of information could be stored in the session. Therefore, multiple stages of the wizard had to re-parse the CSV file and so on. Now that we store session data on the server, this could be changed so that the parsing was only done once and the information cached. This is one respect in which the new upload API is better, since everything happens in one HTTP request. (But I haven't documented this part of the API yet.)

dlebauer commented 8 years ago

Looks like the time before timeout can be set in ngnix.conf, but I can find it. I'll just generate insert statements in R.

dlebauer commented 8 years ago

Done. How to do it in R:


library(data.table)
library(dplyr)
library(tidyr)

# data.csv from https://terraref.ncsa.illinois.edu/bety-test/sites.csv
sites <- fread('~/Downloads/data.csv')
setnames(sites, c('Id', 'Sitename'), c('site_id', 'sitename'))

# data (1).csv from https://terraref.ncsa.illinois.edu/bety-test/cultivars.csv
cultivars <- fread("~/Downloads/data (1).csv")
setnames(cultivars, c('Id', 'Name'), c('cultivar_id', 'cultivar'))

# data (2).csv from https://terraref.ncsa.illinois.edu/bety-test/variables.csv
variables <- fread("~/Downloads/data (2).csv")
#unique(longdata3$trait) %in% variables$Name
setnames(variables, c('Id', 'Name'), c('variable_id', 'trait'))

library(lubridate)

#data <- fread("~/Downloads/subset_420_rows.txt")
data <- fread("~/Downloads/observations_for_bety_upload.csv")
data$entity_id <- rownames(data)
setnames(data, 'site','sitename')

insert <- data %>% 
  gather(key = trait, value = mean, stem_biomass, leaf_biomass, root_biomass,  LAI, NDVI, canopy_height) %>%
  mutate(specie_id = 7000000001) %>% 
  merge(sites[,list(site_id, sitename)], by = 'sitename') %>% 
  merge(cultivars[,list(cultivar_id, cultivar)], by = 'cultivar') %>% 
  merge(variables[,list(variable_id, trait)], by = 'trait') %>% 
  mutate(insert = paste('insert into traits(site_id, date, specie_id, cultivar_id, variable_id, mean, entity_id, created_at, updated_at) values (', paste(site_id, paste0("'",date,"'"), specie_id, cultivar_id, variable_id, mean, entity_id, sep = ","), ', now(), now());')) %>% 
  select(insert)

writeLines(insert[,1], con = '~/insert.sql')

# insert.sql is 3.8 GB 
# rsync insert.sql server:
# ssh to server
# nohup psql -d bety -U bety < insert.sql &

robkooper commented 8 years ago

nginx will have a timeout of 60 seconds. If no response is received after 60 seconds it thinks the remote (bety) is not responding and will return a 504.

terraref / computing-pipeline

Upload simulated data to bety-test #107

Description