I haven't looked at this closely (and may not have time to for a while), but here are a few observations:
Possibly the bottleneck has to do with database lookups: the cultivar, site, and species have to be looked up for each row in order to know which site id, species id, and cultivar id to insert into each new traits row. It would be interesting to try a subset with a uniform site, species, and cultivar, removing those three columns from the CSV and specifying them interactively; I think the lookup would then be done only once instead of once per row.
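For example, a minimal R sketch of preparing such a uniform subset (assuming the upload file has site and cultivar columns, as in the script later in this thread, with species handled as a constant):
library(data.table)
obs <- fread("~/Downloads/observations_for_bety_upload.csv")
# keep only rows that share one site/cultivar combination (the first one, as a placeholder)
uniform <- obs[site == obs$site[1] & cultivar == obs$cultivar[1]]
# drop the columns that would instead be chosen interactively in the bulk-upload form
uniform[, c("site", "cultivar") := NULL]
fwrite(uniform, "~/Downloads/uniform_subset.csv")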
The bulk upload code probably should cache the looked-up values, but it currently does not. It might also help to index the lookup columns (for example, sites.sitename and species.scientificname).
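A minimal sketch of what those indexes could look like in SQL (the index names are placeholders, and BETYdb may already define some of them):
create index index_sites_on_sitename on sites (sitename);
create index index_species_on_scientificname on species (scientificname);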
Looks like the time before timeout can be set in nginx.conf, but I can't find it. I'll just generate insert statements in R.
Done. How to do it in R:
library(data.table)
library(dplyr)
library(tidyr)
library(lubridate)
# data.csv from https://terraref.ncsa.illinois.edu/bety-test/sites.csv
sites <- fread('~/Downloads/data.csv')
setnames(sites, c('Id', 'Sitename'), c('site_id', 'sitename'))
# data (1).csv from https://terraref.ncsa.illinois.edu/bety-test/cultivars.csv
cultivars <- fread("~/Downloads/data (1).csv")
setnames(cultivars, c('Id', 'Name'), c('cultivar_id', 'cultivar'))
# data (2).csv from https://terraref.ncsa.illinois.edu/bety-test/variables.csv
variables <- fread("~/Downloads/data (2).csv")
# sanity check: every trait name in the reshaped data should appear in variables$Name
setnames(variables, c('Id', 'Name'), c('variable_id', 'trait'))
#data <- fread("~/Downloads/subset_420_rows.txt")
data <- fread("~/Downloads/observations_for_bety_upload.csv")
# use the row number as an entity id so the six trait records from each row stay linked
data$entity_id <- rownames(data)
setnames(data, 'site', 'sitename')
insert <- data %>%
  # reshape to long format: one row per site/date/trait combination
  gather(key = trait, value = mean, stem_biomass, leaf_biomass, root_biomass, LAI, NDVI, canopy_height) %>%
  mutate(specie_id = 7000000001) %>%
  # look up the foreign keys once per name via joins instead of once per row
  merge(sites[, list(site_id, sitename)], by = 'sitename') %>%
  merge(cultivars[, list(cultivar_id, cultivar)], by = 'cultivar') %>%
  merge(variables[, list(variable_id, trait)], by = 'trait') %>%
  # note: rows with a missing mean would produce invalid SQL ('NA') and should be filtered out first
  mutate(insert = paste('insert into traits(site_id, date, specie_id, cultivar_id, variable_id, mean, entity_id, created_at, updated_at) values (', paste(site_id, paste0("'",date,"'"), specie_id, cultivar_id, variable_id, mean, entity_id, sep = ","), ', now(), now());')) %>%
  select(insert)
# select the column by name so writeLines gets a character vector, one statement per line
writeLines(insert$insert, con = '~/insert.sql')
# insert.sql is 3.8 GB
# rsync insert.sql server:
# ssh to server
# nohup psql -d bety -U bety < insert.sql &
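After the load finishes, a quick sanity check on the server might look like this (a sketch; it relies on created_at being set to now() by the generated statements):
-- roughly 2,700,000 rows x 6 traits, so about 16 million new records expected
select count(*) from traits where created_at > now() - interval '1 day';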
nginx has a default timeout of 60 seconds; if no response is received within that time it assumes the upstream (bety) is not responding and returns a 504.
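For reference, a minimal sketch of how that timeout could be raised in nginx.conf (assuming bety sits behind a proxy_pass block; the location and upstream name here are placeholders):
location / {
    proxy_pass http://bety;
    # allow long-running bulk uploads; the default proxy_read_timeout is 60s
    proxy_read_timeout 600s;
}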
Description
Data from the simulated phenotypes (terraref/reference-data#20) are ready for upload to the bety-test database (https://terraref.ncsa.illinois.edu/bety-test).
The full upload is available as a .zip file here: https://uofi.box.com/v/observationsforbetyupload
There are about 2,700,000 rows in the file, each with 6 traits, so roughly 16 million trait records in total. This will be a very large upload.
Trying to upload a subset of 400 rows leads to a 504 Gateway Timeout (I haven't seen this on betydb.org).
Here are two subsets for testing: subset_420_rows.txt and subset_4780_rows.txt
What would be the code for posting via the API?
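For reference, a minimal sketch of what a per-record POST might look like with httr. The endpoint path, key parameter, and payload fields below are assumptions to check against the BETYdb API docs, and posting ~16 million records one at a time would likely be much slower than the SQL approach above:
library(httr)
library(jsonlite)
betyurl <- "https://terraref.ncsa.illinois.edu/bety-test"   # bety-test instance from above
apikey  <- "YOUR_API_KEY"                                   # placeholder
# example values only; the ids would come from the same lookups used in the R script above
record <- list(site_id = 9000000001, specie_id = 7000000001, cultivar_id = 9000000001,
               variable_id = 9000000001, date = "2016-05-01", mean = 1.23)
response <- POST(paste0(betyurl, "/api/v1/traits.json"),
                 query = list(key = apikey),
                 body = toJSON(list(trait = record), auto_unbox = TRUE),
                 content_type_json())
stop_for_status(response)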