schwilklab / skyisland-climate

Climate data and code for Sky Island project

Reconstructing historical daily temperature summaries: computational nightmare #35

Closed dschwilk closed 7 years ago

dschwilk commented 8 years ago

Matrix multiplication for converting predicted PCA loadings and predicted PCA scores back to tmin/tmax across the landscape and time is memory intensive.

The code below illustrates the problem: ts is the full predicted historical PCA scores (14360 dates) and tl is the full landscape tmin loadings for the DM (1290564 locations). So the resulting matrix has 1290564 x 14360 cells (about 1.85e10), far more than can be held in memory as a dense double matrix.

 res <- as.matrix(ts[,2:3]) %*% t(as.matrix(tl[,3:4]))
Error: cannot allocate vector of size 138.1 Gb
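
As a back-of-envelope check (a sketch using the dimensions quoted above, not code from the repo), the dense result alone accounts for the ~138 GB in the error:

n_loc   <- 1290564   # DM landscape points
n_dates <- 14360     # historical dates
n_loc * n_dates * 8 / 2^30   # 8 bytes per double cell: ~138.1 GiB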

I'm ashamed to say I did not foresee this problem.

dschwilk commented 8 years ago

If we subsample the landscape to 1/100 of the original resolution, this can run on my machine (and in seconds). But that reduces the data to 1 percent of the original. Other ideas?
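
For the record, the subsampling amounts to something like the following sketch (the 1-in-100 stride is an assumption, not the project's actual code; column indices match the snippet above):

# Keep every 100th landscape point before forming the loadings matrix,
# reducing the result to about 1% of the full landscape.
tl_sub <- tl[seq(1, nrow(tl), by = 100), ]
res_sub <- as.matrix(ts[, 2:3]) %*% t(as.matrix(tl_sub[, 3:4]))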

dschwilk commented 8 years ago

Ooh, I bet I can do it in chunks: maybe decadal chunks on the temporal side and tenths of the original landscape on the other.
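
For the landscape side, something along these lines (a sketch only; the chunk size, object names, and output file names are illustrative, not the repo's actual code):

# Multiply in landscape chunks so only one slice of the result is ever in
# memory, writing each chunk's predicted time series to its own RDS file.
scores   <- as.matrix(ts[, 2:3])   # dates x PCs
loadings <- as.matrix(tl[, 3:4])   # landscape points x PCs
chunk_size <- 2500                 # landscape points per chunk (illustrative)
starts <- seq(1, nrow(loadings), by = chunk_size)

for (i in seq_along(starts)) {
  rows  <- starts[i]:min(starts[i] + chunk_size - 1, nrow(loadings))
  chunk <- scores %*% t(loadings[rows, , drop = FALSE])   # dates x chunk points
  saveRDS(chunk, sprintf("tmin_chunk_%04d.rds", i))
  rm(chunk)
  gc()
}

Each chunk of the result is then only about 14360 x 2500 doubles (roughly 275 MB), which fits comfortably in memory.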

dschwilk commented 8 years ago

Ok, so if I load bigmemory (for as.big.matrix) AND the bigalgebra package, I can get this to at least try to run, but it still crashes. Perhaps this becomes a workable solution if I move the calculations to the Linux cluster at TTU?
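
For reference, the bigmemory/bigalgebra route looks roughly like the sketch below. The file names, backing directory, and handling of the result are assumptions; bigalgebra supplies the %*% method for big.matrix operands, but the product itself still needs on the order of 138 GB of storage somewhere.

library(bigmemory)
library(bigalgebra)

dir.create("bigmem", showWarnings = FALSE)

# File-backed matrices keep the operands out of RAM; the loadings matrix is
# transposed in memory before conversion since it has only 2 columns.
scores_bm <- as.big.matrix(as.matrix(ts[, 2:3]),
                           backingfile = "scores.bin",
                           backingpath = "bigmem",
                           descriptorfile = "scores.desc")
loadings_t_bm <- as.big.matrix(t(as.matrix(tl[, 3:4])),
                               backingfile = "loadings_t.bin",
                               backingpath = "bigmem",
                               descriptorfile = "loadings_t.desc")

# bigalgebra dispatches this to BLAS dgemm; the result is itself a big.matrix.
res_bm <- scores_bm %*% loadings_t_bm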

dschwilk commented 7 years ago

OK, I have access to a high memory queue and will try to get this working. Info on job submissions below:

I have given you access to the ivy-highmem queue. Each node has around 256GB of
memory and 20 cores. You will need to update your job submission script to
change the queue name and the requested number of cores. Here is an example
submission script:

#!/bin/sh
#$ -V
#$ -N Ivy-highmem-job
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -cwd
#$ -S /bin/bash
#$ -P hrothgar
#$ -pe fill 40
#$ -q ivy-highmem
dschwilk commented 7 years ago

Running (in chunks) as of ce8d45b, but storage needs are enormous. Currently running on a hrothgar high-memory node and about 1/4 of the way through CM tmin after three hours, so total run time is only several days; that is not terrible. But I am chunking the data into 2500-location chunks (i.e. portions of the landscape at a time), and each 2500-point xy chunk is about 600 MB for the historical tmin predictions.

dschwilk commented 7 years ago

For now I am simply splitting into landscape chunks, and this works. I have successfully run tmin for the CM, which results in 765 RDS files (each a data frame with time series for 1500 landscape points); the total size of these RDS files is 171 GB. But the DM and GM will be larger. @hpoulos: can we clip these landscapes first? Can you help with that? This would happen in predict-spatial.R.
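
Not the repo's actual code, but the kind of clipping meant here could be as simple as a bounding-box filter on the landscape data frame in predict-spatial.R (the column names x/y and the extent values below are placeholders):

# Drop landscape points outside the study-area extent before prediction,
# so the number of chunks and the total RDS output shrink proportionally.
ext <- list(xmin = -105, xmax = -104, ymin = 30, ymax = 31)   # placeholder extent

tl_clipped <- subset(tl,
                     x >= ext$xmin & x <= ext$xmax &
                     y >= ext$ymin & y <= ext$ymax)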

dschwilk commented 7 years ago

See also #37 and #36; solving these will allow more rapid computation.

dschwilk commented 7 years ago

Saving only annual summaries essentially solves this (see #36).
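
For context, the reduction is roughly as in this sketch, which collapses one chunk of daily predictions (dates x locations) to per-year summaries before saving (the file names, row-name date format, and choice of summaries are assumptions):

daily <- readRDS("tmin_chunk_0001.rds")           # one chunk of daily predictions
years <- format(as.Date(rownames(daily)), "%Y")   # assumes dates stored as row names

# Collapse ~14360 daily values per location to one value per year,
# cutting storage by roughly a factor of 365.
annual_mean <- apply(daily, 2, function(v) tapply(v, years, mean))
annual_min  <- apply(daily, 2, function(v) tapply(v, years, min))

saveRDS(list(mean = annual_mean, min = annual_min), "tmin_chunk_0001_annual.rds")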