terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Add error return code to terraref.sh at ncap2 error on insufficient memory #158

Closed yanliu-chn closed 8 years ago

yanliu-chn commented 8 years ago

@czender as we discussed, it'd be good to return an error code when ncap2 fails because not enough memory can be allocated.

Here is an example of the error. Currently the program continues to finish with return code 0.

(pyenv)ubuntu@hyperspectral-ex-vm:~$ ./computing-pipeline/scripts/hyperspectral/terraref.sh -d 1 -I terraref-hyperspectral-input-sample -O output
Terraref hyperspectral data workflow invoked with:
terraref.sh -d 1 -I terraref-hyperspectral-input-sample -O output
Hyperspectral workflow scripts in directory /home/ubuntu/computing-pipeline/scripts/hyperspectral
NCO version "4.6.1" from directory /srv/sw/nco-4.6.1/bin
Intermediate/temporary files written to directory /tmp
Final output stored in directory output
Input #00: terraref-hyperspectral-input-sample/0596c17f-2e4c-4d43-9d77-cde8ffbde663_raw
trn(in)  : terraref-hyperspectral-input-sample/0596c17f-2e4c-4d43-9d77-cde8ffbde663_raw
trn(out) : /tmp/terraref_tmp_trn.nc.pid6791.fl00.tmp
ncks -O --trr_wxy=955,1600,468 --trr typ_in=NC_USHORT --trr typ_out=NC_USHORT --trr ntl_in=bil --trr ntl_out=bsq --trr_in=terraref-hyperspectral-input-sample/0596c17f-2e4c-4d43-9d77-cde8ffbde663_raw /home/ubuntu/computing-pipeline/scripts/hyperspectral/dummy.nc /tmp/terraref_tmp_trn.nc.pid6791.fl00.tmp
att(in)  : /tmp/terraref_tmp_trn.nc.pid6791.fl00.tmp
att(out) : /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp
ncatted -O --gaa terraref_script=terraref.sh --gaa terraref_hostname=hyperspectral-ex-vm --gaa terraref_version="4.6.1" -a "Conventions,global,o,c,CF-1.5" -a "Project,global,o,c,TERRAREF" --gaa history="Thu Aug 25 16:55:55 UTC 2016: terraref.sh -d 1 -I terraref-hyperspectral-input-sample -O output" /tmp/terraref_tmp_trn.nc.pid6791.fl00.tmp /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp
jsn(in)  : terraref-hyperspectral-input-sample/0596c17f-2e4c-4d43-9d77-cde8ffbde663_raw
jsn(out) : /tmp/terraref_tmp_jsn.nc.pid6791
python /home/ubuntu/computing-pipeline/scripts/hyperspectral/JsonDealer.py terraref-hyperspectral-input-sample/0596c17f-2e4c-4d43-9d77-cde8ffbde663_raw /tmp/terraref_tmp_jsn.nc.pid6791.fl00.tmp
mrg(in)  : /tmp/terraref_tmp_jsn.nc.pid6791.fl00.tmp
mrg(out) : /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp
ncks -A /tmp/terraref_tmp_jsn.nc.pid6791.fl00.tmp /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp
clb(in)  : /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp
clb(out) : /tmp/terraref_tmp_clb.nc.pid6791.fl00.tmp
ncap2 -A -S /home/ubuntu/computing-pipeline/scripts/hyperspectral/terraref.nco /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp;/bin/mv -f /tmp/terraref_tmp_att.nc.pid6791.fl00.tmp /tmp/terraref_tmp_clb.nc.pid6791.fl00.tmp
ncap2: ERROR nco_malloc() unable to allocate 5720832000 B = 5586750 kB = 5455 MB = 5 GB
ncap2: INFO NCO has reported a malloc() failure. malloc() failures usually indicate that your machine does not have enough free memory (RAM+swap) to perform the requested operation. As such, malloc() failures result from the physical limitations imposed by your hardware. Read http://nco.sf.net/nco.html#mmr for a description of NCO memory usage. The likeliest case is that this problem is caused by inadequate RAM on your system, and is not an NCO bug. If so, there are two potential workarounds: First is to process your data in smaller chunks, e.g., smaller or more hyperslabs. The second is to use a machine with more free memory, so that malloc() succeeds.

Large tasks may uncover memory leaks in NCO. This is likeliest to occur with ncap2. ncap2 scripts are completely dynamic and may be of arbitrary length and complexity. A script that contains many thousands of operations may uncover a slow memory leak even though each single operation consumes little additional memory. Memory leaks are usually identifiable by their memory usage signature. Leaks cause peak memory usage to increase monotonically with time regardless of script complexity. Slow leaks are very difficult to find. Sometimes a malloc() failure is the only noticeable clue to their existence. If you have good reasons to believe that your malloc() failure is ultimately due to an NCO memory leak (rather than inadequate RAM on your system), then we would like to receive a detailed bug report.
rip(in)  : /tmp/terraref_tmp_clb.nc.pid6791.fl00.tmp
rip(out) : output/0596c17f-2e4c-4d43-9d77-cde8ffbde663.nc
/bin/mv -f /tmp/terraref_tmp_clb.nc.pid6791.fl00.tmp output/0596c17f-2e4c-4d43-9d77-cde8ffbde663.nc
Cleaning-up intermediate files...
Quick views of last processed data file and its original image (if any):
ncview  output/0596c17f-2e4c-4d43-9d77-cde8ffbde663.nc &
panoply output/0596c17f-2e4c-4d43-9d77-cde8ffbde663.nc &
open terraref-hyperspectral-input-sample/0596c17f-2e4c-4d43-9d77-cde8ffbde663_image.jpg
Elapsed time 1m50s
(pyenv)ubuntu@hyperspectral-ex-vm:~$ echo $?
0
(pyenv)ubuntu@hyperspectral-ex-vm:~$
czender commented 8 years ago

@yanliu-chn the problem was indeed that terraref.sh was ignoring a non-zero return code due to recent changes. Now fixed. Please pull and try again, and close the issue if you receive a non-zero return code on exit.

czender commented 8 years ago

@yanliu-chn also, FYI, I reduced the memory required by ncap2 from 5x sizeof(raw) to 4x sizeof(raw). Best to allot 4.1x so as not to cut it too close. The footprint cannot be further reduced without significantly slowing the processing: half the memory holds the original image (promoted from short to float) and the other half holds the computed reflectance (float), so 4x the raw image size is the smallest natural footprint of the computation without using loops.
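A quick back-of-envelope check of that 4x figure, using the image dimensions from the ncks command in the log (`--trr_wxy=955,1600,468`, NC_USHORT raw data at 2 bytes per sample; the variable names below are illustrative):

```shell
#!/bin/sh
# Total samples: wavelength x pixel x scanline, from --trr_wxy=955,1600,468
samples=$((955 * 1600 * 468))

raw=$((samples * 2))          # NC_USHORT raw image: 2 bytes/sample
img_float=$((samples * 4))    # image promoted from short to float
rfl_float=$((samples * 4))    # computed reflectance, also float
peak=$((img_float + rfl_float))  # peak footprint = 4x sizeof(raw)

echo "raw  bytes: ${raw}"
echo "peak bytes: ${peak}"
```

This yields a peak of 5720832000 B, i.e., exactly the 5 GB allocation that failed in the log above, consistent with a 4x sizeof(raw) working set.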

yanliu-chn commented 8 years ago

I confirm, after testing both on the VM that previously hit the error and on ROGER, where it succeeded, that the changes you made work as expected. Thanks a lot for fixing this!

Please feel free to close this ticket.