nitishsrivastava / deepnet

Implementation of some deep learning algorithms.
BSD 3-Clause "New" or "Revised" License

Memory usage on AWS #24

Closed. rdaniel closed this issue 11 years ago.

rdaniel commented 11 years ago

Hi Nitish,

Thanks for making deepnet available. I'm looking forward to working with it. I'm able to run most of the examples, but when I try the multimodal_dbn example, I get MemoryErrors while extract_rbm_representation.py is running. I'm running on an AWS Cluster GPU instance, which has 6GB on the GPU and 22GB of main memory. I've been progressively shrinking the gpu_mem value (from 4G to 2G to 1G) and the main_mem value (20G, 18G, 16G, 10G, 8G). With them set to 1G and 8G, respectively, the image layer 1 extract completed for the first time, and layer 2 trained, but then the extract on layer 2 failed. The error dump is appended.

Any suggestions on how to fix this? I notice that extract_rbm_representation also has an additional memory=10G setting; does that need to be no larger than the main_mem setting?

Thanks, Ron

Writing to /vol/FlickrPreproc/flickr/dbn_reps/image_rbm2_LAST/train 998
Traceback (most recent call last):
  File "/home/ubuntu/src/deepnet-master/deepnet/extract_rbm_representation.py", line 81, in <module>
    main()
  File "/home/ubuntu/src/deepnet-master/deepnet/extract_rbm_representation.py", line 76, in main
    data_proto=data_proto)
  File "/home/ubuntu/src/deepnet-master/deepnet/extract_rbm_representation.py", line 40, in ExtractRepresentations
    layernames, output_dir, memory=memory, dataset=dataset, input_recon=True)
  File "/home/ubuntu/src/deepnet-master/deepnet/dbm.py", line 360, in WriteRepresentationToDisk
    datagetter()
  File "/home/ubuntu/src/deepnet-master/deepnet/neuralnet.py", line 370, in GetTrainBatch
    self.GetBatch(self.train_data_handler)
  File "/home/ubuntu/src/deepnet-master/deepnet/dbm.py", line 229, in GetBatch
    super(DBM, self).GetBatch(handler=handler)
  File "/home/ubuntu/src/deepnet-master/deepnet/neuralnet.py", line 361, in GetBatch
    data_list = handler.Get()
  File "/home/ubuntu/src/deepnet-master/deepnet/datahandler.py", line 627, in Get
    batch = self.gpu_cache.Get(self.batchsize, get_last_piece=self.get_last_piece)
  File "/home/ubuntu/src/deepnet-master/deepnet/datahandler.py", line 396, in Get
    self.LoadData()
  File "/home/ubuntu/src/deepnet-master/deepnet/datahandler.py", line 332, in LoadData
    self.data[i].overwrite(mat)
  File "/home/ubuntu/src/deepnet-master/cudamat/cudamat.py", line 161, in overwrite
    array = reformat(array)
  File "/home/ubuntu/src/deepnet-master/cudamat/cudamat.py", line 1621, in reformat
    return np.array(array, dtype=np.float32, order='F')
MemoryError
./runall_dbn.sh: line 71:  3880 Segmentation fault (core dumped) python ${extract_rep} ${model_output_dir}/image_rbm2_LAST trainers/dbn/train_CD_image_layer2.pbtxt image_hidden2 ${data_output_dir}/image_rbm2_LAST ${gpu_mem} ${cpu_mem}
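For context on where this blows up: the reformat call at cudamat.py line 1621 shown above makes a fresh Fortran-ordered float32 copy of whatever chunk the data handler has staged in host memory, so both the staged chunk and the copy are alive at the same time. A minimal illustration of that pattern (the array shape and input dtype here are arbitrary, not deepnet's actual chunk size):

    import numpy as np

    # Stand-in for a chunk the data handler stages in host memory
    # (shape and float64 dtype chosen only for illustration).
    chunk = np.random.rand(8192, 4096)

    # Same pattern as cudamat.py:1621 -- np.array with an explicit dtype
    # and order always allocates a new array, so both copies coexist.
    fortran_copy = np.array(chunk, dtype=np.float32, order='F')

    print('%.0f MB + %.0f MB held simultaneously' %
          (chunk.nbytes / 2.0**20, fortran_copy.nbytes / 2.0**20))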

ghost commented 11 years ago

Yes, that 10G should be made smaller.

rdaniel commented 11 years ago

Hi again Nitish.

I've tried 8G, 4G, and 1G for the main_mem value (with gpu_mem=1G). They all still die at line 1621. On a minor note, I noticed that the runall_dbn.sh script uses a $cpu_mem variable that is not defined in the script. However, things still die at the same line if I replace that with the main_mem value. My next step is to debug this and keep an eye on memory use, but if you have other ideas, that would be great.

Thanks, Ron

rdaniel commented 11 years ago

I think I've found it. It didn't take long once I had time to step through the code in the debugger. extract_rbm_representation.py takes the gpu_mem and cpu_mem parameters, but it also has a hard-coded memory=10G parameter. That constant 10G value was passed along to the WriteRepresentationToDisk() call. I replaced it with the main_mem parameter and things worked. I'll close this issue now.
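For anyone hitting the same thing, here is a rough sketch of the kind of change involved. This is illustrative only, not the actual deepnet source; the real function signature and argument handling in extract_rbm_representation.py differ, and the extractor below is a stand-in.

    import sys

    def ExtractRepresentations(model_file, train_op_file, layer_name, output_dir, memory):
        # Stand-in for deepnet's extractor; only the memory argument matters here.
        print('Extracting %s from %s with a %s host-side buffer' %
              (layer_name, model_file, memory))

    def main():
        # Argument order mirrors the runall_dbn.sh invocation shown in the traceback:
        # model, trainer pbtxt, layer name, output dir, GPU memory, CPU memory.
        model_file, train_op_file, layer_name, output_dir, gpu_mem, cpu_mem = sys.argv[1:7]

        # Before the fix: memory was always the hard-coded '10G', ignoring cpu_mem.
        # ExtractRepresentations(model_file, train_op_file, layer_name, output_dir,
        #                        memory='10G')

        # After the fix: use the CPU/main memory limit the caller asked for.
        ExtractRepresentations(model_file, train_op_file, layer_name, output_dir,
                               memory=cpu_mem)

    if __name__ == '__main__':
        main()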