gregwchase / eyenet

Identifying diabetic retinopathy using convolutional neural networks
https://www.youtube.com/watch?v=pMGLFlgqxuY
MIT License
195 stars 76 forks

System Requirements for Model Training #5

Closed ghost closed 6 years ago

ghost commented 6 years ago

Hi Greg, I have a few queries; please do answer them.

  1. I am running your code and always get an insufficient-memory or other error. I'd like to know on what configuration your model code will run. I am opting for Google Cloud, so I need the exact configuration: RAM, GPU, processor, and HDD.
  2. I am running it on my laptop with 8 GB RAM and a 4 GB Nvidia 940MX. When executing cnn.py, it shows a CUDA allocation error (insufficient memory). Is it possible to run it on this laptop?
  3. cnn.py, cnn_class.py, cnn_multi.py: what are the differences, and what are their individual system requirements? What is the minimum number of GPUs for running the CNN model?
  4. Is a confusion matrix implementation possible in your code?
  5. Is it possible to give a sample input image after training the model completely and get predict and predict_classes? How do I do that?
  6. After training and saving the model, can we copy it and run it on a lower-configuration laptop like the one above? Can you add Trained_Complete_Model.h5 to the repo?

Please do reply.

gregwchase commented 6 years ago

@dhanasekar416 Answers below!

  1. I used the p2.8xlarge instance on AWS. This consists of K80 GPUs, 488 GB RAM, 32 vCPUs, and ~200 GB of storage on the instance.
  2. You can absolutely run EyeNet on the laptop, but you'll have to alter the code to feed in a certain number of images at a time. Otherwise, you'll run into CUDA memory issues.
  3. cnn.py: the code you want to run; this is the "master" file. cnn_class.py: the same code as cnn.py, reorganized as a class; currently incomplete, but it will become the new version of cnn.py soon. cnn_multi.py: trained across all 5 categories, whereas cnn.py is trained as a binary classification problem with two categories (retinopathy or not).
  4. A confusion matrix is possible, but I've never implemented it.
  5. Same as 4; this is possible, but I've never gone into implementing this. It may be as easy as using "model.predict()" on new images.
  6. It is possible to save the model and run at a lower configuration. Unfortunately, the model is too big to add to the repository.
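
The batch-wise feeding suggested in answer 2 isn't part of the repo; as a rough sketch, a plain Python generator can hold only one batch in memory at a time. Here `load_fn` is a hypothetical placeholder for whatever reads and resizes a single image (e.g. an OpenCV or PIL call):

```python
import numpy as np

def batch_generator(image_paths, labels, batch_size, load_fn):
    """Yield (images, labels) one batch at a time, so only batch_size
    images are ever loaded into memory at once."""
    for start in range(0, len(image_paths), batch_size):
        chunk = image_paths[start:start + batch_size]
        X = np.stack([load_fn(p) for p in chunk])          # (batch, H, W, C)
        y = np.asarray(labels[start:start + batch_size])   # (batch,)
        yield X, y
```

Each yielded `(X, y)` pair could then be fed to the model batch by batch (e.g. via Keras's `train_on_batch`), which keeps GPU memory usage bounded regardless of dataset size.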
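
On answers 4 and 5: a confusion matrix built from prediction outputs isn't implemented in the repo, but a minimal NumPy-only sketch (with made-up labels, standing in for real `model.predict()` results) could look like:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """cm[i, j] counts images whose true class is i and predicted class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for 6 images: 1 = retinopathy, 0 = healthy.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]  # e.g. np.argmax(model.predict(X), axis=1)
print(confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]
```

The diagonal holds correct predictions; off-diagonal cells separate false positives from false negatives, which matters for a screening task like retinopathy detection.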
ghost commented 6 years ago

Thanks for the reply, Greg.

  1. What is the minimum requirement to run it? In one of the posts you said that 100 GB RAM is sufficient. When I change the input image size to (128, 128), my laptop crashes due to insufficient memory; should that normally happen? Not to mention (256, 256). Please specify the minimum specs to run it.
  2. Do 512x512 images give better performance in your model than 256x256? Is there any difference?
  3. Is there any way to share the model after completely training it, like an S3 download link? What versions of TensorFlow, CUDA, and cuDNN did you use?
  4. I am thinking of using a Compute Engine instance with 32 CPUs, 120 GB RAM, and 4-8 Nvidia K80 GPUs. Will that suffice? How long did it take you to train the model, from the preprocessing stage to saving the model? Is it possible to run on an instance with CPU and RAM alone, and if so, what requirements would it need?
  5. What does stacking the same Conv2D layers achieve?
  6. Can you post an image of the neural network architecture, one showing the input and output neurons of each layer separately? E.g. https://bit.ly/2J2baA5, but with details like what is sent to each neuron and each layer separately?
gregwchase commented 6 years ago

@dhanasekar416 More answers below.

  1. 100GB of RAM is more than sufficient; the instance just comes with more than was needed. Scaling down to 128x128 shouldn't give you issues with that much DRAM.

  2. I did try with 512x512, but found no noticeable improvement.

  3. I might have an S3 download link in the future, once EyeNet is a little more performant. TensorFlow: 1.2, CUDA: 8, cuDNN: 7 (I think this is correct).

  4. That Compute Engine should work, no problem; the GPUs are the priority with this type of problem. For this problem, training took roughly 30-40 minutes with 8 GPUs. I would not recommend training on CPUs.

  5. My model architecture follows the VGG architecture. VGG stacks multiple conv layers, then pools, as opposed to layer > pool > layer > pool, etc. Applying blocks of layers with the same filter size multiple times extracts more complex and representative features.

  6. I've tried to do this, but felt it took more time than it was worth. It might be something I add in the future!
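
The benefit of stacking in answer 5 can be made concrete with a little receptive-field arithmetic (a sketch assuming stride-1 convolutions, not code from the repo):

```python
def receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 conv layers:
    each k x k layer widens the field by k - 1 pixels."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Two stacked 3x3 convs see the same 5x5 patch as one 5x5 conv,
# but with fewer weights per channel pair (2 * 9 = 18 vs 25) and an
# extra non-linearity between them.
print(receptive_field([3, 3]))     # 5
print(receptive_field([5]))        # 5
print(receptive_field([3, 3, 3]))  # 7
```

This is the standard argument behind VGG-style blocks: depth buys the same spatial coverage more cheaply, plus more non-linear capacity.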

ghost commented 6 years ago

@gregwchase Do you have any detailed report on this project explaining it clearly, like why you used 3 conv layers, the underlying principle, and other details? If you do, can you post it? And if I resize images during preprocessing to 128x128, will there still be good accuracy?

karankumar-07 commented 6 years ago

Hi Greg. If I use the original pixel values instead of resizing, will it improve my accuracy? I am planning to use an AWS p2.16xlarge.

gregwchase commented 6 years ago

@CodeRed1704 Since all images are of varying sizes, they need to be resized identically.

That said, using larger images can improve accuracy; for this project, I didn't notice an improvement using images at 512x512 resolution. Because of this, they were resized to 256x256. This yielded faster training, with comparable results.
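
The identical-resize step can be illustrated without any image library; this nearest-neighbour version is only a sketch, not the repo's actual preprocessing (which would more plausibly use OpenCV or PIL):

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize so every image shares one shape before training."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows][:, cols]

# Fundus photos of varying sizes all become 256x256.
img = np.random.rand(1024, 1536, 3)
print(resize_nearest(img, 256).shape)  # (256, 256, 3)
```

Whatever resizing method is used, the key point stands: every image must end up with the same shape so batches can be stacked into one tensor.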

A p2.16xlarge instance may be overkill; I'd suggest a p2.8xlarge.