Closed majidaldo closed 9 years ago
I'm not the maintainer, but in my opinion this is hard to predict and depends on the hardware in use. Furthermore, converting between CPU and CUDA Theano objects is non-trivial. To do this right you'd need to know the memory available on the GPU, the size of the network, your batch_size, and the size of the input data. I think this is a manual optimization best left to the programmer. Alternatively, providing a function to help profile and search for a good configuration might be the best option.
I agree, this would be quite difficult in practice. For the CPU/GPU abstraction theanets relies on features provided by Theano. (One of the big benefits of Theano is that it provides just this abstraction, so that theanets mostly doesn't have to be aware of it.)
You can always profile your theanets program by running it in Theano's profiling mode:

THEANO_FLAGS=profile=True python my_script.py
This will print a large amount of profiling information that you can use to determine which parts of the computation graph take the most time.
Mmm, yeah. I'm thinking that if you had to do this, you could run two theanets instances as services, one CPU and one GPU, that would get training samples from a dispatcher that knows how best to distribute the work.
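The dispatcher idea above could be sketched roughly like this. This is only an illustration of the routing structure, not theanets code: the "workers" are placeholder threads, and the size threshold is an arbitrary stand-in for whatever heuristic the dispatcher would actually use.

```python
"""Sketch of the two-service idea: a dispatcher routes training batches
to a 'cpu' or 'gpu' worker queue based on batch size. The worker bodies
and the threshold are hypothetical placeholders, not theanets API."""
import queue
import threading

def make_worker(name, in_q, results):
    def run():
        while True:
            batch = in_q.get()
            if batch is None:                    # poison pill: shut down
                break
            results.append((name, len(batch)))   # pretend to train here
    return threading.Thread(target=run)

def dispatch(batches, threshold=64):
    """Send small batches to the CPU queue, large ones to the GPU queue."""
    cpu_q, gpu_q = queue.Queue(), queue.Queue()
    results = []
    workers = [make_worker("cpu", cpu_q, results),
               make_worker("gpu", gpu_q, results)]
    for w in workers:
        w.start()
    for batch in batches:
        (cpu_q if len(batch) < threshold else gpu_q).put(batch)
    for q in (cpu_q, gpu_q):
        q.put(None)
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    batches = [[0] * 8, [0] * 128, [0] * 32, [0] * 256]
    print(sorted(dispatch(batches)))
```

In a real setup each worker would hold its own compiled Theano function (one targeting the CPU, one the GPU), since, as noted above, converting objects between the two backends is non-trivial.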
Also, too bad NVIDIA stopped supporting the ability to run CUDA code on CPUs.
I'm going to go ahead and close this; it doesn't seem feasible to do.
Small problem sizes don't necessarily benefit from GPU computation. Would it be easy to add a few lines that check performance on each processor and then switch to using the faster one?
It could be more granular too, since sample sizes can vary within a training session.
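The benchmark-then-switch idea above could be sketched as follows: time a few trial batches on each backend and commit to whichever is faster. The two "backends" here are placeholder callables standing in for CPU and GPU training steps, not real theanets trainers.

```python
"""Sketch of benchmark-then-switch: run a few timed trials of each
candidate backend on a sample batch, then pick the one with the lowest
median time. The backend callables are hypothetical placeholders."""
import time

def pick_backend(backends, trial_batch, trials=3):
    """Return the name of the backend with the lowest median trial time."""
    timings = {}
    for name, fn in backends.items():
        samples = []
        for _ in range(trials):
            start = time.perf_counter()
            fn(trial_batch)
            samples.append(time.perf_counter() - start)
        samples.sort()
        timings[name] = samples[len(samples) // 2]   # median of the trials
    return min(timings, key=timings.get)

if __name__ == "__main__":
    # Placeholder workloads standing in for per-batch training steps.
    cheap = lambda batch: sum(batch)
    costly = lambda batch: [x * x for x in batch for _ in range(200)]
    best = pick_backend({"cpu": costly, "gpu": cheap}, list(range(2000)))
    print("chose:", best)
```

As noted in the comment above, sample sizes can vary during training, so a more granular version could re-run this check per batch-size bucket rather than once globally.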