viennacl / viennacl-dev

Developer repository for ViennaCL. Visit http://viennacl.sourceforge.net/ for the latest releases.

Feature Request: Allow mixed type operations #124

Open jeremysalwen opened 9 years ago

jeremysalwen commented 9 years ago

For example, I have an array of floats and a boolean mask

viennacl::vector<float> a;
viennacl::vector<bool> b;

I would like to be able to do

viennacl::vector<float> c = viennacl::linalg::element_prod(a, b);

Right now I just implement b as a vector of floats instead. I'm not sure whether using an integer type would actually be a performance improvement, but it would be nice to be able to compare the two.
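For reference, the workaround currently looks roughly like this (just a sketch; the size and the fill step are placeholders, and the include is from memory):

#include <cstddef>
#include "viennacl/vector.hpp"

void apply_mask_example()
{
  std::size_t n = 10000;                        // hypothetical size
  viennacl::vector<float> a(n);                 // data
  viennacl::vector<float> b(n);                 // mask stored as 0.0f / 1.0f floats
  // ... fill a and b, e.g. via viennacl::copy() from host-side std::vectors ...
  viennacl::vector<float> c = viennacl::linalg::element_prod(a, b);  // zeroes out the masked entries of a
}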

karlrupp commented 9 years ago

Thanks! This request is an extension of https://github.com/viennacl/viennacl-dev/issues/80

Do you have a case where mixed type operations constitute a performance bottleneck? We have not yet run into such a case, hence the priority for mixed type operations remained low.

jeremysalwen commented 9 years ago

As I am writing the implementation using ViennaCL, I don't have a way to check directly. (I don't think I could write a well-tuned custom kernel either.) I can give you the circumstance in which it seemed natural to perform this operation.

I am calculating on a neural network with dropout. Without dropout, the calculation of a single layer looks like ({B, A, R} matrices, {x, y, r} vectors)

B = tanh(A*x + y)

with dropout, we randomly set half of the values of A and y to zero, so the calculation looks like

B = tanh((A.*R)x + y.*r)

where R is a matrix of random booleans, and r is a vector of random booleans. The calculation of A.*R in particular is large.
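With the current API I express that layer roughly as follows, storing R and r as 0/1 floats (a sketch with made-up sizes; I am assuming the element-wise matrix product and the element-wise tanh are available as viennacl::linalg::element_prod and viennacl::linalg::element_tanh):

#include <cstddef>
#include "viennacl/matrix.hpp"
#include "viennacl/vector.hpp"
#include "viennacl/linalg/prod.hpp"

void dropout_layer_example()
{
  std::size_t rows = 512, cols = 784;                      // hypothetical layer sizes

  viennacl::matrix<float> A(rows, cols), R(rows, cols);    // weights and 0/1 float mask
  viennacl::vector<float> x(cols), y(rows), r(rows);       // input, bias, bias mask
  // ... fill A, x, y and draw the random masks R and r ...

  viennacl::matrix<float> A_masked(rows, cols);
  viennacl::vector<float> y_masked(rows);
  A_masked = viennacl::linalg::element_prod(A, R);         // A .* R
  y_masked = viennacl::linalg::element_prod(y, r);         // y .* r

  viennacl::vector<float> pre(rows);
  pre  = viennacl::linalg::prod(A_masked, x);              // (A .* R) * x
  pre += y_masked;                                         // ... + y .* r

  viennacl::vector<float> B(rows);
  B = viennacl::linalg::element_tanh(pre);                 // tanh(...), element_tanh assumed available
}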

karlrupp commented 9 years ago

Considering A.*R in double precision, this would allow for 17 bytes per entry rather than 24 bytes per entry, saving about 30 percent if memory channels can efficiently transfer single bytes. The latter isn't always the case, plus we would have to run an additional just-in-time compilation step with an overhead of about 1 second. Thus, you may still be better off just using the same types ;-)
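(Spelled out: reading A costs 8 bytes, reading a one-byte mask R costs 1 byte, and writing the result costs 8 bytes, giving 17 bytes per entry; with a double mask it is 8 + 8 + 8 = 24 bytes, so the saving is 1 - 17/24, about 29 percent in the ideal case.)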

jeremysalwen commented 9 years ago

This operation is far and away the bottleneck of my application, so I don't think the 1 second of overhead will matter. The 30% memory reduction sounds very promising to me, but I know little to nothing about memory channel efficiency, so I will shut my mouth :)

ptillet commented 9 years ago

Yes, to add to what Karl said, I think it would certainly be possible to achieve peak bandwidth even for mixed precision operations, but it would require quite some tuning. In your drop-out case, I think you could update your weight matrix right after training (since doing dropout multiple times in a row is statistically meaningless):

// post-processing
A = element_prod(A, R);
y = element_prod(y, r);
// ...
B = tanh(A*x + y);

You may be able to use R only as a temporary matrix, which will reduce the memory footprint (and hence the PCI-E load) of your program.
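A minimal sketch of the temporary-mask idea, assuming A is the viennacl::matrix<float> holding your weights (the scope is only there so that the mask's device buffer is released once it has been folded into A):

{
  viennacl::matrix<float> R(A.size1(), A.size2());   // mask lives only inside this scope
  // ... fill R with the random 0/1 mask ...
  A = viennacl::linalg::element_prod(A, R);          // fold the mask into the weights
}  // R is destroyed here, so its device buffer can be released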


jeremysalwen commented 9 years ago

I think you're right that I could just generate R and r as random vectors on the GPU. In such a case, would integer vs floating point matter at all for performance?

ptillet commented 9 years ago

Yes, sorry, I got confused about drop-out; you are right. In this case you are bound by A.*R and y.*r because you are not using mini-batches. GPUs are very good at matrix-matrix multiplications, but in the matrix-vector case the PCI-E and OpenCL setup latency is likely to kill you. I would advise you to preprocess your dataset into a std::list<viennacl::matrix> (or, if you use std::vector, be very, very careful: the copy constructors of ViennaCL objects are deep), so that memory-bound operations don't slow down your code too much.
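Something like the following is what I have in mind (just a sketch; I used vectors since each sample x is a vector, but the same applies to matrices, and the names are only an illustration). The key point is that each sample is copied to the device once and afterwards only referenced:

#include <cstddef>
#include <list>
#include <vector>
#include "viennacl/vector.hpp"

void preprocess_samples(std::vector<std::vector<float> > const & host_samples)
{
  std::list<viennacl::vector<float> > device_samples;

  for (std::size_t i = 0; i < host_samples.size(); ++i)
  {
    device_samples.push_back(viennacl::vector<float>(host_samples[i].size()));
    viennacl::copy(host_samples[i], device_samples.back());   // one host-to-device transfer per sample
  }

  // later, iterate by reference: copying a ViennaCL object would copy its device buffer
  for (std::list<viennacl::vector<float> >::iterator it = device_samples.begin();
       it != device_samples.end(); ++it)
  {
    viennacl::vector<float> & x = *it;   // reference only, no deep copy
    // ... run the training step on x ...
  }
}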

On Sat, Mar 7, 2015 at 3:46 PM, Jeremy Salwen notifications@github.com wrote:

Hmm, I think things are perhaps a bit unclear. The main loop of my application is roughly

for x in samples:
    copy_to_gpu(x)
    R = random_boolean_matrix()
    r = random_boolean_vector()
    B = tanh((A.*R)x + y.*r)
    ...
    A += some_correction(A, R, x, y, r)


jeremysalwen commented 9 years ago

Judging based on the metrics here, it seems using integer representations really does improve performance:

http://cs.nyu.edu/~wanli/dropc/ They report a >3x speedup from switching from float to bit representations of the mask.

I don't think memory transfer would actually be an issue if everything is kept on the GPU between iterations (including generating R and r), except for the sample x.

EDIT: It does seem that memory on the device is the issue:

"A key component to successfully training with DropConnect is the selection of a different mask for each training example. Selecting a single mask for a subset of training examples, such as a mini-batch of 128 examples, does not regularize the model enough in practice. Since the memory requirement for the M's now grows with the size of each mini-batch, the implementation needs to be carefully designed as described in Section 5"

(Note: their M is my R.) They essentially say that the masks don't fit in device memory if R is represented as a floating-point matrix.
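To put rough numbers on it: if the mask has the same shape as A, say n x m, then keeping one mask per example in a 128-example mini-batch needs 128 * n * m * 4 bytes as floats, but only 128 * n * m / 8 bytes if bit-packed (a factor of 32; still a factor of 4 if each bool takes a full byte), which is presumably why they need the compact representation.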

ptillet commented 9 years ago

They probably obtain a speed-up because they are using texture memory, which we don't handle in ViennaCL. It is also hard to say whether they get a speedup because their kernel is not optimized for global memory, or whether the bitwise representation genuinely helps.
