Open robiwano opened 5 years ago
Ok, I see that it is "half done". The parameter passing code is there in Kernel.h, but the associated mkArg is not, resulting in a linking error. Would really appreciate this being looked at :)
@mn416 thoughts on this ?
Hi @robiwano,
I agree, this is a desirable feature. There seem to be two possible approaches:
1. Support Ptr<Ptr<Float>> in kernel arguments.
2. Support a new variant of kernel call, say callWithUniforms, where the first argument is an std::vector, and this vector can be read in stream fashion inside the kernel. To read the next element of the stream, the kernel simply calls getUniform().
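To make the second approach concrete, here is a minimal sketch of the stream-reading idea in plain C++. It is not QPULib code: the UniformStream class, its remaining() helper, and the sumAddresses function are hypothetical names invented for illustration, and only the getUniform() name comes from the proposal above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed uniform stream; not part of QPULib.
class UniformStream {
public:
    explicit UniformStream(std::vector<uint32_t> values)
        : values_(std::move(values)) {}

    // Return the next uniform and advance; throws once the stream is exhausted.
    uint32_t getUniform() {
        if (next_ >= values_.size())
            throw std::out_of_range("uniform stream exhausted");
        return values_[next_++];
    }

    // Number of uniforms that have not been read yet.
    std::size_t remaining() const { return values_.size() - next_; }

private:
    std::vector<uint32_t> values_;
    std::size_t next_ = 0;
};

// Hypothetical "kernel": reads a count, then that many values, in order.
inline uint32_t sumAddresses(UniformStream& uniforms) {
    uint32_t count = uniforms.getUniform();
    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; i++) sum += uniforms.getUniform();
    return sum;
}
```

The point of the design is that the host decides how many values to pass at call time, and the kernel consumes them strictly in order, so no fixed argument list is needed.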
As you say, it looks like (1) is half done. You could try adding
template <> inline Ptr<Ptr<Float>> mkArg< Ptr<Ptr<Float>> >() {
  Ptr<Ptr<Float>> x;
  x = getUniformPtr<Ptr<Float>>();
  return x;
}
to Kernel.h. If that works, then on the ARM side you can create a SharedArray<float*> as follows:
SharedArray<float> floatsA(256);
SharedArray<float> floatsB(256);
SharedArray<float*> floatPointers(16);
floatPointers[0] = floatsA.getPointer();
floatPointers[1] = floatsB.getPointer();
> As you say, it looks like (1) is half done. You could try adding
> template <> inline Ptr<Ptr<Float>> mkArg< Ptr<Ptr<Float>> >() { Ptr<Ptr<Float>> x; x = getUniformPtr<Ptr<Float>>(); return x; }
Yes, I already tried that and it does make it link. However, the emulator crashes with an access violation. This is the test kernel:
void gpu_test(Int n, Ptr<Ptr<Float>> a_s)
{
  Ptr<Float> p = a_s[0];
  Float val = *p;
  Print(val);
}
and the crash is in Emulator.cpp:
// LD2: wait for DMA completion
case LD2: {
  assert(s->dmaLoad.active);
  uint32_t hp = (uint32_t) s->dmaLoad.addr.intVal;
  int vpmAddr = NUM_LANES * (4*s->id + (s->dmaLoad.buffer == A ? 0 : 1));
  for (int i = 0; i < NUM_LANES; i++) {
    state.vpm[vpmAddr+i].intVal = emuHeap[hp>>2]; <<<< access violation
    hp += 4*(s->readStride+1);
  }
  s->dmaLoad.active = false;
  break;
}
It seems that s->dmaLoad.addr.intVal has an invalid value.
Thanks for the debug info.
The getPointer() method in SharedArray.h looks wrong:
T* getPointer() {
return (T*) &emuHeap[address];
}
I think it should be:
T* getPointer() {
return (T*) address;
}
Of course, the return value should never actually be dereferenced on the ARM side, only inside a kernel.
I don't think I rely on the current getPointer() definition anywhere, but it might be worth doing a grep -r just to check.
Correction, I think it's:
T* getPointer() {
return (T*) (address*4);
}
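The reasoning behind this correction can be sketched in a small, self-contained model. This is an illustration under assumptions, not the QPULib implementation: WordArray, emulatedLoad, and followPointerTable are hypothetical names, the heap holds plain 32-bit words rather than floats, and the only details taken from the thread are that address is a word index and that the emulator indexes emuHeap with the address shifted right by two.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the emulator heap: an array of 32-bit words.
static std::vector<uint32_t> emuHeap(1024);

// Hypothetical stand-in for SharedArray in emulation mode; 'address' is a
// word index into emuHeap.
struct WordArray {
    uint32_t address;  // index of the first word in emuHeap
    uint32_t size;     // number of words

    uint32_t& operator[](uint32_t i) { return emuHeap[address + i]; }

    // Byte address as seen by a kernel: word index times four.
    uint32_t getPointer() const { return address * 4; }
};

// Mirrors the emulator lookups in this thread, which index with (addr >> 2).
inline uint32_t emulatedLoad(uint32_t byteAddr) {
    return emuHeap[byteAddr >> 2];
}

// Build a one-entry table of pointers and follow it the way a kernel would.
inline uint32_t followPointerTable() {
    WordArray data{16, 4};   // occupies words 16..19
    WordArray table{32, 1};  // occupies word 32
    data[0] = 42;
    table[0] = data.getPointer();                   // stores byte address 64
    uint32_t p = emulatedLoad(table.getPointer());  // read the stored address
    return emulatedLoad(p);                         // dereference it
}
```

In this model, storing a host pointer such as &emuHeap[address] in the table would make the second lookup index far outside the heap, which is consistent with the access violation reported above.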
> Correction, I think it's:
> T* getPointer() { return (T*) (address*4); }
You mean the same impl as for getAddress()?
Ok, that seems to work in the emulator! Tonight I'll be able to try on the Pi zero.
Hmm... I seem to get a crash when doing gather/receive. My test kernel accumulates input vectors into an output vector:
#define USE_GATHER_RECEIVE 1
void gpu_test(Int num_inputs, Int inputs_length, Ptr<Ptr<Float>> inputs, Ptr<Float> output)
{
  For(Int i = 0, i < num_inputs, i = i + 1) {
    Ptr<Float> ptr_in = inputs[i];
    Ptr<Float> ptr_out = output;
#if USE_GATHER_RECEIVE
    gather(ptr_in); gather(ptr_out);
#endif
    Float val_in, val_out;
    For(Int n = 0, n < inputs_length, n = n + 16) {
#if USE_GATHER_RECEIVE
      gather(ptr_in + 16); gather(ptr_out + 16);
      receive(val_in); receive(val_out);
      store(val_in + val_out, ptr_out);
#else
      val_in = *ptr_in;
      val_out = *ptr_out;
      *ptr_out = val_in + val_out;
#endif
      ptr_in = ptr_in + 16;
      ptr_out = ptr_out + 16;
    } End
#if USE_GATHER_RECEIVE
    receive(val_in); receive(val_out);
#endif
  } End
}
Setting USE_GATHER_RECEIVE to 0 yields the correct output, but setting it to 1 causes an access violation in Emulator.cpp:
case SPECIAL_HOST_INT: {
return;
}
case SPECIAL_TMU0_S: {
assert(s->loadBuffer->numElems < 8);
Vec val;
for (int i = 0; i < NUM_LANES; i++) { <<< i == 3
uint32_t a = (uint32_t) v.elems[i].intVal; <<< a = 0xcdcdcdcd
val.elems[i].intVal = emuHeap[a>>2]; <<< access violation
}
s->loadBuffer->append(val);
return;
}
default:
break;
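One way to narrow down a fault like this is to replace the raw emuHeap[a>>2] lookup with a bounds-checked read while debugging. The sketch below is a hypothetical helper, not part of QPULib; checkedHeapRead and its parameters are invented for illustration. Note that 0xcdcdcdcd is the pattern the Microsoft debug C runtime writes into freshly allocated heap memory, so the failing lane may simply hold a value that was never initialized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical debugging helper, not part of QPULib: a bounds-checked
// version of the emuHeap[a >> 2] lookup that names the lane and address.
inline uint32_t checkedHeapRead(const std::vector<uint32_t>& heap,
                                uint32_t byteAddr, int lane) {
    std::size_t word = byteAddr >> 2;  // byte address to word index
    if (word >= heap.size()) {
        std::ostringstream msg;
        msg << "lane " << lane << ": address 0x" << std::hex << byteAddr
            << " is outside the emulated heap";
        throw std::out_of_range(msg.str());
    }
    return heap[word];
}
```

With a check like this, the emulator would report which lane carried the bad address instead of faulting, which makes it easier to see whether only some lanes of the pointer vector are invalid.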
In order to run an algorithm on batches of vectors, I'd like to be able to send a vector of pointers to arrays. Example:
Possible ?