Closed lyogavin closed 4 years ago
We achieved a 5x+ speedup with QPULib for the main calculation part!
But we have to keep transferring data back and forth between the CPU and the QPU, which takes up half of the total time.
Following the example code, the only way we found to do this is an element-by-element loop, for{*(shared_array_ptr) = a;}, which seems inefficient.
I think this is a fundamental operation that is worth additional optimization. Any suggestions on how to do it efficiently? For example, like what is done here: https://github.com/nineties/py-videocore/blob/f2a0ef174a936f7a6e11a9e24f34fb555acb84c7/videocore/assembler.py#L692
It looks like we can memcpy directly to the arm_base pointer of the mmap'd memory. That is about 5x faster than copying in a loop.
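For anyone comparing the two approaches, here is a minimal sketch of the difference. It does not use QPULib: an ordinary heap buffer stands in for the mmap'd shared region, and the names `copy_by_loop`, `copy_by_memcpy`, and `self_check` are hypothetical helpers made up for illustration. How you obtain the `arm_base` pointer from a shared array depends on your QPULib version and is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Element-by-element copy, like the loop in the question.
// Assumption: the shared array is accessed through a volatile pointer,
// which forces one separate store per element.
void copy_by_loop(volatile float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Bulk copy. On real hardware dst would be the CPU-side (arm_base)
// pointer of the mmap'd region; here it is just a plain buffer.
void copy_by_memcpy(float* dst, const float* src, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(float));
}

// Checks that both copies produce the same contents as the source.
void self_check(std::size_t n) {
    std::vector<float> src(n), a(n, 0.0f), b(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        src[i] = 0.5f * static_cast<float>(i);
    }
    copy_by_loop(a.data(), src.data(), n);
    copy_by_memcpy(b.data(), src.data(), n);
    assert(a == src);
    assert(b == src);
}
```

Both functions leave the same bytes in the destination, so any speed difference comes only from how the stores are issued. The 5x figure above was measured on the real mmap'd memory, and a plain heap buffer like this one will not necessarily reproduce it.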