Suggestions for transferring large chunks of memory between CPU and QPU?

We achieved a 5x+ speedup with QPULib for the main calculation part!

But we have to keep transferring data back and forth between the CPU and the QPU, which consumes half of the total time.

Following the example code, the only way we have found to do this is an element-by-element loop, for{*(shared_array_ptr) = a;}, which seems too inefficient.
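To make the question concrete, here is a minimal sketch of the pattern we mean. So that it compiles without QPULib, a plain `float*` stands in for the CPU-side pointer into the shared array; that stand-in and the function names `copy_elementwise` and `copy_bulk` are illustrative assumptions, not QPULib API. The second function is the kind of bulk copy we are hoping for. We have not verified whether `std::memcpy` is valid or any faster on the real shared mapping.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Element-by-element copy, i.e. the loop pattern from the example code.
// `shared` is a hypothetical stand-in for the CPU-side view of the shared array.
void copy_elementwise(float* shared, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        shared[i] = src[i];
    }
}

// Bulk copy in a single call.
// Assumption: the destination is ordinary contiguous, CPU-addressable memory.
void copy_bulk(float* shared, const float* src, std::size_t n) {
    std::memcpy(shared, src, n * sizeof(float));
}
```

Both functions leave the destination with the same contents. The open question is which transfer path is fastest when the destination is memory shared with the QPU.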

I think this is a fundamental operation that is worth additional optimization. Any suggestions on how to do it efficiently? For example, like what they do here: https://github.com/nineties/py-videocore/blob/f2a0ef174a936f7a6e11a9e24f34fb555acb84c7/videocore/assembler.py#L692
