sconlyshootery opened this issue 2 years ago
Hi sconlyshootery, in terms of "multiprocessing", I can think of several different uses, and each of them has different semantics regarding whether the environments should be re-initialized. May I ask for a simple code example demonstrating how you're using multiprocessing?
The other question is: how costly is it if we have to re-initialize in B? Do you have any numbers for the initialization latency?
Hi, thank you for your kind reply. I aim to produce depth maps from point clouds, and I found Taichi really useful for this task: it is about 2x faster than Numba. With Taichi, producing a depth map from 1,000,000+ points takes about 0.2 seconds, of which initialization costs 0.1 seconds. So if initialization could be done only once, the program would run about 2x as fast. A simple example:
```python
import numpy as np
import taichi as ti
from multiprocessing import Pool


def main(args):
    pool = Pool(processes=args.mt_num)
    # In practice each call also needs intrinsics and output_size
    # (e.g. via functools.partial).
    pool.map(projectPoints_ti, [pc1, pc2, pc3, ...])


def projectPoints_ti(pc, intrinsics, output_size):
    """
    pc: 3D points in world coordinates, 3*n
    intrinsics: 3 * 3
    output_size: depth image size (h, w)
    """
    # Project to image coordinates.
    pc = intrinsics @ pc  # 3*n
    pc = pc.T  # n*3
    pc[:, :2] = pc[:, :2] / pc[:, 2][..., np.newaxis]
    h, w = output_size

    ti.init(arch=ti.cpu)
    depth = ti.field(dtype=ti.f64, shape=(h, w))

    @ti.kernel
    def pcd2depth(pc: ti.types.ndarray()):
        """Get depth."""
        # The outermost loop is parallelized by Taichi.
        for i in range(pc.shape[0]):
            # Check if in bounds.
            # Use minus 1 to get the exact same value as the KITTI MATLAB code.
            x = int(ti.round(pc[i, 0]) - 1)
            y = int(ti.round(pc[i, 1]) - 1)
            z = pc[i, 2]
            if x < 0 or x >= w or y < 0 or y >= h or z <= 0.1:
                continue
            # Note: this check-then-write can race when two parallel
            # iterations hit the same pixel; ti.atomic_min with a large
            # initial depth value would be safer.
            if depth[y, x] > 0:
                depth[y, x] = min(z, depth[y, x])
            else:
                depth[y, x] = z

    pcd2depth(pc)
    return depth.to_numpy()
```
I am new to Taichi and not sure whether this is the best way to use it. Any tips are welcome.
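Not from the thread, but one present-day way to amortize the init cost: `multiprocessing.Pool` accepts an `initializer` that runs exactly once per worker process, so `ti.init()` could go there instead of inside every task. A minimal, Taichi-free sketch of the pattern (a plain flag stands in for `ti.init()`, and squaring stands in for the projection work; all names here are invented for illustration):

```python
import multiprocessing as mp
import os

_initialized = False


def init_worker():
    # In the real program this is where ti.init(arch=ti.cpu) would go;
    # Pool calls it once per worker process, not once per task.
    global _initialized
    _initialized = True


def process_task(x):
    # Stand-in for projectPoints_ti(); it relies on the one-time init.
    return x * x, _initialized, os.getpid()


def run(n_tasks=8, n_workers=2):
    with mp.Pool(processes=n_workers, initializer=init_worker) as pool:
        return pool.map(process_task, range(n_tasks))
```

With this shape, the 0.1 s initialization is paid once per worker rather than once per point cloud.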
Hi sconlyshootery, thanks for providing the example code!
For this use case, it looks like each process uses the same kernel `pcd2depth()`, but with different `pc` (ndarray) and `depth` (field) types. In that case, Taichi compiles one kernel for each `pc` + `depth` combination (similar to how template functions are handled in C++) and then executes them. Since `ti.init()` preallocates memory, and Taichi's compilation and kernel execution are not thread-safe, we would likely get data conflicts in the "init once, compile and execute with multiple processes" case.
However, Taichi does have a way to parallelize the compilation and execution of multiple kernels, by taking advantage of our Async Executor. For example, pseudocode for the same example with the Async Executor might look like:
```python
def prepare_pc_and_hw(...):
    pc = intrinsics @ pc  # 3*n
    pc = pc.T  # n*3
    pc[:, :2] = pc[:, :2] / pc[:, 2][..., np.newaxis]
    h, w = output_size
    return pc, (h, w)


def main(args):
    pool = Pool(processes=args.mt_num)
    inputs = pool.map(prepare_pc_and_hw, [pc1, pc2, pc3, ...])

    @ti.kernel
    def pcd2depth(pc: ti.types.ndarray(), depth: ti.template()):
        ...

    # Start of async execution
    async_engine = ti.AsyncExecutor
    for pc, (h, w) in inputs:
        depth = ti.field(dtype=ti.f64, shape=(h, w))
        async_engine.submit(pcd2depth(pc, depth))
    async_engine.wait()
    ...
```
Basically, the idea is to put the preparation parts (computing `pc`, plus the `h`, `w` used to create `depth`) in Python's multiprocessing. After all the preparation is done, we switch to Taichi's AsyncEngine to accelerate Taichi's compilation and kernel execution.
Let me know whether this approach fits your needs. In addition, since the Async Executor isn't something officially released yet, the code above is seriously "pseudo" code. However, we can try to arrange something working if you are interested in trying it out.
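Until something like the Async Executor is available, the split suggested above can be approximated with stdlib tools alone: preparation fanned out to a process pool, kernel launches kept in a single process. A sketch with stand-in functions (doubling stands in for the projection math, a dummy reduction for the kernel launch; no Taichi is needed to show the shape of the pipeline):

```python
from concurrent.futures import ProcessPoolExecutor


def prepare(pc):
    # CPU-side preprocessing (the intrinsics @ pc / normalization step);
    # here a stand-in that just doubles the values.
    return [p * 2 for p in pc], (4, 4)


def launch_kernel(pc, hw):
    # Stand-in for the pcd2depth() kernel launch; kernel launches stay
    # in one process because Taichi's runtime is not thread-safe.
    h, w = hw
    return sum(pc) % (h * w)


def pipeline(clouds, workers=2):
    # Fan out the preparation, then execute "kernels" serially.
    with ProcessPoolExecutor(max_workers=workers) as ex:
        prepared = list(ex.map(prepare, clouds))
    return [launch_kernel(pc, hw) for pc, hw in prepared]
```

The real async engine would additionally overlap compilation and execution of the serial stage; this sketch only parallelizes the preparation.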
Hi Jim, thank you for your kind reply. My main concern is that the preparation step will produce too much data and overload the machine. I would be very glad to try it out.
Thanks! Let me also cc @ailzhang and @lin-hitonami since this has something to do with the Async Engine; I guess we'll need some internal discussions first.
`AsyncExecutor` does not seem to exist anymore. Did it change its name to something else? I'm trying to figure out how I'd use Taichi from multiple threads.
Hi Oliver,
We deprecated the `AsyncExecutor` for now since it's not actively maintained. In some previous offline discussions we did plan to add it back, but there are few valid use cases at the moment.
Can you describe a little bit more about your task, and why multi-threading is important? Thanks in advance!
Hi Zhanlue,
I can implement my own queue in a single thread which runs all Taichi operations asynchronously, but I saw this and imagined it might be a more efficient way to do it. For example, data uploads (to the GPU) and kernel execution can be overlapped.
I have a couple of use cases:
1. A camera ISP pipeline where I'm grabbing frames off multiple cameras and processing them; at the moment I have a pool of threads processing camera images since there is some CPU work too.
2. A GUI application with a mix of synchronous, fast operations (such as picking) and some slower long-running operations that might take a second or two, while the UI needs to remain responsive.
I have found I can implement both of these very nicely in Taichi (previously I used PyTorch), but I've yet to integrate them.
Cheers, Oliver
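The single-thread queue Oliver describes maps almost directly onto a one-worker `concurrent.futures.ThreadPoolExecutor`: its single thread owns all (hypothetical) Taichi calls, while camera or GUI threads just submit work and hold `Future`s. A sketch, with `run_kernel` as an invented stand-in for a kernel launch plus readback:

```python
from concurrent.futures import ThreadPoolExecutor

# One worker thread owns every Taichi call; everything else stays responsive.
taichi_thread = ThreadPoolExecutor(max_workers=1)


def run_kernel(x):
    # Stand-in for a Taichi kernel launch plus result readback.
    return x + 1


def submit_frame(x):
    # Callable from any thread; returns immediately with a Future.
    return taichi_thread.submit(run_kernel, x)
```

A GUI can poll `Future.done()` to stay responsive, while fast synchronous operations can simply call `submit_frame(x).result()` and block briefly.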
Hi Oliver,
Thanks so much for providing us these use cases. It looks like this would require `AsyncEngine` + heterogeneous support (the ability to execute kernels on different backends in the same run). Let me bring this topic to our Issue Triage Meeting this Friday. Thanks!
For a program A that uses multiprocessing to run program B, it seems I can only put ti.init() in B rather than in A, which wastes a lot of time on initialization. Any suggestions?