taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

How to do initialization for multiprocessing? #5944

Open sconlyshootery opened 2 years ago

sconlyshootery commented 2 years ago

For a program A that uses multiprocessing to run program B, it seems that I can only put ti.init() in B rather than in A, which wastes a lot of time on initialization. Any suggestions?

jim19930609 commented 2 years ago

Hi sconlyshootery, in terms of "multiprocessing" I can think of several different uses, and each of them has different semantics regarding whether it should re-initialize the environment. May I ask for some simple example code to demonstrate how you're using multiprocessing?

The other question is how costly it is if we have to re-initialize in B. Do you have any numbers for the initialization latency?

sconlyshootery commented 2 years ago

Hi, thank you for your kind reply. I aim to produce depth maps from point clouds. I found Taichi really useful for this task; it is about 2x faster than Numba. With it, producing a depth map from 1,000,000+ points takes about 0.2 seconds, of which initialization costs about 0.1 seconds. So if initialization could be done only once, the program would run roughly 2x faster. A simple example is here:

from multiprocessing import Pool

import numpy as np
import taichi as ti


def main(args):
    pool = Pool(processes=args.mt_num)
    # intrinsics and output_size would also need to be bound here,
    # e.g. with functools.partial(projectPoints_ti, intrinsics=..., output_size=...)
    pool.map(projectPoints_ti, [pc1, pc2, pc3, ...])

def projectPoints_ti(pc, intrinsics, output_size):
    """
    pc: 3D points in world coordinates, 3*n
    intrinsics: 3 * 3
    output_size: depth image size (h, w)
    """
    """project to image coordinates"""
    pc = intrinsics @ pc  # 3*n
    pc = pc.T  # n*3
    pc[:, :2] = pc[:, :2] / pc[:, 2][..., np.newaxis]

    h, w = output_size

    ti.init(arch=ti.cpu)  # re-initialized on every call; this is the ~0.1 s overhead in question
    depth = ti.field(dtype=ti.f64, shape=(h, w))

    @ti.kernel
    def pcd2depth(pc: ti.types.ndarray()):

        """get depth"""
        for i in range(pc.shape[0]):
            # check if in bounds
            # use minus 1 to get the exact same value as KITTI matlab code
            x = int(ti.round(pc[i, 0]) - 1)
            y = int(ti.round(pc[i, 1]) - 1)
            z = pc[i, 2]
            if x < 0 or x >= w or y < 0 or y >= h or z <= 0.1:
                continue
            if depth[y, x] > 0:
                depth[y, x] = min(z, depth[y, x])
            else:
                depth[y, x] = z
    pcd2depth(pc)
    return depth.to_numpy()

I am new to Taichi and not sure whether this is the best way to use it. Tips are also welcome.

jim19930609 commented 2 years ago

Hi sconlyshootery, Thanks for providing the example code!

For this use case, it looks like each process is using the same kernel pcd2depth(), but with different pc (Ndarray) and depth (Field) types. In that case, Taichi will compile one kernel for each pc + depth combination, similar to how template functions are handled in C++, and then execute them. Since ti.init() does memory preallocation, and Taichi's compilation and kernel execution are not thread-safe, we're likely to get data conflicts in the "init once, compile and execute with multiple processes" case.
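
For reference, a minimal sketch of at least amortizing the per-call re-initialization to once per worker process, using multiprocessing's initializer hook; projectPoints_ti, INTRINSICS, and OUTPUT_SIZE stand in for the objects from the example above, so this is an illustration rather than a Taichi-endorsed pattern:

from multiprocessing import Pool

import taichi as ti

# INTRINSICS and OUTPUT_SIZE are placeholders for the camera matrix and
# (h, w) from the example above; they are assumptions, not real names.


def worker_init():
    # Runs once when each worker process starts, so every process pays the
    # ti.init() cost a single time instead of once per point cloud.
    ti.init(arch=ti.cpu)


def process_one(pc):
    # Assumes projectPoints_ti has been adapted so that it no longer calls
    # ti.init() on every invocation.
    return projectPoints_ti(pc, INTRINSICS, OUTPUT_SIZE)


def main(point_clouds, num_workers):
    with Pool(processes=num_workers, initializer=worker_init) as pool:
        return pool.map(process_one, point_clouds)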

However, Taichi does have a way to parallelize the compilation and execution of multiple kernels mentioned above, by taking advantage of our Async Executor. For example, pseudocode for the same example with the Async Executor might look like:

def prepare_pc_and_hw(...):
    pc = intrinsics @ pc  # 3*n
    pc = pc.T  # n*3
    pc[:, :2] = pc[:, :2] / pc[:, 2][..., np.newaxis]
    h, w = output_size
    return pc, (h, w)

def main(args):
    pool = Pool(processes=args.mt_num)
    inputs = pool.map(prepare_pc_and_hw, [pc1, pc2, pc3, ...])

    @ti.kernel
    def pcd2depth(pc: ti.types.ndarray(), depth: ti.template()):
           ....

    # Start of Async Execution
    async_engine = ti.AsyncExecutor
    for pc, (h, w) in inputs:
        depth = ti.field(dtype=ti.f64, shape=(h, w))
        async_engine.submit(pcd2depth(pc, depth))
    async_engine.wait()
    ...

Basically, the idea is to put the preparation parts (preparing pc and the h, w used to create depth) in Python's multiprocessing. After all the preparation is done, we switch to Taichi's AsyncEngine to accelerate Taichi's compilation and kernel execution.
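
A more concrete sketch of the same split without the Async Executor, doing the Taichi work serially in the main process after the multiprocessing preparation; prepare_pc_and_hw is assumed from the pseudocode above, the kernel is adapted from the earlier example, and ti.ndarray plus ti.atomic_min with an inf sentinel are used here so that a fresh output buffer can be allocated per input and concurrent points hitting the same pixel do not race:

from multiprocessing import Pool

import numpy as np
import taichi as ti

ti.init(arch=ti.cpu)  # initialized once, in the main process only


@ti.kernel
def pcd2depth(pc: ti.types.ndarray(), depth: ti.types.ndarray(), h: ti.i32, w: ti.i32):
    # depth is pre-filled with +inf; keep the nearest z per pixel
    for i in range(pc.shape[0]):
        x = int(ti.round(pc[i, 0]) - 1)  # -1 to match the KITTI MATLAB convention
        y = int(ti.round(pc[i, 1]) - 1)
        z = pc[i, 2]
        if x >= 0 and x < w and y >= 0 and y < h and z > 0.1:
            # atomic, since several points may project onto the same pixel
            ti.atomic_min(depth[y, x], z)


def main(point_clouds, num_workers):
    # CPU-heavy preprocessing runs in worker processes
    # (prepare_pc_and_hw as sketched above, assumed to return (pc, (h, w)))
    with Pool(processes=num_workers) as pool:
        inputs = pool.map(prepare_pc_and_hw, point_clouds)

    # Taichi part: the kernel is compiled once per type signature, then reused
    depth_maps = []
    for pc, (h, w) in inputs:
        depth = ti.ndarray(dtype=ti.f64, shape=(h, w))  # fresh output buffer per input
        depth.fill(np.inf)
        pcd2depth(pc, depth, h, w)
        d = depth.to_numpy()
        d[np.isinf(d)] = 0.0  # empty pixels back to 0, as in the original example
        depth_maps.append(d)
    return depth_maps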

Let me know whether this approach fits your needs. In addition, since the Async Executor isn't something officially released yet, the above code is seriously "pseudo" code. However, we can try to arrange something working if you are interested in trying it out.

sconlyshootery commented 2 years ago

Hi, Jim. Thank you for your kind reply. My main concern is that the preparation step will produce too much data and overload the machine. I would be very glad to try it out.
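
A small sketch of one way to reduce that memory pressure, under the same assumptions as the sketch above: Pool.imap streams prepared inputs back instead of materializing the full list that map returns (note this only helps if the Taichi loop roughly keeps up with the workers; otherwise results still queue up internally):

with Pool(processes=num_workers) as pool:
    for pc, (h, w) in pool.imap(prepare_pc_and_hw, point_clouds):
        depth = ti.ndarray(dtype=ti.f64, shape=(h, w))
        depth.fill(np.inf)
        pcd2depth(pc, depth, h, w)
        # consume or save depth.to_numpy() here before the next item arrives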

jim19930609 commented 2 years ago

Thanks! Let me also cc @ailzhang and @lin-hitonami since this has something to do with the Async Engine; I guess we'll need some internal discussions first.

oliver-batchelor commented 1 year ago

AsyncExecutor does not seem to exist anymore - did it change name to something else? I'm trying to figure out how I'd use taichi from multiple threads.

jim19930609 commented 1 year ago

Hi oliver, We did deprecate the AsyncExecutor for now since it was not being actively maintained. In some previous offline discussions we did plan to add it back, but there are few valid use cases at the moment.

Can you describe a little bit more about your task, and why multi-threading is important? Thanks in advance!

oliver-batchelor commented 1 year ago

Hi Zhanlue,

I can implement my own queue in a single thread which runs all Taichi operations asynchronously, but I saw this and imagined it might be a more efficient way to do it. For example, data uploading (to the GPU) and kernel execution can be overlapped.
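
For what it's worth, a rough sketch of what I mean by that single-thread queue, assuming it is sufficient that every Taichi call happens on one dedicated thread; the class and method names are illustrative, not a Taichi API:

import queue
import threading
from concurrent.futures import Future

import taichi as ti


class TaichiWorker:
    """Runs every Taichi operation on one dedicated thread; other threads submit work."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        ti.init(arch=ti.gpu)  # the Taichi runtime is only ever touched from this thread
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            fn, args, future = item
            try:
                future.set_result(fn(*args))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, fn, *args):
        future = Future()
        self._tasks.put((fn, args, future))
        return future  # callers can block on .result() or poll .done()

    def shutdown(self):
        self._tasks.put(None)
        self._thread.join()

A GUI thread or a camera grab thread would then call worker.submit(some_taichi_step, frame) and either wait on the returned future or check it later, so slow kernels never block the UI.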

I have a couple of use cases:

1. A camera ISP pipeline where I'm grabbing frames off multiple cameras and processing them. At the moment I have a pool of threads processing camera images, as there is some CPU work too.
2. A GUI application where I have a mix of synchronous, fast operations (such as picking) as well as some slower long-running operations which might take a second or two, but the UI needs to remain responsive.

I have found I can implement these both in Taichi very nicely (previously I used pytorch) - but I've yet to integrate them.

Cheers, Oliver


jim19930609 commented 1 year ago

Hi Oliver, Thanks so much for providing these use cases. Looks like it's gonna be AsyncEngine + Heterogeneous Support (being able to execute kernels on different backends in the same run). Let me bring this topic to our Issue Triage Meeting this Friday. Thanks!