starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Is there a high-level API to mark data in a handle as "dirty" without deallocating it, as `starpu_data_invalidate_submit` does? #35

Closed Muxas closed 5 months ago

Muxas commented 5 months ago

Is your feature request related to a problem? Please describe.

I am developing software to train neural networks. Training proceeds in iterations, and each iteration generates temporary data and then consumes it. You can think of this data as a STARPU_SCRATCH buffer that lives for one iteration. It would therefore be beneficial to pass this information to StarPU. The current options are starpu_data_wont_use and starpu_data_invalidate_submit, but neither fits my use case. starpu_data_wont_use hints that the data should be offloaded from GPU to CPU, but the data is never reused, so copying it back to the CPU just wastes bandwidth. starpu_data_invalidate_submit deallocates the memory buffers, which requires unpinning them first. So either the data is written back to the CPU, or memory is reallocated and pinned/unpinned. The pin/unpin cycle is very costly: it makes my application 10 times slower.

Describe the solution you'd like

I would like a way to hint to StarPU that the data is "dirty" or "uninitialized" without triggering anything else: no copy and no deallocation.

sthibaul commented 5 months ago

starpu_data_invalidate_submit deallocates memory buffers

Does it? The allocation cache is supposed to kick in here to avoid the dealloc/alloc.

We can indeed introduce a function to achieve only the uninitialize part and not the deallocation part, but the buffers will be marked as clean and will be quickly reused or deallocated if anything else wants some data.

Muxas commented 5 months ago

I tried starpu_data_invalidate_submit with StarPU-1.3.10 a long time ago, probably in summer 2023. I cannot remember the actual numbers, but performance with starpu_data_invalidate_submit was several times lower than without such calls. In nvidia-smi I saw memory usage constantly jumping up and down at a high frequency, and the trace was full of allocations/deallocations and memory pinning/unpinning. Is it different in StarPU-1.4?

sthibaul commented 5 months ago

It's not supposed to do this unless it fails to reuse the buffer for another piece of data of the same size. Possibly there was a bug preventing buffer reuse? (Which we'd want to fix anyway.)

starpu_data_invalidate_submit by itself doesn't tell starpu to really free the buffer, it only tells it to put it on the list available for reuse (or for freeing if it's a different shape of buffer that is needed).
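The reuse behaviour described above can be pictured with a small toy model. This is only a sketch under stated assumptions: `toy_cache`, `toy_alloc` and `toy_release` are hypothetical names invented for illustration and are not StarPU internals, and the model ignores pinning, memory nodes, and the case where a cached buffer is freed to make room for a differently sized one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: none of these names exist in StarPU. */
#define CACHE_SLOTS 8

struct toy_cache {
    void  *buf[CACHE_SLOTS];   /* cached buffers, NULL when the slot is empty */
    size_t size[CACHE_SLOTS];  /* size of each cached buffer */
    int    real_allocs;        /* requests that actually reached malloc() */
    int    real_frees;         /* releases that actually reached free() */
};

/* Return a buffer of `size` bytes, reusing a cached buffer of the same size if any. */
static void *toy_alloc(struct toy_cache *c, size_t size)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->buf[i] != NULL && c->size[i] == size) {
            void *p = c->buf[i];
            c->buf[i] = NULL;  /* cache hit: no real allocation happens */
            return p;
        }
    }
    c->real_allocs++;          /* cache miss: fall through to a real allocation */
    return malloc(size);
}

/* Model of an "invalidate": the buffer leaves its owner and is parked in the cache. */
static void toy_release(struct toy_cache *c, void *p, size_t size)
{
    assert(p != NULL);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->buf[i] == NULL) {
            c->buf[i] = p;
            c->size[i] = size;
            return;
        }
    }
    c->real_frees++;           /* cache full: only now is the buffer really freed */
    free(p);
}
```

In this model, releasing a buffer and then requesting the same size again leaves `real_allocs` unchanged, which is the situation where no dealloc/alloc (and hence no pin/unpin) should show up in a trace. A request for a different size misses the cache and increments `real_allocs`.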

Muxas commented 5 months ago

According to the documentation, the last line of https://files.inria.fr/starpu/doc/html/AdvancedDataManagement.html#DataManagementAllocation says:

the buffers containing the current value will then be freed,
and reallocated only when another task writes some value to the handle.

From this I concluded that starpu_data_invalidate_submit causes deallocation and unpinning.

sthibaul commented 5 months ago

Yes, starpu_data_invalidate_submit does cause deallocation from the point of view of the handle, but that doesn't cause an actual deallocation because starpu uses an allocation cache, so it just puts the buffer in the cache.

That being said, since it was easy, I have now added starpu_data_deinitialize{,_submit}, which doesn't push the buffer to the cache, so the handle has a higher chance of not seeing its buffers reused for another handle.
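The difference between the two calls, as described in this comment, can be sketched with another toy model. All names here (`toy_handle`, `toy_invalidate`, `toy_deinitialize`, `toy_write`) are hypothetical and chosen for illustration; the real StarPU functions operate on `starpu_data_handle_t` and involve per-node replicates that this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: a handle with a single buffer and a validity flag. */
struct toy_handle {
    void  *buf;    /* buffer currently attached to the handle, or NULL */
    size_t size;   /* size the handle needs when it is written */
    bool   valid;  /* true when buf holds a meaningful value */
};

/* Model of invalidate: the value is dropped and the buffer leaves the handle.
 * The returned pointer stands for the buffer going to the allocation cache. */
static void *toy_invalidate(struct toy_handle *h)
{
    void *p = h->buf;
    h->buf = NULL;
    h->valid = false;
    return p;
}

/* Model of deinitialize: the value is dropped but the buffer stays attached. */
static void toy_deinitialize(struct toy_handle *h)
{
    h->valid = false;
}

/* Model of a task writing the handle. Returns true if a buffer had to be obtained. */
static bool toy_write(struct toy_handle *h)
{
    bool needed_buffer = (h->buf == NULL);
    if (needed_buffer)
        h->buf = malloc(h->size);
    assert(h->buf != NULL);
    h->valid = true;
    return needed_buffer;
}
```

In this model a write after `toy_deinitialize` never needs a new buffer, while a write after `toy_invalidate` always does, and whether that request is cheap depends on the cache still holding a buffer of the right size.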

Muxas commented 5 months ago

I tried the latest starpu-1.3 commit and found that starpu_data_invalidate_submit works, while starpu_data_deinitialize_submit segfaults my program. Here is the backtrace (I will recompile StarPU in debug mode if needed, as some values are optimized out):

/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(_starpu_select_src_node+0x44a)[0x15548dea857a]
/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(_starpu_create_request_to_fetch_data+0xd71)[0x15548dea9711]
/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(starpu_memchunk_tidy+0x721)[0x15548debce11]
/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(___starpu_datawizard_progress+0x29)[0x15548deaff49]
/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(__starpu_datawizard_progress+0x2a7)[0x15548deb0277]
/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(_starpu_cuda_driver_run_once+0x4f5)[0x15548df12a95]
/trinity/home/al.mikhalev/Install/starpu-1.3-a100/lib/libstarpu-1.3.so.10(_starpu_cuda_worker+0x95)[0x15548df131f5]
/lib64/libpthread.so.0(+0x7ea5)[0x15555511dea5]
/lib64/libc.so.6(clone+0x6d)[0x155554535b0d]

[starpu][_starpu_select_src_node][assert failure] The data for the handle 0x55555cdcc950 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?

python3: ../../src/datawizard/coherency.c:68: _starpu_select_src_node: Assertion `0 && "src_node_mask != 0"' failed.
Missing separate debuginfos, use: debuginfo-install nvidia-driver-latest-NVML-550.54.14-1.el7.x86_64 nvidia-driver-latest-cuda-libs-550.54.14-1.el7.x86_64

Thread 5 "CUDA 0" received signal SIGABRT, Aborted.
[Switching to Thread 0x15543e358700 (LWP 2153219)]
Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
0x000015555446d387 in raise () from /lib64/libc.so.6
(gdb) bt full
Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
Python Exception <type 'exceptions.ImportError'> No module named gdb.frames: 
#0  0x000015555446d387 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x000015555446ea78 in abort () from /lib64/libc.so.6
No symbol table info available.
Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
#2  0x00001555544661a6 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
#3  0x0000155554466252 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
#4  0x000015548dea85ca in _starpu_select_src_node (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
handle=handle@entry=0x55555cdcc950, destination=destination@entry=0) at ../../src/datawizard/coherency.c:68
        src_node = -1
        i = <optimized out>
        nnodes = 5
        node = <optimized out>
        size = 83886080
        cost = inf
        src_node_mask = <optimized out>
        __func__ = "_starpu_select_src_node"
        __PRETTY_FUNCTION__ = "_starpu_select_src_node"
        i_ram = <optimized out>
        i_gpu = <optimized out>
        i_disk = <optimized out>
#5  0x000015548dea9711 in _starpu_create_request_to_fetch_data (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
handle=handle@entry=0x55555cdcc950, dst_replicate=0x55555cdcca50, mode=mode@entry=STARPU_R, 
    is_prefetch=is_prefetch@entry=STARPU_IDLEFETCH, async=async@entry=1, callback_func=callback_func@entry=0x0, callback_arg=0x0, prio=0, origin=0x15548df32053 "starpu_memchunk_tidy")
    at ../../src/datawizard/coherency.c:552
        requesting_node = <optimized out>
        nwait = <optimized out>
        __PRETTY_FUNCTION__ = "_starpu_create_request_to_fetch_data"
        src_node = -1
        src_nodes = {0, 0, 5120, 0}
        dst_nodes = {0, 0, 0, 0}
        handling_nodes = {0, 0, 0, 0}
        write_invalidation = <optimized out>
        nhops = <optimized out>
        requests = 0x0
        reused_requests = 0x0
        hop = <optimized out>
#6  0x000015548debce11 in starpu_memchunk_tidy (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
node=node@entry=1) at ../../src/datawizard/memalloc.c:1232
        handle = 0x55555cdcc950
        target_node = <optimized out>
        mc = <optimized out>
        orig_next_mc = 0x1552aa45eb40
        next_mc = 0x1552aa45eb40
        skipped = <optimized out>
        total = <optimized out>
        available = <optimized out>
        target = <optimized out>
        amount = <optimized out>
        __PRETTY_FUNCTION__ = "starpu_memchunk_tidy"
        __func__ = "starpu_memchunk_tidy"
        warned = 0
#7  0x000015548deaff49 in ___starpu_datawizard_progress (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
memory_node=memory_node@entry=1, may_alloc=may_alloc@entry=1, push_requests=push_requests@entry=1)
    at ../../src/datawizard/datawizard.c:43
        ret = 0
#8  0x000015548deb0277 in __starpu_datawizard_progress (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
may_alloc=may_alloc@entry=1, push_requests=push_requests@entry=1) at ../../src/datawizard/datawizard.c:101
        worker = <optimized out>
        memnode = 1
        __PRETTY_FUNCTION__ = "__starpu_datawizard_progress"
        current_worker_id = 0
        ret = 0
        nnodes = 5
#9  0x000015548df12a95 in _starpu_cuda_driver_run_once (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
worker_set=worker_set@entry=0x15548e196ba0 <cuda_worker_set>) at ../../src/drivers/cuda/driver_cuda.c:963
        worker0 = <optimized out>
        tasks = 0x15543e357d60
        task = <optimized out>
        j = <optimized out>
        i = 1
        res = 1
        idle_tasks = 1
        idle_transfers = 0
        __func__ = "_starpu_cuda_driver_run_once"
        __PRETTY_FUNCTION__ = "_starpu_cuda_driver_run_once"
#10 0x000015548df131f5 in _starpu_cuda_worker (Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
_arg=0x15548e196ba0 <cuda_worker_set>) at ../../src/drivers/cuda/driver_cuda.c:1104
        worker_set = 0x15548e196ba0 <cuda_worker_set>
        i = <optimized out>
#11 0x000015555511dea5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
Python Exception <type 'exceptions.NameError'> Installation error: gdb._execute_unwinders function is missing: 
#12 0x0000155554535b0d in clone () from /lib64/libc.so.6
No symbol table info available.

sthibaul commented 5 months ago

Ah, it's trying to write the data back to make room on the GPU, but of course there is no data left to save :) I have now added a call to mark the memchunk as already clean.
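The failure and the fix described here can be illustrated with a small toy model of the tidy pass. This is a sketch under assumptions: `toy_chunk`, `toy_tidy` and the two `toy_deinitialize_*` helpers are hypothetical names, and the real logic in `memalloc.c` handles locking, multiple memory nodes and asynchronous requests that are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one memory chunk on a GPU node. */
struct toy_chunk {
    bool has_value;  /* the handle still has a valid value somewhere */
    bool clean;      /* the chunk can be dropped without a writeback */
};

/* Model of the tidy pass: write back every chunk that is not clean.
 * Returns the number of writebacks, or -1 when asked to write back a
 * chunk whose handle has no valid value (the assertion in the backtrace). */
static int toy_tidy(struct toy_chunk *chunks, int n)
{
    assert(chunks != NULL);
    int writebacks = 0;
    for (int i = 0; i < n; i++) {
        if (chunks[i].clean)
            continue;              /* nothing to save for this chunk */
        if (!chunks[i].has_value)
            return -1;             /* no source copy to fetch from */
        chunks[i].clean = true;    /* value saved elsewhere */
        writebacks++;
    }
    return writebacks;
}

/* Deinitialize before the fix: the value is dropped, the chunk still looks dirty. */
static void toy_deinitialize_buggy(struct toy_chunk *c)
{
    c->has_value = false;
}

/* Deinitialize after the fix: the chunk is also marked as already clean. */
static void toy_deinitialize_fixed(struct toy_chunk *c)
{
    c->has_value = false;
    c->clean = true;
}
```

With the buggy variant, a later tidy pass reaches the deinitialized chunk and fails. With the fixed variant, the pass skips that chunk and only writes back the chunks that still hold a value.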

Muxas commented 5 months ago

And now the starpu-1.3 branch fails to compile:

  CC       libstarpu_1.3_la-memalloc.lo
../../src/datawizard/memalloc.c: In function '_starpu_memchunk_clean':
../../src/datawizard/memalloc.c:1718:44: error: implicit declaration of function '_starpu_get_node_struct'; did you mean '_starpu_get_worker_struct'? [-Werror=implicit-function-declaration]
 1718 |         struct _starpu_node *node_struct = _starpu_get_node_struct(node);
      |                                            ^~~~~~~~~~~~~~~~~~~~~~~
      |                                            _starpu_get_worker_struct
../../src/datawizard/memalloc.c:1718:44: warning: initialization of 'struct _starpu_node *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
In file included from ../../src/datawizard/coherency.h:25,
                 from ../../src/datawizard/memory_nodes.h:24,
                 from ../../src/datawizard/memalloc.c:19:
../../src/datawizard/memalloc.c:1719:39: error: invalid use of undefined type 'struct _starpu_node'
 1719 |         _starpu_spin_lock(&node_struct->mc_lock);
      |                                       ^~
../../src/common/starpu_spinlock.h:127:28: note: in definition of macro '_starpu_spin_lock'
  127 |         __starpu_spin_lock(lock, __FILE__, __LINE__, __starpu_func__)
      |                            ^~~~
../../src/datawizard/memalloc.c:1722:28: error: invalid use of undefined type 'struct _starpu_node'
 1722 |                 node_struct->mc_clean_nb++;
      |                            ^~
In file included from ../../src/datawizard/coherency.h:25,
                 from ../../src/datawizard/memory_nodes.h:24,
                 from ../../src/datawizard/memalloc.c:19:
../../src/datawizard/memalloc.c:1725:41: error: invalid use of undefined type 'struct _starpu_node'
 1725 |         _starpu_spin_unlock(&node_struct->mc_lock);
      |                                         ^~
../../src/common/starpu_spinlock.h:131:30: note: in definition of macro '_starpu_spin_unlock'
  131 |         __starpu_spin_unlock(lock, __FILE__, __LINE__, __starpu_func__)
      |                              ^~~~

sthibaul commented 5 months ago

right, sorry, should have checked, now fixed.

Muxas commented 5 months ago

yes, it compiles now

Muxas commented 5 months ago

Thank you! I have checked it with both StarPU-1.4 and StarPU-1.3, and both versions work. I believe this issue can be closed as resolved. However, you might want to keep it open until the new functions are added to the documentation.

sthibaul commented 5 months ago

good, thanks!