starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

How to replace "malloc " in C language with "starpu_malloc"? #32

Closed WwwwwYyyy closed 6 months ago

WwwwwYyyy commented 10 months ago

Dear Professor,

void *pangulu_malloc(int_t size){
    CPU_MEMORY += size;
    void *malloc_address=NULL;
    malloc_address=(void *)malloc(size);
    if(malloc_address==NULL){
        printf("error ------------ don't have cpu memory\n");
    }
    return malloc_address;
}

The above is the original code. I now want to use starpu_malloc to replace the malloc call on the fourth line. My change is as follows:

void *pangulu_malloc(int_t size){

    CPU_MEMORY += size;
    void *malloc_address=NULL;
    starpu_malloc((void**)&malloc_address, size);
    if(malloc_address==NULL){
        printf("error ------------ don't have cpu memory\n");
    }

    //return malloc_address;
}

But this produces a “free(): invalid pointer” error. How should I change it correctly?

nfurmento commented 10 months ago

Hello,

We would need more information on where and how the error happens. But I guess you just did not replace the call to free() with starpu_free().

Nathalie
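
The suggestion above amounts to keeping allocation and release on the same allocator. A minimal sketch, under stated assumptions: `pangulu_free` and the `USE_STARPU` build flag are hypothetical names that do not appear in the thread, `size_t` stands in for the project's `int_t`, and with `USE_STARPU` defined StarPU is assumed to be initialized already. Without the flag the code falls back to plain malloc/free so it can be compiled on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_STARPU               /* hypothetical build flag, not from the thread */
#include <starpu.h>
#endif

static size_t CPU_MEMORY = 0;   /* running total, as in the original wrapper */

/* Allocate size bytes through a single backend; returns NULL on failure. */
void *pangulu_malloc(size_t size)
{
    void *addr = NULL;
#ifdef USE_STARPU
    /* starpu_malloc() returns 0 on success and stores the pointer in addr. */
    if (starpu_malloc(&addr, size) != 0)
        addr = NULL;
#else
    addr = malloc(size);
#endif
    if (addr == NULL) {
        fprintf(stderr, "error: out of CPU memory\n");
        return NULL;
    }
    CPU_MEMORY += size;
    return addr;
}

/* Release memory obtained from pangulu_malloc() with the matching call. */
void pangulu_free(void *addr)
{
    if (addr == NULL)
        return;
#ifdef USE_STARPU
    starpu_free(addr);          /* never plain free() on a starpu_malloc() pointer */
#else
    free(addr);
#endif
}
```

Routing every release through one wrapper means a later change of allocator touches a single place, instead of every free() call in the program.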

WwwwwYyyy commented 10 months ago

Thank you! I have solved the problem above, but a new problem has arisen. It occurs when I run a 150000 matrix, but there is no problem running a 15000 matrix. Why is this?

[image: b68c7a3766c75d7251ab041f6929cac]

nfurmento commented 10 months ago

Because you are allocating too much memory and your code has accessed memory outside the allocated regions... It does not seem to be a StarPU-related problem. Maybe try to run your CUDA code without StarPU and fix it before running it with StarPU.

WwwwwYyyy commented 10 months ago

Now that the above error has been resolved, the three images below show how we allocate memory, how we register the data, and the warnings that appear. The performance is very low; may I ask why?

image image image

nfurmento commented 10 months ago

It is impossible to answer your question without more information. So, as I advised you before, please read https://files.inria.fr/starpu/doc/html/OfflinePerformanceTools.html#OfflinePerformanceFeedback to find out how to analyze the performance of a program.

Also, please refrain from providing your code as an image; copy and paste the relevant lines into the body of your message.

sthibaul commented 10 months ago

@WwwwwYyyy your starpu_memory_pin call is wrong: you need to pass it the pointer, not the handle. Also, starpu_malloc does not allocate on the gpu, it allocates in the cpu memory (and does the pinning, so it's useless to call starpu_memory_pin after that)

sthibaul commented 10 months ago

So your registration is probably wrong as well: since what starpu_malloc() allocates is CPU memory, if you register that as a CUDA pointer (passing the CUDA node to starpu_vector_data_register), you get a back-and-forth between the CPU and the GPU: CUDA makes the CPU pointer "transparently" usable on the GPU, but slowly.

Also, as Nathalie said, better use tracing tools to check what is happening, to determine where performance is lost.

WwwwwYyyy commented 9 months ago

Thank you! I have a question. After my modifications, I monitored the program with htop and found that, although I launch only 4 processes, far more than four are displayed. This is my run command:

"STARPU_SCHED=dmda STARPU_NCUDA=4 STARPU_NCPU=1 STARPU_WORKERS_NOBIND=1 OPENBLAS_NUM_THREADS=1 mpirun -n 4 ./program -NB 200 -P 2 -Q 2 -F /root/matrix/cz10228.mtx"

I have set STARPU_NCPU=1. May I ask why this is the case?

WwwwwYyyy commented 9 months ago

@sthibaul Is there something wrong with my runtime instruction?

WwwwwYyyy commented 9 months ago

[image: 06c8203e663a28d12732b016ef65c99]

That is what htop shows when I am running four processes.

sthibaul commented 9 months ago

You are telling MPI to run 4 instances of StarPU, and telling each instance to use 4 GPUs, so that needs 16 threads in total to drive the GPUs, plus the 4 main threads, plus the CUDA threads, etc.

Why are you using mpirun? Can't you just run starpu without mpi? It will drive all the gpus of the system just fine, moving data around as needed.
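
The count described above can be written out as arithmetic. A rough sketch, assuming one driver thread per CUDA worker and one per CPU worker, plus the main thread of each process; the helper name `min_starpu_threads` is hypothetical, and CUDA's own internal threads are not counted, so the number htop shows will be higher.

```c
#include <assert.h>

/*
 * Lower bound on the threads created by nprocs StarPU processes,
 * each configured with ncuda CUDA workers and ncpu CPU workers.
 */
unsigned min_starpu_threads(unsigned nprocs, unsigned ncuda, unsigned ncpu)
{
    assert(nprocs > 0);
    /* Per process: the worker threads plus one main thread. */
    return nprocs * (ncuda + ncpu + 1u);
}
```

With the command line from the question (`mpirun -n 4`, `STARPU_NCUDA=4`, `STARPU_NCPU=1`) this gives 4 * (4 + 1 + 1) = 24: 16 GPU driver threads, 4 CPU workers and 4 main threads.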

WwwwwYyyy commented 9 months ago

Because I'm running the program in a distributed environment. Since I used "starpu_mpi_init_conf" to initialize StarPU, I used mpirun to start the program. But the performance isn't as good as programs that don't use StarPU, and I don't know why. And I adjusted STARPU_NCUDA to 1 and there was no change in performance.

sthibaul commented 9 months ago

> Because I'm running the program in a distributed environment

This is not getting distributed, your MPI invocation is putting all processes on the same node...

> the performance isn't as good as programs that don't use StarPU

I guess that programs that don't use StarPU don't occupy all GPUs & cores of the machine like StarPU does.

> I adjusted STARPU_NCUDA to 1 and there was no change in performance.

But if you really (do you really??) want to use MPI to run several instances of StarPU on the same node (and really, I have to insist there is really little probability that it's actually what you want to do), you have to tell each instance of StarPU which GPU it should be using. Otherwise they'll just all use the same GPU, and thus compete for using it.

But again, I STRONGLY believe that running 4 StarPU processes on the same node is not what you want to do. If you have only one node, just run one instance of StarPU. I.e. without mpi or with -np 1.
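
For the discouraged case of several StarPU instances on one node, one possible way to tell each instance which GPU to use is sketched below. It is only an illustration under assumptions: it relies on Open MPI exporting `OMPI_COMM_WORLD_LOCAL_RANK` (other MPI launchers use different variables), on StarPU's `STARPU_NCUDA` and `STARPU_WORKERS_CUDAID` environment variables, and on being called before StarPU is initialized. The function name `select_gpu_for_local_rank` is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Bind this process to one CUDA device chosen from its node-local MPI rank.
 * Call before initializing StarPU, which reads these variables at init time.
 * Returns the chosen device id, or -1 if the environment could not be set.
 */
int select_gpu_for_local_rank(int ngpus_on_node)
{
    assert(ngpus_on_node > 0);

    /* Assumption: Open MPI, which exports the node-local rank here. */
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = (s != NULL) ? atoi(s) : 0;   /* not under mpirun: rank 0 */
    if (local_rank < 0)
        local_rank = 0;

    /* Spread the local ranks round-robin over the node's GPUs. */
    int gpu = local_rank % ngpus_on_node;
    char buf[16];
    snprintf(buf, sizeof buf, "%d", gpu);

    /* One CUDA worker per process, bound to the selected device. */
    if (setenv("STARPU_NCUDA", "1", 1) != 0)
        return -1;
    if (setenv("STARPU_WORKERS_CUDAID", buf, 1) != 0)
        return -1;
    return gpu;
}
```

This only stops the instances from competing for the same device. It does not change the advice above: on a single node, one StarPU instance driving all GPUs remains the simpler setup.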

WwwwwYyyy commented 9 months ago

Sorry I didn't make myself clear before. I will end up testing program performance on a distributed environment, but only on one node at the moment. Earlier you said that starpu_malloc allocates space on main memory, I wanted to ask if there's any other way to allocate space directly on the GPU besides it?

sthibaul commented 9 months ago

> a distributed environment, but only on one node at the moment

OK, but on a single node you cannot hope for acceleration: the computing power will not increase by using several MPI processes. On the contrary, you will end up with oversubscription.

Really, on a single node you want -np 1

> Earlier you said that starpu_malloc allocates space on main memory, I wanted to ask if there's any other way to allocate space directly on the GPU besides it?

You can use starpu_malloc_on_node_flags to allocate on a given node, but usually you don't need to do that: your data is typically generated by the CPU or read from disk, so you get it in RAM first, and you should rather let StarPU do the transfer for you because it will be able to pipeline transfers etc. Also, when using several GPUs, you don't usually want to have to specify which GPU to use, and you'd rather let the scheduler decide for you.