intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

failed to allocate memory when using malloc_device #3220

Closed SimonWang9610 closed 6 days ago

SimonWang9610 commented 3 years ago
template<typename T>
class Linear {
    private:
       T* weight;
       T* input;
       T* result;
       T* bias;
       T* dz;
       const int M;
       const int N;
       const int K;
    public:
       Linear(T* x, T* r, int m, int n, int k, queue& Q): input(x), result(r), M(m), N(n), K(k) {
           weight = malloc_device<T>(M * N, Q);
           bias = malloc_device<T>(M, Q);
           dz = malloc_device<T>(N * K, Q);
       }
    ...

The pointers passed in are allocated as x = malloc_device<T>(N * K, Q); and r = malloc_device<T>(M * K, Q);. In my code, I create several Linear instances one after another. All of them allocate weight and bias successfully. However, only the last Linear instance allocates dz successfully; for the others, the dz allocation fails and returns 0 (dz == nullptr is true). I use dz to store a temporary result in each Linear.

Furthermore, the same thing happens if I change my code and move dz into a member function of Linear, like below:

T* update(T* diff, queue& Q) {
    T* dz = malloc_device<T>(N * K, Q); // also fails to allocate
    T* dw = malloc_device<T>(M * N, Q); // but dw always allocates successfully
    /* events here */
    ...
    free(dw, Q);
    return dz;
}

I call update in a for loop:

T* diff = inputs.back(); // all elements in inputs are allocated by malloc_device
for (auto linear = layers.rbegin(); linear != layers.rend(); linear++) {
    diff = linear->update(diff, Q);
 }
free(diff, Q);

If I change the above code to:

void update(T* diff, queue& Q) {
    T* dz = malloc_device<T>(N * K, Q); // also fails to allocate
    T* dw = malloc_device<T>(M * N, Q); // but dw always allocates successfully
    /* events here */
    ...
    free(dw, Q);
    free(diff, Q);
    Q.memcpy(diff, dz, N * K * sizeof(T)).wait();
    free(dz, Q);
}

and then call it as:

T* diff = inputs.back(); // all elements in inputs are allocated by malloc_device
for (auto linear = layers.rbegin(); linear != layers.rend(); linear++) {
    linear->update(diff, Q);
 }
free(diff, Q);

In all of these cases, dz is allocated successfully only for the last Linear; the others get 0 (nullptr).

This problem has driven me crazy! I hope someone can explain why it happens. Thanks! I tested this code on Windows 10 with an Intel CPU using the oneAPI toolkit, and on Ubuntu 18.04 with an Nvidia GPU. Both gave me the same errors.
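Since malloc_device reports failure by returning nullptr rather than throwing, a failed allocation like the dz one above is easy to miss until the pointer is used. A minimal sketch of a check that fails loudly is shown below; `require_allocated` is a hypothetical helper, not part of SYCL, and it is written in standard C++ so it does not depend on a particular SYCL toolchain.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of SYCL): turns a null allocation result
// into an exception that names the buffer and the requested element count.
template <typename T>
T* require_allocated(T* ptr, const std::string& name, std::size_t count) {
    if (ptr == nullptr) {
        throw std::runtime_error("allocation of '" + name + "' failed (" +
                                 std::to_string(count) + " elements of " +
                                 std::to_string(sizeof(T)) + " bytes each)");
    }
    return ptr;
}

// Intended use inside the Linear constructor (assumes a sycl::queue Q):
//   dz = require_allocated(malloc_device<T>(N * K, Q), "dz", N * K);
```

Wrapping each allocation this way would show which buffer failed and how large the request was at the point of failure, instead of a 0 address showing up later in a printout.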

romanovvlad commented 3 years ago

@SimonWang9610 Could you please provide a short reproducer?

SimonWang9610 commented 3 years ago

@SimonWang9610 Could you please provide a short reproducer?

Yeah. In my older solution, no matter how I created dz, malloc_device always returned 0. I printed every member of Linear with std::cout; dz was always 0, while the other members had distinct addresses, as shown below (here dz and dw are members of Linear):

[0]: weight: 000002B9B9F40C00, bias: 000002B9B9F40C80, dz: 0000000000000000, dw: 000002B9B9F40D80
[1]: weight: 000002B9B9F40E00, bias: 000002B9B9F40F00, dz: 0000000000000000, dw: 000002B9B9F40F80
[2]: weight: 000002B9B9F41080, bias: 000002B9B9F41100, dz: 000002B9B9F41200, dw: 000002B9B9F41280

Today I created a new solution in VS 2019 with the same code, and it works: dz is allocated correctly and its address is no longer 0. However, I am still confused about why the old solution failed. I also found that the program can still access the elements of v = malloc_device<T>(SIZE, Q) after free(v, Q), which may be down to my limited understanding of USM in SYCL. I will provide more details to help you fix this issue if I find anything new.
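On the observation about reading memory after free(v, Q): using a pointer after it has been freed is undefined behaviour, so the access may appear to work without being valid. One way to make that mistake harder is to tie the release to an owning object. The sketch below uses std::unique_ptr with a custom deleter; `UsmDeleter`, `UsmPtr`, and `adopt` are hypothetical names for illustration, and the free function is passed in as a callable so the sketch compiles as standard C++.

```cpp
#include <memory>
#include <utility>

// Hypothetical deleter: stores a callable that releases the memory.
// With SYCL the callable could be something like
//   [&Q](void* p) { sycl::free(p, Q); }
template <typename FreeFn>
struct UsmDeleter {
    FreeFn free_fn;

    template <typename T>
    void operator()(T* p) const { free_fn(p); }
};

template <typename T, typename FreeFn>
using UsmPtr = std::unique_ptr<T, UsmDeleter<FreeFn>>;

// Takes ownership of `raw`; the memory is released exactly once, when the
// returned owner goes out of scope. A null `raw` is never passed to free_fn.
template <typename T, typename FreeFn>
UsmPtr<T, FreeFn> adopt(T* raw, FreeFn free_fn) {
    return UsmPtr<T, FreeFn>(raw, UsmDeleter<FreeFn>{std::move(free_fn)});
}
```

With an owner like this there is no separate free call to get out of order with later uses, and the owner also converts to false when the allocation returned nullptr.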

SimonWang9610 commented 3 years ago

@SimonWang9610 Could you please provide a short reproducer?

I found what makes the allocation fail, although I do not understand why. In my old solution, Linear has a member bool end, like this:

template<typename T>
class Linear {
    private:
       T* weight;
       T* input;
       T* result;
       T* bias;
       T* dz;
       const int M;
       const int N;
       const int K; 
       bool end;
    public:
       Linear(T* x, T* r, int m, int n, int k, bool e, queue& Q): input(x), result(r), M(m), N(n), K(k), end(e) {
           weight = malloc_device<T>(M * N, Q);
           bias = malloc_device<T>(M, Q);
           dz = malloc_device<T>(N * K, Q);
       }
    ...

When I delete end from Linear, all allocations work correctly. If I change bool end to const int end in Linear, it also works. I hope this helps.

romanovvlad commented 3 years ago

@SimonWang9610 Thanks, but could you please provide a small self-contained .cpp file which shows the problem so we can compile and run it on our side?

SimonWang9610 commented 3 years ago

@romanovvlad my code. I don't know how to attach files here, so I posted the link above. All of the code behind that link compiles and runs as I expect on Ubuntu 18.04 with an Nvidia GPU. (I failed to configure the CPU device on Ubuntu, so currently I can only use the GPU.)

I am sure the allocation goes wrong because I declare bool end and use it in SYCL kernel functions. If I change it to int or const int, the problem does not occur.

I also ran into another problem on Windows. When I run the code from the link above on Windows with an Intel CPU, the dz, dw and diff pointers (allocated by malloc_device) always show undefined behaviour: they point to values I do not expect.

In other words, exactly the same code (from the link above) gives different results on Ubuntu 18.04 with an Nvidia GPU (works well) and on Windows with an Intel CPU using the Intel base toolkit (works incorrectly).
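The thread does not establish why the bool end member matters. One general C++ point that may be relevant, offered only as a possibility: a lambda defined in a member function that reads a data member captures `this`, and in a SYCL kernel that is a host pointer the device may not be able to dereference. A common way to avoid this is to copy the members into locals before building the lambda. The sketch below shows that pattern in standard C++; `Layer` and `make_kernel` are hypothetical stand-ins, and std::function plays the role of a kernel body.

```cpp
#include <functional>

// Simplified, hypothetical stand-in for the Linear class in this thread;
// only the members needed for the illustration are kept.
class Layer {
    int N;
    bool end;

public:
    Layer(int n, bool e) : N(n), end(e) {}

    // Returns a callable that plays the role of a kernel body.
    // Members are copied into locals first, so the lambda captures plain
    // values and never touches `this`.
    std::function<int(int)> make_kernel() const {
        const int n = N;
        const bool is_end = end;
        return [n, is_end](int i) { return is_end ? i : i + n; };
    }
};
```

Because the callable holds its own copies of n and is_end, it stays valid even after the Layer that created it is gone, which is the property a device kernel needs.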

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

pthorali commented 1 month ago

Hi Team, I am facing the same error on an ARC A770. I have attached test.zip, which contains a test.cpp file that reproduces the issue. Could you please advise on how to fix it, given that sycl::malloc_device raises no exception? test.zip

Thank you, Preethi Raksha

pthorali commented 3 weeks ago

Response received via https://github.com/intel/llvm/issues/15691. I am unable to close this issue.

0x12CC commented 3 weeks ago

@SimonWang9610, the allocation function you're using returns a nullptr if there are not enough resources to allocate the requested memory. If you're using the L0 adapter, you can use this SYCL extension to query the available memory. Alternatively, you can try reducing the size of your allocations. Could you please verify if this solves your issue?
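To judge whether reducing the allocation size is worth trying, it can help to estimate how much device memory the layers request. The sketch below adds up the three buffers allocated in the Linear constructor from the first comment (weight, bias, dz) and compares the total against a byte budget. `linear_device_bytes` and `layers_fit` are hypothetical helpers; the budget is an assumption supplied by the caller, for example the device's total memory from `sycl::info::device::global_mem_size`, which is only an upper bound on what is actually free.

```cpp
#include <cstddef>

// Bytes requested by one Linear<T>(M, N, K) as written in the first comment:
// weight (M * N), bias (M), and dz (N * K) elements.
// The arithmetic is done in std::size_t so large dimensions do not overflow int.
template <typename T>
std::size_t linear_device_bytes(int M, int N, int K) {
    const std::size_t m = static_cast<std::size_t>(M);
    const std::size_t n = static_cast<std::size_t>(N);
    const std::size_t k = static_cast<std::size_t>(K);
    return (m * n + m + n * k) * sizeof(T);
}

// Hypothetical check: do `layer_count` identical layers fit in a byte budget?
// With SYCL the budget could come from, for example,
//   Q.get_device().get_info<sycl::info::device::global_mem_size>()
template <typename T>
bool layers_fit(std::size_t layer_count, int M, int N, int K,
                std::size_t budget_bytes) {
    return layer_count * linear_device_bytes<T>(M, N, K) <= budget_bytes;
}
```

For example, `linear_device_bytes<float>(2, 3, 4)` covers 6 + 2 + 12 = 20 elements, so with a 4-byte float it is 80 bytes; real layers would be compared against the device's budget in the same way.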

AlexeySachkov commented 6 days ago

I could not reproduce the original issue reported here, so I will close this tracker for now. If any of the issues persist, please open a new tracker and we will take a look.