rjhogan / Adept-2

Combined array and automatic differentiation library in C++
http://www.met.reading.ac.uk/clouds/adept/
Apache License 2.0

Memory usage problem on Windows with MinGW 64 #9

Closed jontio closed 5 years ago

jontio commented 5 years ago

On Windows with MinGW 64-bit I'm seeing a memory usage problem.

This will be ok...

    //memory ok
    //5MB
    aVector yy=x*2;
    for(int q=0;q<30000000;q++)
    {
        stack.new_recording();
        yy=x*2;
    }

But with this, memory usage goes through the roof. Even after the Stack goes out of scope the memory is not freed...

    //crazy memory usage and super slow
    //475MB and doesn't free even once
    //stack is out of scope
    for(int q=0;q<30000000;q++)
    {
        stack.new_recording();
        aVector yy=x*2;
    }

On Linux I had no such issue. I've tried various settings. I thought it might be something in the Packet.h file to do with the freeing of memory, but I don't understand what's going on in that file. Trying various compiler flags made no difference. Not building the library and instead using the method for non-Unix systems made no difference either. I use msys2, so my Windows system behaves more like Linux than a typical Windows setup.

Below is what Stack reported before and after running the previous code. On Linux the result from Stack was about the same, and nothing looks odd...

When stack is created

Automatic Differentiation Stack (address 0x22fa90):
   Currently attached - thread safe
   Recording status:
      Recording is ON
      0 statements (1048576 allocated) and 0 operations (1048576 allocated)
      0 gradients currently registered and a total of 1 needed (current index 0)
      Gradient list has no gaps
   Computation status:
      0 gradients assigned (0 allocated)
      Jacobian size: 0x0
      Independent indices:
      Dependent indices:  
      Parallel Jacobian calculation can use up to 2 threads
      Each thread treats 4 (in)dependent variables

After the 475MB memory thing has happened

Automatic Differentiation Stack (address 0x22fa90):
   Currently attached - thread safe
   Recording status:
      Recording is ON
      1 statements (1048576 allocated) and 1 operations (1048576 allocated)
      42 gradients currently registered and a total of 70 needed (current index 69)
      Gradient list has 1 gaps (12-38 )
   Computation status:
      0 gradients assigned (0 allocated)
      Jacobian size: 0x0
      Independent indices:
      Dependent indices:  
      Parallel Jacobian calculation can use up to 2 threads
      Each thread treats 4 (in)dependent variables

Below are the outputs of the settings functions from settings.h for one of my tests...

version 2.0.5
compiler_version g++ [7.2.0]
compiler_flags
-g1 -O3 -D__unix__ -march=native
configuration
Adept version 2.0.5:
  Compiled with g++ [7.2.0]
  Compiler flags "-g1 -O3 -D__unix__ -march=native"
  BLAS support from blas library
  Jacobians processed in blocks of size 4

have_matrix_multiplication 1

Using make check, one test always fails for me on Windows, and that's...

test_thread_safe_arrays... ./run_tests.sh: line 18:  4944 Segmentation fault      ./$TEST >> $LOG 2> $STDERR

In the test_results.txt file there is nothing under the test_thread_safe_arrays section...

########################################################
### test_thread_safe_arrays
########################################################

test_results.txt

All other tests passed.

Cheers, Jonti

rjhogan commented 5 years ago

Hi Jonti,

Looking at Packet.h, which includes the alloc_aligned and free_aligned functions, I can see that there is a bug in the case that neither _POSIX_VERSION nor _MSC_VER is defined: in this case alloc_aligned uses new[] to allocate the memory, but it is never freed! Could you please edit the free_aligned function to add the two extra lines below (marked "NEW LINE") and let me know if this helps?

Thanks,

Robin.

template <typename Type>
inline
void free_aligned(Type* data) {
  // Note that we need to use the same condition as used in
  // alloc_aligned() in order that new[] is followed by delete[]
  // and posix_memalign is followed by free
  if (Packet<Type>::alignment_bytes < sizeof(void*)) {
    delete[] data;
  }
  else {
#ifdef _POSIX_VERSION
#if _POSIX_VERSION >= 200112L
    free(data);
#else
    delete[] data;
#endif
#elif defined(_MSC_VER)
    _aligned_free(data);
#else // NEW LINE
    delete[] data; // NEW LINE
#endif
  }
}
jontio commented 5 years ago

Hi Robin,

I replaced the free_aligned function with the one you gave and I can confirm that fixes the memory usage problem. Thanks.

The test_thread_safe_arrays still fails on my Windows system so that looks like a different issue and was a bit of a red herring.

(gdb) run
Starting program: E:\git\Adept-2\test\test_thread_safe_arrays.exe
[New Thread 4996.0x1120]
[New Thread 4996.0xb54]

Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to Thread 4996.0xb54]
0x0000000000405ea4 in ?? ()
(gdb)

Cheers, Jonti

rjhogan commented 5 years ago

Hi,

Glad it worked - I've pushed to github.

Could you check that when you run test/test_thread_safe_arrays on its own you get no output?  I get the following:

Storage is not thread safe: using soft_link()
8 processors available running maximum of 8 threads
Parallel subsetting of array zillions of times was successful

If I uncomment the "#define ADEPT_STORAGE_THREAD_SAFE 1" line near the top of test_thread_safe_arrays.cpp, and recompile, I get

Storage should be thread safe
8 processors available running maximum of 8 threads
Parallel subsetting of array zillions of times was successful

Could you let me know what happens in this second case?

Thanks, Robin.

jontio commented 5 years ago

Hi Robin,

I had to add <<std::flush to the std::cout lines in test_thread_safe_arrays.cpp to see any output before the segmentation fault. I think this is due to buffering of iostreams with MinGW: if a program crashes, you don't get to see the output without std::flush. After adding that, I get the following with ADEPT_STORAGE_THREAD_SAFE not defined...

Storage is not thread safe: using soft_link()
2 processors available running maximum of 2 threads
Segmentation fault

and then with ADEPT_STORAGE_THREAD_SAFE defined...

Storage should be thread safe
2 processors available running maximum of 2 threads
Segmentation fault

Looking at where the segmentation fault happens, it's in...

// The following almost always causes a crash if the code is not
// properly thread safe
#pragma omp parallel for
for (int i = 0; i < N*1000; ++i)
{
    for (int j = 0; j < N*1000; ++j)
    {
        B[j % N] = noalias(B(__,j)) + T.diag_vector();
    }
}

The following one also causes the fault...

#pragma omp parallel for
for (int i = 0; i < N*1000; ++i)
{
    for (int j = 0; j < N*1000; ++j)
    {
        B[j % N] = noalias(B(__,j)) ;
    }
}

But this following one is ok...

#pragma omp parallel for
for (int i = 0; i < N*1000; ++i)
{
    for (int j = 0; j < N*1000; ++j)
    {
        B[j % N] =  T.diag_vector();
    }
}

So I'm not sure why the Array class is having multiple thread issues.

Although I don't think it is relevant, for some reason I can't create a shared library of Adept-2 on MinGW, only static libraries. When it tries to build a shared library, libtool says...

libtool: warning: undefined symbols not allowed in x86_64-w64-mingw32 shared libraries; building static only

I have built both shared and static versions of BLAS and LAPACK, and I can use the shared BLAS and LAPACK libraries to build a shared Armadillo library. I think it's something to do with libtool. I just mention this in case the segmentation fault only happens with static builds.

Cheers, Jonti

rjhogan commented 5 years ago

Hi,

The issue is described (a bit) in section 3.15 of the Adept-2 documentation, and arises from the standard computer science problem of using objects that control their own destruction via reference counting, but must work in a multi-threaded program. The two things you tested were two possible solutions to the problem: (1) disabling reference counting for an object that needs to be accessed by multiple threads and (2) using a reference counter that can only be modified atomically. The fact that neither works implies there is something weird with OpenMP in your environment... so I'm not sure what to suggest.
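
Roughly, the difference between the two approaches is whether the reference counter inside the storage object is a plain integer or an atomic one. A simplified sketch (the class and member names here are invented for illustration and are not the real Adept internals):

    #include <atomic>

    // (2) Thread-safe variant: increments and decrements are atomic, so two
    //     threads copying or destroying links to the same storage cannot race.
    class StorageAtomicSketch {
      std::atomic<int> ref_count_{1};
    public:
      void add_link()    { ref_count_.fetch_add(1, std::memory_order_relaxed); }
      void remove_link() {
        if (ref_count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
          delete this;   // last link gone: the object destroys itself
        }
      }
    };

    // (1) Non-thread-safe variant: a plain counter, safe only if all linking
    //     and unlinking happens from a single thread (which is what avoiding
    //     reference counting via soft_link() works around).
    class StoragePlainSketch {
      int ref_count_ = 1;
    public:
      void add_link()    { ++ref_count_; }
      void remove_link() { if (--ref_count_ == 0) delete this; }
    };

The atomic version adds a small cost to every copy of an array link, which is presumably why it is opt-in via ADEPT_STORAGE_THREAD_SAFE rather than the default.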

Regards,

Robin.


jontio commented 5 years ago

Hi Robin,

Ah OK. I tried updating my system with pacman -Su, but that ended up breaking something in BLAS/LAPACK/Armadillo, so I had to compile them from source. After that nothing changed and I still got the multi-threading problem. Something to remember, but fortunately it's not as important as the memory issue was.

The memory problem would have been a show-stopper, as my interest in reverse auto differentiation comes from wanting to explore neural networks a little, so I need to do lots of auto differentiations. For simple networks it's relatively easy to figure out by hand how the cost varies with respect to the variables, but I realized that anything I actually wanted to do was crazy difficult, and it seemed somewhat silly spending so much time figuring out a derivative. I think I'm slowly understanding how reverse auto differentiation works, but I'm not there yet; I certainly don't understand how you do it efficiently for matrices. I'm at the super simple level at the moment.

It doesn't stop me experimenting with it though. I've taught a vanilla recurrent neural network (RNN) to remember the last 2 inputs, xor them and return the result. I've used both the simplistic approach, with the derivatives I figured out by hand, and your library to do the differentiation for me. Both methods worked fine. Doing 100,000 trials took 25 s on my computer with the pen-and-paper method and 35 s with your library. The following figure shows the expected result for unseen random data entering the network for both of the training methods, along with the results coming out of the networks. As you can see, both have been trained very well...

[figure: expected vs. actual network outputs for both training methods]

I went for an RNN as you have to look back over time. That's simple to do for the by-hand method, but I was uncertain about the auto differentiation method, as the only way I could figure out how to do it was to clear the stack and refill it from scratch every time I took a time step. That seemed a waste of CPU, and I would rather have trimmed a little off the bottom of the stack and added a little to the top each time step. However, I'm not sure if that's possible, or whether it would increase speed by much, if any. Still, the auto differentiation was a lot faster than I was anticipating and I am very happy with it. Next I'm going to experiment with multiple layers of GRUs, as I can do that with auto differentiation, but by hand that really would be horrid.
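
Roughly, the pattern I ended up with looks like the following. This is a stripped-down scalar sketch rather than the attached code; cell(), the window length, the input values and the toy cost are just placeholders:

    #include <adept.h>
    using adept::adouble;

    // Placeholder stand-in for one RNN node; not the real cell.
    adouble cell(const adouble& h_prev, const adouble& u, double x_in) {
      return tanh(u * h_prev + x_in);
    }

    int main() {
      adept::Stack stack;
      adouble u = 0.5;           // the parameter being trained
      const int window = 4;      // length of the truncated window
      const double alpha = 0.01; // step size

      for (int step = 0; step < 1000; ++step) {
        // Clear the stack and re-record the whole window from scratch
        stack.new_recording();
        adouble h = 0.0;         // state carried in from before the window
        for (int k = 0; k < window; ++k) {
          h = cell(h, u, 1.0);   // placeholder input
        }
        adouble J = (h - 1.0) * (h - 1.0);  // toy cost

        // Reverse pass to get dJ/du
        J.set_gradient(1.0);
        stack.compute_adjoint();
        double dJ_du = u.get_gradient();

        // Steepest-descent update, done outside what is differentiated
        u = u.value() - alpha * dJ_du;
      }
      return 0;
    }

The full RNN does the same thing with vectors and matrices, but the re-recording of the whole window each step is the part I'd like to avoid.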

Attached is the source code for the RNN I talked about if you are interested...

qttest2.zip

Cheers, Jonti

rjhogan commented 5 years ago

Hi Jonti,

I'm glad it's at least working in single-threaded mode for your application.  Adept arose from a minimization problem where I was spending so much time debugging my hand-coded adjoints that it became more time-efficient for me to develop Adept than to continue doing it all by hand.

Just to clarify: in Adept, automatic differentiation of matrix multiplication is not implemented efficiently in terms of memory. Normal matrix multiplication is O(N^2) in memory and O(N^3) in operations for NxN matrices. In principle, reverse differentiation of this operation can be implemented to be O(N^2) in memory by storing the matrices on the stack and then doing something different in the reverse pass, but I'd have to re-engineer Adept's stack to be much more flexible. I'd like to do this at some point, but at the moment, to differentiate a matrix multiplication, Adept breaks it up into N^3 scalar multiplications, which results in O(N^3) memory usage. It also doesn't use BLAS when the matrices must be differentiated, which is slower.
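
For reference, the reason O(N^2) memory is possible in principle is that the adjoints of a matrix product have closed matrix forms; these are the standard reverse-mode identities rather than anything Adept-specific:

    C = A B, \qquad
    \bar{A} = \bar{C}\, B^{\mathsf{T}}, \qquad
    \bar{B} = A^{\mathsf{T}}\, \bar{C}

so the reverse pass only needs A and B themselves (O(N^2) values), whereas recording each of the N^3 scalar multiply-adds individually costs O(N^3) memory.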

I'm not an expert on NNs or RNNs... can I check that the nature of your "look back in time" problem is as follows?

aVector x;

for (int i = 0; i < ntimesteps; ++i) {

    // ... complex operations that step "x" forward in time

    if (i % 100 == 0) { // Do something every 100th timestep

        // ... Compute the derivative of x(current timestep) with
        //     respect to x(500 timesteps ago)

    }

}

This could be done without re-doing the forward pass by storing the relevant values at intermediate timesteps and then computing the derivative of one set of intermediate values with respect to another. However, Adept cannot currently run the reverse pass through only part of the stack. It would be pretty trivial to add this capability, but I would like more detail on the application (ideally with simplified but working code that shows the bare bones of this pattern).

Regards,

Robin.

jontio commented 5 years ago

Hi Robin,

I am very much a newcomer to neural networks and I am making many mistakes on the way so I could still have things wrong.

I'm not quite sure your pseudocode means the same thing I mean when I talk about looking back in time.

I tried writing the equations down, but I think they were more confusing than beneficial. So instead I put together these images, which hopefully explain the issue and what I implemented in code...

[figures: computation graphs for the first step and the second step]

Everything in one of the rectangles, together with the calculation of J, belongs to one calculation that needs to go through Adept. The final equation is for updating U and doesn't go through Adept. J is the cost, alpha is the step size, and y-hat is the vector you would like y to be. Everything is a vector or a matrix apart from J and alpha. The circles can be called nodes and just perform some calculations on their inputs. The two rectangles are the first step and the second step respectively. This is what I currently have implemented in code.
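
(In symbols, the update referred to is the usual fixed-step steepest-descent step, written here from the description above rather than copied from the figure:

    U \leftarrow U - \alpha \, \frac{\partial J}{\partial U}

with J computed by the recorded calculation and the derivative obtained from Adept's reverse pass.)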

You can see that in going from the first step to the second step I have removed the first node and added a new node to the end. The h input to the second rectangle comes from the h output of the first node of the first step. After drawing these pictures I realize that U and the h's still change from one step to the next, so that would mean updating all sorts of parts of the stack, and it now doesn't look as simple as removing one node and adding another, as I had thought.

However all may not be lost as the following should also be fine...

[figures: computation graphs with the bottom node removed and a new node added to the top]

This is what I meant by removing the bottom node and adding a new node to the top. In this case nothing much changes apart from the top and the bottom.

Trying this out in code just now, using the same settings as the previous tests, gives an average cost of 0.00014 after 100,000 steps and took 48 seconds. So that looks good, assuming I implemented it correctly.

It's like calculating a derivative over a sliding window.

I hope that makes sense.

Cheers, Jonti

rjhogan commented 5 years ago

Hi Jonti,

Sorry, I did write a reply but I've just noticed it seems not to have appeared on github and it's not in my sent email.

In your second case, I'm sure you don't need to take the derivative of all the nodes in the rectangle - if you only want to optimize the elements of the final node (U3) then you should be able to formulate it so that you only run that final node with Adept to differentiate it. This presumes that once a node (e.g. U2) is defined, it is not changed in subsequent steps...

There is something that looks rather ill-posed about what you're trying to do though. Firstly, you are applying the steepest-descent formula for a fixed step size, but only applying it once before moving to the next node. So you're not really minimizing J. It would seem to be better to run it for several iterations on the same node to really have a go at minimizing J before moving to the next node.

Secondly, steepest descent is the least sophisticated of the numerous gradient-based minimization algorithms. If your problem is small (fewer than around 100 independent variables in each U node) then you can use the Gauss-Newton or Levenberg-Marquardt algorithms. For larger problems, you may do better with a quasi-Newton minimizer such as LBFGS (test/test_gsl_interface uses the GSL implementation of LBFGS - see also test/README in Adept), although note that LBFGS should be better than steepest descent for any size of problem.

Cheers, Robin
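
For reference, the Gauss-Newton update mentioned above has the standard textbook form, written with R for the Jacobian of the residual vector r to avoid clashing with the cost J:

    J = \tfrac{1}{2} \lVert r(U) \rVert^2, \qquad
    R = \frac{\partial r}{\partial U}, \qquad
    U \leftarrow U - (R^{\mathsf{T}} R)^{-1} R^{\mathsf{T}} r

which replaces the fixed-step gradient step with a solve against the approximate Hessian R^T R; Levenberg-Marquardt adds a damping term to that matrix.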

jontio commented 5 years ago

Hi Robin,

It's been a while since I worked on this problem, so I'm a little rusty. Yes, I agree: looking back over what I wrote, in the second case I can't see any reason why treating h3 as constant and using Adept to find a good value for U4 would work, so I'm not sure what I was thinking there. Maybe it is just that simple; I'll have to have another look and try it out to see whether it works if I just auto-differentiate each node one at a time.

Interesting what you said about running several iterations before moving on to the next step; I don't think I had considered that. I will try that too and see what happens.

I hadn't heard of LBFGS before; very interesting. As far as I know, steepest descent is what is usually used for neural networks. The number of independent variables in U can be huge, into the millions, so it's a horrible problem. My understanding of LBFGS is zero at the moment, but I had a look and found https://www.reddit.com/r/MachineLearning/comments/4bys6n/lbfgs_and_neural_nets/ where someone asked why it isn't used for neural networks very often. So people do occasionally use it, but as far as I know all modern programs like TensorFlow use auto differentiation and steepest descent at their core. Still, I'll definitely look into LBFGS, if for no other reason than that I know nothing about it at the moment. Thanks.

Cheers, Jonti

rjhogan commented 5 years ago

Hi Jonti, Conjugate gradient is another minimization strategy similar to L-BFGS, and is an option in GSL as well. Anyway, since I've fixed the original bug I'll close this thread now, but feel free to email me directly with anything further on this topic. Regards, Robin.