Segfaults | Memory corruption errors

tjad commented 4 years ago

I'm receiving some strange errors when using this gem in conjunction with eventmachine.

I'm rewriting values into a single NDArray instance (reused). Up to 140 writes per second.

Randomly my event loop crashed from mxnet segfault errors. I'm going to push a list of errors in this thread.

These errors appear regardless of my context, so for simplicity, assume these are in CPU context.

1.)
malloc_consolidate(): invalid chunk size

2.) In this error, there are sometimes unicode characters present for attributes as opposed to blank

ruby/2.6.0/bundler/gems/mxnet.rb-edc85f60baa5/lib/mxnet/ndarray/operation_delegator.rb:91:in `imperative_invoke': MXNetError: Cannot find argument '', Possible Arguments: (MXNet::Error)
----------------
scalar : double, optional, default=0
    The scalar value for assignment.
begin : Shape(tuple), required
    starting indices for the slice operation, supports negative indices.
end : Shape(tuple), required
    ending indices for the slice operation, supports negative indices.
step : Shape(tuple), optional, default=[]
    step for the slice operation, supports negative values.
, in operator _slice_assign_scalar(name="", scalar="267.55", step="[1, 1]", end="[181, 1]", ="[180, 0]")

3.) basic_string::_M_construct null not valid

tjad commented 4 years ago

Initially was using 1.5.1 , now using 1.3.1 same results mem corruption

tjad commented 4 years ago

From what I can see, sometimes the value is corrupted, and sometimes the key is corrupted through imperative_invoke.

It seems that key/values are arriving at imperative_invoke correctly, and hence being corrupted in the C binding.

Did I mention, this happens at some random later stage after setting values several times.

tjad commented 4 years ago

By these images you will see that 590 writes to this array using the same method, one by one in an iterative manner.

Input: imperative_invoke_input

Error: imperative_invoke_error

Imperative invoke in operation delegator : https://github.com/mrkn/mxnet.rb/blob/edc85f60baa51cab0b9bcecd307e1fefaacbcbf4/lib/mxnet/ndarray/operation_delegator.rb#L99 imperative_invoke

tjad commented 4 years ago

It is very likely that this is caused by Ruby's garbage collector. This happens as NDArray has non-blocking manipulations, hence the method can exit flagging its variables for garbage collection before mxnet has finished the execution of the manipulations.

One way to deal with this is to make a copy of the string buffers as opposed to passing the internal ruby string buffers around. This however incurs the unwanted performance hit. We will need to stop marking the memory for garbage collection in order to gain that performance.

https://github.com/mrkn/mxnet.rb/blob/master/ext/mxnet/libmxnet.c#L158

@mrkn Please confirm

tjad commented 4 years ago

Making a copy of the memory wont work without introducing manual garbage collection anyway.

Perhaps creating a pre garbage collector hook that awaits all write ops on NDArray instances

tjad commented 4 years ago

Premature garbage collection is definitely the cause of memory corruption problems. I've added a wait_on_read as a work around after all my writes. This is obviously not great for performance. I'm looking into placing NDArray tracking for the GC so that it can efficiently await all writes of the NDArrays before collecting garbage.

mrkn commented 4 years ago

@tjad Could you show your code to reproduce this problem?

I looked at your commit https://github.com/Yobisense/mxnet.rb/commit/7a002892c2c608e860d8612aab107feb0c8813dc. Calling rb_gc_mark after rb_str_tmp_new is meaningless because rb_gc_mark should be called during the marking process of garbage collection.

Could you show your code to reproduce this problem?

I suspected that string objects allocated at this line may be collected by GC. Did you specify keys by symbols in your code?

mrkn commented 4 years ago

Both keys_str and values_str needn't be marked explicitly because they are in the stack. We need to add an array object to keep string objects created by rb_sym_to_s.

tjad commented 4 years ago

@mrkn It's quite involved actually. You are correct in that we shouldn't need to mark the objects as they are on the stack, however, they are passed to MXNet in async, hence method exits and stack is popped including its variables, GC will not mark these and collect if there are no other strong references. This happens when the MXNet engine is not processing tasks/instructions fast enough, and the ruby application is creating objects frequently, causing frequent minor garbage collections. The memory passed to imperative_invoke may be either a.) Moved (GC is allowed to) b.) Garbage collected.

To truly solve this problem, we'd have to make use of the MXNet callbacks for the best reliablity. We would have to keep an array of strong references to all those memory being passed to mxnet and clear the references from the array when the operations are complete. At least this is what I'm seeing.

This can be done efficiently by means of a reference pool that can expand as required

Manually marking anything in the stack will only help to some extent. When the GC passes 2 cycles even after manual marking the memory will be garbage collected.

tjad commented 4 years ago

I will try put together a test case. Hopefully the intuition above is sound though to get thinking going.

I think generally ruby applications keep data in memory (hence have a strong reference which stops the GC from clearing that data). In my application specifically, I'm running realtime data which is not stored in the heap whatsoever, so that memory may be cleared when GC runs as they are stack values without strong references. I am writing those values directly to NDArray, using slice assignment. It's A LOT of operations that likely fill up MXNet's task queue, and then whilst doing heavy operations (calculations), the assignments keep piling up and don't get processed quickly enough.

There is some change I can make in my application to avoid some of this, but ultimately it's a problem waiting to happen. It would be great to resolve this elegantly through a manner that we generally don't have to think about memory management.

tjad commented 4 years ago

@mrkn I tried marking keys_str and vals_str as I ran out of ideas, and just started marking everything. Ideally, params_keys and params_vals would be marked after all elements are set, to make use of GC's recursive marking behaviour. Of course, this doesn't help as per above - too many GC cycles before mxnet can process all operations.

tjad commented 4 years ago

As an aside (not sure how relevant this is). When using jemalloc, I never received an error like in 2.) I only receive error 3.) Dangling references

When using standard malloc, I get 2.) corrupt memory

I have not been able to successfully run TCMalloc with google perftools.

Hence, currently I am using jemalloc. I noticed that in mxnet 1.6.x jemalloc was removed as a default and not recommended in default use. (for whatever reason)

tjad commented 4 years ago

And yes @mrkn , my keys are symbols.

tjad commented 4 years ago

Reproducable failure

require 'mxnet'

def create_weak_ref_object_for_gc
    Object.new
end

@nd_array = MXNet::NDArray.array([[1,2,3,4,5,6,7,8,9,10], [1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]])    

def create_mxnet_assignment_task
    @nd_array[Random.rand(4), Random.rand(10)] = Random.rand #use random float to avoid internal cached ruby objects
end

def create_mxnet_operations_task
    @nd_array.sum
end

j = 0
while true
    for i in 1..1000
        create_weak_ref_object_for_gc
    end
    for i in 1..100
        create_mxnet_assignment_task
        for i in 1..10
            create_mxnet_operations_task
        end
    end

    puts "Queued #{j}"
    j+=1
end

Success Code

require 'mxnet'

@nd_array = MXNet::NDArray.array([[1,2,3,4,5,6,7,8,9,10], [1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]])    

@random = (1..100).map {Random.rand}

def create_mxnet_assignment_task
    @nd_array[Random.rand(4), Random.rand(10)] = @random[Random.rand(100)] 
end

def create_mxnet_operations_task
    @nd_array.sum
end

j = 0
while true
    create_mxnet_assignment_task
    for i in 1..10
        create_mxnet_operations_task
    end

    puts "Queued #{j}"
    j+=1
end

tjad commented 4 years ago

@mrkn Above please find my example/ test . Notice that the failure code has numerous objects created to have the GC invoked quickly.

Let me know if this is reproducible on your environment please. Will be appreciated, thank you.

tjad commented 4 years ago

There are 2 errors that I noticed, one seems to be an internal mxnet error which gives a full memory dump. I assume this is due to the huge amount of memory used when the queue is overloaded.

But the one in question for mxnet.rb is a dangling reference as per 3.) in inital post

tjad commented 4 years ago

@mrkn It turns out that my patch (marking the objects) does resolve this problem. When I was testing, my environment was not using my patched version. Making a pull request.

tjad commented 3 years ago

@toddsundsted I noticed you were quite involved with some code related to GC . Perhaps you'd have some insight around this issue and/or others relating to GC.

Using Ruby 3.x, I find that my above failure sample does not crash.

I am however experiencing different strange errors under high workload demand - frequent object creation in ruby and tasks queuing in mxnet. MXNet::NDArray.sum and instance method my_nd_array_instance.sum appear to use different execution paths as I retrieve different results for the same nd_array instance. These results are incorrect. By this I mean I take the elements (8 of them) of the nd array instance, pop them into my irb and execute both sum methods where they return a third result which is consistent between the methods, and very different from the in application results.

2mths 2meths inf

I'm not opening a separate issue for this as I believe all these errors are from the same source. I may be wrong. It's difficult to reproduce this environment exactly.

tjad commented 3 years ago

@mrkn @toddsundsted Could either of you please do a review of the code that expose an stype getter and advise. It's an attempt for #51 The code looks clean to me, changing no existing functionality. However, upon addition of this code, the functionality of NDArray breaks such that initialization of NDArray fills with random values when using .zeros, .ones, .array etc, it is also no longer possible to make assignments to the array correctly. It is as if the array handle is bad. https://github.com/Yobisense/mxnet.rb/commit/0e1fab708566a5429689e95d34017cd3830c2971

You will notice there are no code changes, only additions. The closest code change is addition of a local variable mStorageType.

https://github.com/Yobisense/mxnet.rb/commit/0e1fab708566a5429689e95d34017cd3830c2971#diff-59f905089f1263b9eb6c2ffcdc239960ca69924dee3a3f862bd95d383338b5cfR864

To reproduce, checkout code and run in IRB

require 'mxnet'

a = MXNet::NDArray.ones(5)

[-1.80243e+08, 4.57118e-41, -1.80256e+08, 4.57118e-41, 0] <MXNet::NDArray 5 @cpu(0)>

Whichever initializer I use, the result is similar.

mrkn commented 3 years ago

@tjad I'll look at it tomorrow.

tjad commented 3 years ago

Thank you in advance!

mrkn commented 3 years ago

@tjad Why don't you create a new pull request in this repo?

tjad commented 3 years ago

@mrkn I will do that.

I'm new to community collaboration - and would expect that the code has to be proven working before making pull request.

There are some other branches with work I need to make PR for too. Those do not have tests yet. If you accept the state of them, I could do a PR also. (#49)

UPDATE: Tests for #49 are coming. Plans to do project wide test phase toward end of completion of gluon api features (#50)

mrkn commented 3 years ago

@tjad Please create a new pull-request. I recognize we've already communicated and had the same goal to do the migration of this repo to the upstream.

mrkn commented 3 years ago

@tjad And, you also can create WIP-state pull-requests and can mention me to get some information of this repo anytime on these pull-requests.

tjad commented 3 years ago

Awesome! Thank you :-)

mrkn / mxnet.rb

Segfaults | Memory corruption errors #45