Open tjad opened 4 years ago
Initially I was using 1.5.1; I'm now using 1.3.1, with the same results.
From what I can see, sometimes the value is corrupted, and sometimes the key is corrupted through imperative_invoke.
It seems that keys/values are arriving at imperative_invoke correctly, and hence are being corrupted in the C binding.
Note that this happens at some random later stage, after values have been set several times.
The images show 590 writes to this array using the same method, made one by one in a loop.
Input:
Error:
Imperative invoke in operation delegator : https://github.com/mrkn/mxnet.rb/blob/edc85f60baa51cab0b9bcecd307e1fefaacbcbf4/lib/mxnet/ndarray/operation_delegator.rb#L99
It is very likely that this is caused by Ruby's garbage collector. NDArray operations are non-blocking, so the method can return, leaving its variables eligible for garbage collection, before mxnet has finished executing those operations.
One way to deal with this is to copy the string buffers rather than passing Ruby's internal string buffers around. This, however, incurs an unwanted performance hit. To keep that performance, we would need to stop the memory from being flagged for garbage collection.
https://github.com/mrkn/mxnet.rb/blob/master/ext/mxnet/libmxnet.c#L158
@mrkn Please confirm
Making a copy of the memory won't work without introducing manual garbage collection anyway.
Perhaps we could create a pre-garbage-collection hook that awaits all write ops on NDArray instances.
Premature garbage collection is definitely the cause of the memory corruption problems. I've added a wait_on_read as a workaround after all my writes. This is obviously not great for performance. I'm looking into adding NDArray tracking for the GC so that it can efficiently await all writes to the NDArrays before collecting garbage.
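To make the cost of that workaround concrete, here is a minimal sketch in plain Ruby. FakeAsyncArray and write_and_wait are hypothetical names invented for illustration; the class only models an engine that applies writes later on a worker thread and is not the MXNet API. The wait_on_read name is taken from the comment above and has not been checked against the gem.

```ruby
# FakeAsyncArray is a stand-in, NOT the MXNet API: writes are queued
# and applied later by a worker thread, loosely modelling an async engine.
class FakeAsyncArray
  def initialize(size)
    @data = Array.new(size, 0.0)
    @queue = Queue.new
    @pending = 0
    @lock = Mutex.new
    @idle = ConditionVariable.new
    @worker = Thread.new { drain }
  end

  # Non-blocking write: only enqueues the assignment.
  def []=(index, value)
    @lock.synchronize { @pending += 1 }
    @queue << [index, value]
  end

  # Block until every queued write has been applied.
  # (Named after the call mentioned in the comment above.)
  def wait_on_read
    @lock.synchronize { @idle.wait(@lock) while @pending > 0 }
    self
  end

  def to_a
    wait_on_read
    @data.dup
  end

  private

  # Worker loop: apply queued writes one at a time.
  def drain
    loop do
      index, value = @queue.pop
      @data[index] = value
      @lock.synchronize do
        @pending -= 1
        @idle.broadcast if @pending.zero?
      end
    end
  end
end

# The workaround: block after each write, so the written value is still
# referenced by the caller while the write is being applied.
def write_and_wait(array, index, value)
  array[index] = value
  array.wait_on_read
end

arr = FakeAsyncArray.new(4)
write_and_wait(arr, 2, 3.5)
p arr.to_a # => [0.0, 0.0, 3.5, 0.0]
```

Every write now pays for a full round trip through the queue, which is why this is only a stopgap.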
@tjad Could you show your code to reproduce this problem?
I looked at your commit https://github.com/Yobisense/mxnet.rb/commit/7a002892c2c608e860d8612aab107feb0c8813dc.
Calling rb_gc_mark after rb_str_tmp_new is meaningless because rb_gc_mark should be called during the marking process of garbage collection.
Could you show your code to reproduce this problem?
I suspected that string objects allocated at this line may be collected by GC. Did you specify keys by symbols in your code?
Neither keys_str nor values_str needs to be marked explicitly, because they are on the stack.
We need to add an array object to keep the string objects created by rb_sym_to_s.
@mrkn It's quite involved, actually. You are correct that we shouldn't need to mark the objects, as they are on the stack. However, they are passed to MXNet asynchronously, so the method exits and its stack frame is popped along with its variables; the GC will then no longer mark these objects and will collect them if there are no other strong references. This happens when the MXNet engine is not processing tasks/instructions fast enough and the Ruby application is creating objects frequently, causing frequent minor garbage collections. The memory passed to imperative_invoke may be either
a.) Moved (GC is allowed to)
b.) Garbage collected.
To truly solve this problem, we'd have to make use of the MXNet callbacks for the best reliability. We would have to keep an array of strong references to all the memory being passed to mxnet, and clear the references from the array when the operations are complete. At least this is what I'm seeing.
This can be done efficiently by means of a reference pool that can expand as required.
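A minimal sketch of such a reference pool, in plain Ruby. ReferencePool, pin and release are hypothetical names that do not exist in mxnet.rb, and how a completion notification would actually reach release is left open. The sketch only addresses collection, not the GC moving objects.

```ruby
# Hypothetical sketch: none of these names exist in mxnet.rb.
class ReferencePool
  def initialize
    @pinned = {}       # token => objects kept alive for one operation
    @next_token = 0
    @lock = Mutex.new
  end

  # Keep +objects+ strongly referenced until release(token) is called.
  # Returns the token that identifies this group of references.
  def pin(*objects)
    @lock.synchronize do
      token = (@next_token += 1)
      @pinned[token] = objects
      token
    end
  end

  # Drop the references for one operation. Intended to be called once
  # the asynchronous operation reports completion.
  def release(token)
    @lock.synchronize { @pinned.delete(token) }
  end

  # Number of operations whose buffers are still pinned.
  def size
    @lock.synchronize { @pinned.size }
  end
end

pool = ReferencePool.new
keys = [:axis.to_s, :keepdims.to_s]
vals = ["1", "true"]
token = pool.pin(keys, vals)
# ... hand the buffers to the asynchronous operation here ...
pool.release(token) # once the operation has completed
```

The hash grows and shrinks with the number of in-flight operations, which is the "expand as required" behaviour. If the completion callback fires on a native thread, a real implementation would presumably need to get back onto a Ruby thread before touching the pool.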
Manually marking anything on the stack will only help to some extent: once the GC has run two cycles, the memory will be collected even after manual marking.
I will try to put together a test case. Hopefully the intuition above is sound enough to get the thinking going.
I think generally ruby applications keep data in memory (hence have a strong reference which stops the GC from clearing that data). In my application specifically, I'm running realtime data which is not stored in the heap whatsoever, so that memory may be cleared when GC runs as they are stack values without strong references. I am writing those values directly to NDArray, using slice assignment. It's A LOT of operations that likely fill up MXNet's task queue, and then whilst doing heavy operations (calculations), the assignments keep piling up and don't get processed quickly enough.
There are some changes I can make in my application to avoid some of this, but ultimately it's a problem waiting to happen. It would be great to resolve this elegantly, in a way that means we generally don't have to think about memory management.
@mrkn I tried marking keys_str and vals_str as I ran out of ideas, and just started marking everything. Ideally, params_keys and params_vals would be marked after all elements are set, to make use of the GC's recursive marking behaviour. Of course, this doesn't help, as per the above: too many GC cycles pass before mxnet can process all operations.
As an aside (not sure how relevant this is): when using jemalloc, I never receive an error like 2.); I only receive error 3.), dangling references.
When using standard malloc, I get 2.), corrupt memory.
I have not been able to successfully run TCMalloc with google perftools.
Hence, currently I am using jemalloc. I noticed that in mxnet 1.6.x jemalloc was removed as a default and not recommended in default use. (for whatever reason)
And yes @mrkn , my keys are symbols.
Reproducible failure
require 'mxnet'

def create_weak_ref_object_for_gc
  Object.new
end

@nd_array = MXNet::NDArray.array([[1,2,3,4,5,6,7,8,9,10], [1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]])

def create_mxnet_assignment_task
  @nd_array[Random.rand(4), Random.rand(10)] = Random.rand # use random float to avoid internal cached ruby objects
end

def create_mxnet_operations_task
  @nd_array.sum
end

j = 0
while true
  for i in 1..1000
    create_weak_ref_object_for_gc
  end
  for i in 1..100
    create_mxnet_assignment_task
    for i in 1..10
      create_mxnet_operations_task
    end
  end
  puts "Queued #{j}"
  j+=1
end
Success Code
require 'mxnet'

@nd_array = MXNet::NDArray.array([[1,2,3,4,5,6,7,8,9,10], [1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]])
@random = (1..100).map {Random.rand}

def create_mxnet_assignment_task
  @nd_array[Random.rand(4), Random.rand(10)] = @random[Random.rand(100)]
end

def create_mxnet_operations_task
  @nd_array.sum
end

j = 0
while true
  create_mxnet_assignment_task
  for i in 1..10
    create_mxnet_operations_task
  end
  puts "Queued #{j}"
  j+=1
end
@mrkn Please find my example/test above. Notice that the failure code creates numerous objects so that the GC is invoked quickly.
Please let me know if this is reproducible in your environment. It will be appreciated, thank you.
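Another way to get the GC invoked often, sketched here as a suggestion rather than something tried in this thread, is CRuby's GC.stress setting, which makes the collector run at every opportunity. The helper name with_gc_stress is made up for illustration, and whether it makes this particular failure appear sooner is an assumption.

```ruby
# Run the block with GC.stress enabled, then restore the previous setting.
# GC.stress is a CRuby feature; it slows execution down considerably.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

# Allocation-heavy stand-in for the body of a reproduction loop.
allocated = with_gc_stress do
  Array.new(100) { Object.new }.size
end
puts allocated
```

Wrapping only the assignment calls of a reproduction in such a block would keep the rest of the script at normal speed.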
There are 2 errors that I noticed, one seems to be an internal mxnet error which gives a full memory dump. I assume this is due to the huge amount of memory used when the queue is overloaded.
But the one in question for mxnet.rb is the dangling reference, as per 3.) in the initial post.
@mrkn It turns out that my patch (marking the objects) does resolve this problem. When I was testing, my environment was not using my patched version. Making a pull request.
@toddsundsted I noticed you were quite involved with some code related to GC . Perhaps you'd have some insight around this issue and/or others relating to GC.
Using Ruby 3.x, I find that my above failure sample does not crash.
I am, however, experiencing different strange errors under high workload: frequent object creation in Ruby and tasks queuing in mxnet. MXNet::NDArray.sum and the instance method my_nd_array_instance.sum appear to use different execution paths, as I get different results for the same nd_array instance. Both results are incorrect. By this I mean that when I take the elements (8 of them) of the NDArray instance, put them into irb and execute both sum methods, they return a third result, which is consistent between the methods and very different from the in-application results.
I'm not opening a separate issue for this as I believe all these errors are from the same source. I may be wrong. It's difficult to reproduce this environment exactly.
@mrkn @toddsundsted Could either of you please review the code that exposes an stype getter and advise? It's an attempt at #51.
The code looks clean to me and changes no existing functionality. However, once this code is added, NDArray breaks: initialization fills the array with random values when using .zeros, .ones, .array etc., and it is also no longer possible to make assignments to the array correctly. It is as if the array handle is bad.
https://github.com/Yobisense/mxnet.rb/commit/0e1fab708566a5429689e95d34017cd3830c2971
You will notice there are no code changes, only additions. The closest thing to a code change is the addition of a local variable, mStorageType.
To reproduce, check out the code and run in IRB
require 'mxnet'
a = MXNet::NDArray.ones(5)
[-1.80243e+08, 4.57118e-41, -1.80256e+08, 4.57118e-41, 0] <MXNet::NDArray 5 @cpu(0)>
Whichever initializer I use, the result is similar.
@tjad I'll look at it tomorrow.
Thank you in advance!
@tjad Why don't you create a new pull request in this repo?
@mrkn I will do that.
I'm new to community collaboration - and would expect that the code has to be proven working before making pull request.
There are some other branches with work I need to make PR for too. Those do not have tests yet. If you accept the state of them, I could do a PR also. (#49)
UPDATE: Tests for #49 are coming. Plans to do project wide test phase toward end of completion of gluon api features (#50)
@tjad Please create a new pull-request. I recognize we've already communicated and had the same goal to do the migration of this repo to the upstream.
@tjad And, you also can create WIP-state pull-requests and can mention me to get some information of this repo anytime on these pull-requests.
Awesome! Thank you :-)
I'm receiving some strange errors when using this gem in conjunction with eventmachine.
I'm rewriting values into a single NDArray instance (reused). Up to 140 writes per second.
Randomly my event loop crashed from mxnet segfault errors. I'm going to push a list of errors in this thread.
These errors appear regardless of my context, so for simplicity, assume these are in CPU context.
1.)
malloc_consolidate(): invalid chunk size
2.) In this error, there are sometimes unicode characters present for attributes as opposed to blank
3.) basic_string::_M_construct null not valid