tonsky / persistent-sorted-set

Fast B-tree based persistent sorted set for Clojure/Script
MIT License
81 stars 16 forks source link

Child is unexpectedly nil during stress test. #13

Open whilo opened 11 months ago

whilo commented 11 months ago

While stress testing datahike distribution I ran into this error earlier against version 0.3.0. I cannot make sense yet of where it is coming from, it only happened once so far during many load tests. The process run is external to any REPL and run as a dedicated server (transactor). I will investigate it further and put this here to keep a handle on the issue.

{:type java.lang.NullPointerException
   :message "Cannot invoke \"me.tonsky.persistent_sorted_set.ANode.store(me.tonsky.persistent_sorted_set.IStorage)\" because \"this._children[<local2>]\" is null"
   :at [me.tonsky.persistent_sorted_set.Branch store "Branch.java" 621]}]
 :trace
 [[me.tonsky.persistent_sorted_set.Branch store "Branch.java" 621]
 176].tonsky.persistent_sorted_set.PersistentSortedSet store "PersistentSortedSet.java"
  [me.tonsky.persistent_sorted_set$store invokeStatic "persistent_sorted_set.clj" 185]
  [me.tonsky.persistent_sorted_set$store invoke "persistent_sorted_set.clj" 180]
  ...

(stacktrace printing was interleaved for some reason)

There are assertions before line 621 that should blow up before. My understanding is also that nodes that are newly created (and hence don't have an address) cannot be soft references (which I use currently), so I am not sure how this child can be nil here.

Edit: Maybe one of the addresses is accidentally nulled somehow while its child is actually pointing to an already stored SoftReference.

tonsky commented 11 months ago

Is this multithreaded? Might be the issue

whilo commented 11 months ago

Yes, I hammer the server with a lot of concurrent transactions in this stress test. It is possible that it is a problem with my store implementation, but I cannot see how it would fail in exactly that line. If you have an idea that might already be helpful.

tonsky commented 11 months ago

GC might happen between asserts and 621, erasing that soft reference

whilo commented 11 months ago

Yes, that is what I thought as well. Which is very rare to happen, I guess. Thanks for confirming.

tonsky commented 11 months ago

If you can help me set up testing for this I’d appreciate it. It’s certainly a bug worth fixing

whilo commented 11 months ago

Definitely. I looked at your stress tests last night,besides not triggering storage yet they don't create that much pressure on GC. We provide HTTP bindings which create quite a lot of pressure on the GC in themselves if you have to send every transaction through REST and JSON decode it (which is the stress test I executed for 10k individual transactions). It might work to just set heap sizes small and extend the current stress tests to run multiple threads at once. Still a bug like this will only be triggered very rarely so it might make sense to run such tests for a few hours or so.

tonsky commented 11 months ago

I’ve set up this https://github.com/tonsky/datascript/blob/master/script/stress_storage.sh but wasn’t lucky yet capturing any races

whilo commented 11 months ago

I am chasing another issue atm. and will revisit this after. I haven't triggered this bug again in many hours of testing I did either though. I think it would be good to also spawn many threads and read and write at the same time in your tests. Maybe we can setup generative stress tests like that. It could also be reasonable to do stress tests from DataScript/Datahike as well as those stacks trigger particular workloads that are closer to what will happen in reality.