Open whilo opened 11 months ago
Is this multithreaded? Might be the issue
Yes, I hammer the server with a lot of concurrent transactions in this stress test. It is possible that it is a problem with my store implementation, but I cannot see how it would fail in exactly that line. If you have an idea that might already be helpful.
GC might happen between asserts and 621, erasing that soft reference
Yes, that is what I thought as well. Which is very rare to happen, I guess. Thanks for confirming.
If you can help me set up testing for this I’d appreciate it. It’s certainly a bug worth fixing
Definitely. I looked at your stress tests last night,besides not triggering storage yet they don't create that much pressure on GC. We provide HTTP bindings which create quite a lot of pressure on the GC in themselves if you have to send every transaction through REST and JSON decode it (which is the stress test I executed for 10k individual transactions). It might work to just set heap sizes small and extend the current stress tests to run multiple threads at once. Still a bug like this will only be triggered very rarely so it might make sense to run such tests for a few hours or so.
I’ve set up this https://github.com/tonsky/datascript/blob/master/script/stress_storage.sh but wasn’t lucky yet capturing any races
I am chasing another issue atm. and will revisit this after. I haven't triggered this bug again in many hours of testing I did either though. I think it would be good to also spawn many threads and read and write at the same time in your tests. Maybe we can setup generative stress tests like that. It could also be reasonable to do stress tests from DataScript/Datahike as well as those stacks trigger particular workloads that are closer to what will happen in reality.
While stress testing datahike distribution I ran into this error earlier against version
0.3.0
. I cannot make sense yet of where it is coming from, it only happened once so far during many load tests. The process run is external to any REPL and run as a dedicated server (transactor). I will investigate it further and put this here to keep a handle on the issue.(stacktrace printing was interleaved for some reason)
There are assertions before line
621
that should blow up before. My understanding is also that nodes that are newly created (and hence don't have an address) cannot be soft references (which I use currently), so I am not sure how this child can be nil here.Edit: Maybe one of the addresses is accidentally nulled somehow while its child is actually pointing to an already stored
SoftReference
.