jvm-profiling-tools / honest-profiler

A sampling JVM profiler without the safepoint sample bias
https://github.com/RichardWarburton/honest-profiler/wiki
MIT License
1.25k stars 146 forks source link

Crash in lock free map test on Travis #224

Closed nitsanw closed 6 years ago

nitsanw commented 6 years ago

https://travis-ci.org/jvm-profiling-tools/honest-profiler/jobs/278317972#L968

ikavalio commented 6 years ago

Hi, I'm already looking for a solution, the problem is that same instances of Migration class (within the same epoch) are scheduled for a removal by GC. It will be great if we enable address sanitizer for travis builds, so it will be easier to understand the problem next time.

gauravAshok commented 6 years ago

Hi, I might have stumbled upon a related issue.
Backtrace from the core dump : https://gist.github.com/gauravAshok/c3a12b05e92f1fbc1889ad399b3f8d07 In this case, the service had 2 threads being created and destroyed every 2 min.

I have seen that test case fail few times now. I added another test case where n threads were adding and removing keys. The keys used by these threads were local to them, so there was no concurrent key access. If it helps below is the trace dump for a failed test. --------------- LFMap [LockFreeMapPrimitives::find] Item not found: 0 [LockFreeMapPrimitives::find] Item found: 601512 [LockFreeMapPrimitives::insertOrUpdate] Max number of jumps reached: 2 [LockFreeMapPrimitives::insertOrUpdate] Update bucket value: 598330 [LockFreeMapPrimitives::insertOrUpdate] Update value race detected: 0 [LockFreeMapPrimitives::insertOrUpdate] Can't allocate a cell, map is too full: 10 [LockFreeMapPrimitives::insertOrUpdate] Allocate fresh bucket: 68550 [LockFreeMapPrimitives::insertOrUpdate] Allocation race detected: 2 [LockFreeMapPrimitives::insertOrUpdate] Set bucket value: 68546 [LockFreeMapPrimitives::insertOrUpdate] Set value race detected: 0 [LockFreeMapPrimitives::insertOrUpdate] No empty cell in the neighbourhood: 47 [LockFreeMapProvider::remove] Item not found: 0 [LockFreeMapProvider::remove] Remove bucket's value: 601507 [LockFreeMapProvider::remove] Remove value race detected (giving up): 1 [LockFreeMapProvider::migrationStart#1] Conflicting migration found: 1 [LockFreeMapProvider::migrationStart#2] Conflicting migration found: 45 [Migration::run] Can't participate in job that already ended: 1 [Migration::run] Migration interrupted by end of job: 5 [Migration::run] Overflow detected during migration: 2 [Migration::run] No more blocks to migrate: 37 [Migration::run] Not the last thread: 42 [Migration::run] Overflow migration already started: 0 [Migration::migrateRange] Unallocated cell migration flag: 4169 [Migration::migrateRange] Unallocated cell insert race: 0 [Migration::migrateRange] Allocated cell value insert race: 0 [Migration::migrateRange] Allocated cell migration flag: 5921 [Migration::migrateRange] Racing erase when migrating allocated bucket: 0

RichardWarburton commented 6 years ago

Thanks for the report gents. I'll definitely look into this one, but probably won't have time until next week. Its also worth noting that the map I believe comes from Jeff Preshing - so when we do fix it, we should probably notify him and send the fix upstream as well.

ikavalio commented 6 years ago

When I was porting concurrent map from Jeff I though that I'd simplify code a bit and changed the way counters are incremented for threads that are participating in migration, but I made an unfortunate mistake instead and as a result temporary migration object could be deleted multiple times. So basically I used fetch_add(2, std::memory_order_relaxed) where more strict CAS loop was needed (as it was in the original code). After the fix everything looks good in that particular place. I'll add address sanitizer to travis ci build and fix osx build settings in it and submit a PR for review.

gauravAshok commented 6 years ago

In LockFreeMapPrimitives::insertOrUpdate(), lets say you succeed in finding the bucket with the tHash and the oldValue was 0. In that case you will proceed with cmpexchge. But this sometimes fails because meanwhile migration was started and the value has been changed to -1. May be we should check for the oldValue == -1 again after the failed exchange. Thoughts? Similar behaviour in map::remove().

ikavalio commented 6 years ago

Good catch, @gauravAshok and thanks for reviewing the code, it looks like a bug and those checks should be added indeed. Only inserts have such check as opposed to updates and deletes. I believe that happened because original map uses slightly different update/remove strategy. It definitely doesn't break anything and I think you should submit a PR for that.