This is a known issue and the unit tests demonstrate the failure. I have not had time to actively work on this project, and a few months back I was having trouble understanding the scenario that caused it. When I have been emailed regarding production use, which the front page warns against, I have recommended alternative approaches. I'm sorry, but I wasn't aware the Cassandra project was attempting to use it or else I would have warned you too!
There are a few excellent choices:
1) Use a soft-reference map, such as the one from Google Collections, and allow the GC to manage the memory. Soft references are reclaimed in LRU order. (A minimal sketch follows this list.)
2) Use a ReentrantLock to guard the list mutator operations. With SecondChanceFifo this provides excellent performance, as reads remain lock-free. With LRU there is always contention on the tail anyway, so the loss won't be noticeable. This is a simple and safe change. It also provides a much smaller critical section than locking a LinkedHashMap, so the necessary performance gain can be realized. I can provide an implementation if desired.
3) Use a standard algorithm! I chose to explore for personal education rather than adopt a known version, because this is not work related and I wanted to learn rather than copy. If it were work related then I would have adopted the algorithm described in:
http://www.cs.chalmers.se/~dcs/ConcurrentDataStructures/phd_chap7.pdf
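For option 1, a minimal sketch assuming the Google Collections MapMaker API (the class and key names here are just illustrative):

import com.google.common.collect.MapMaker;
import java.util.concurrent.ConcurrentMap;

// Sketch of option 1: values are held by soft references, so the GC
// reclaims them under memory pressure, approximately in LRU order.
public class SoftValueCacheExample {
    public static void main(String[] args) {
        ConcurrentMap<String, byte[]> cache = new MapMaker()
            .softValues()
            .makeMap();
        cache.put("page-1", new byte[8192]);
        byte[] page = cache.get("page-1"); // may be null if already reclaimed
        System.out.println(page == null ? "reclaimed" : "cached");
    }
}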
Doug Lea has expressed interest in writing a standard version in a future release of java.util.concurrent, which will likely be based on his ConcurrentLinkedDeque (available for, but not adopted in, JDK6). Personally, I am looking forward to seeing how he attacks the challenge, as it's an interesting problem.
Please let me know if you'd like me to provide any further help. Since I have been asked 3 times this week, I can try to get back to this project and resolve the race. You can contact me here, on the group, or through private email (ben.manes-at-gmail.com).
Best,
Ben
Original comment by Ben.Manes@gmail.com
on 1 Sep 2009 at 5:55
Hi Ben, Thanks for the quick response. I work with Chris. I have attached a rough approximation of the patch, as you mentioned in #2. I'm not sure if the locks are at too coarse a level, or if it is necessary to hold the lock in evict, but it seems to work. The unit tests weren't passing for me before and now they consistently pass.
Cheers,
Sammy
Original comment by temi...@gmail.com
on 1 Sep 2009 at 6:27
Attachments:
Hi Sammy,
This was not what I meant, because the critical section includes all of the map operations. Your change is almost equivalent to wrapping a LinkedHashMap with a lock (or synchronized, on JDK6). That is the standard approach almost everyone uses, because most often the miss penalty (a database hit) far outweighs the hit penalty (lock contention).
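For reference, that standard approach looks roughly like the sketch below (illustrative only, not code from this project):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// A LinkedHashMap in access order behind a single lock, so every read
// and write contends on that one lock.
public class SynchronizedLruCache {
    public static <K, V> Map<K, V> create(final int capacity) {
        return Collections.synchronizedMap(
            new LinkedHashMap<K, V>(capacity, 0.75f, /* accessOrder */ true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > capacity;
                }
            });
    }
}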
Instead, only the Node<K, V> operations must use a global lock so that the list is maintained in a valid state. This provides lock splitting between the list and the map, and the ConcurrentHashMap has excellent locking performance (lock-free reads, striped locks for writes). Because the critical section is small, the ReentrantLock will probably spin and not need to promote itself to a full lock under normal usage patterns. Since an LRU always creates contention on the tail, this will be similar to a lock-free version in performance comparisons. It will also continue to allow a SecondChanceFifo to use lock-free reads (SC-FIFO is a great choice under most common workloads).
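To make the difference concrete, here is a minimal sketch of the lock splitting described above; the names (Node, moveToTail, and so on) are illustrative rather than the project's actual code:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// The ConcurrentHashMap keeps lock-free reads and striped write locks;
// one ReentrantLock guards only the doubly-linked list that records the
// eviction order.
class LockSplitLruCache<K, V> {
    static final class Node<K, V> {
        final K key;
        final V value;
        Node<K, V> prev, next; // guarded by listLock
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final ConcurrentMap<K, Node<K, V>> map =
        new ConcurrentHashMap<K, Node<K, V>>();
    private final ReentrantLock listLock = new ReentrantLock();
    private final int capacity;
    private Node<K, V> head, tail; // guarded by listLock

    LockSplitLruCache(int capacity) { this.capacity = capacity; }

    public V get(K key) {
        Node<K, V> node = map.get(key); // lock-free read of the map
        if (node == null) {
            return null;
        }
        listLock.lock(); // small critical section: list maintenance only
        try {
            if (isLinked(node)) {
                moveToTail(node);
            }
        } finally {
            listLock.unlock();
        }
        return node.value;
    }

    public void put(K key, V value) {
        Node<K, V> node = new Node<K, V>(key, value);
        Node<K, V> prior = map.put(key, node);
        listLock.lock();
        try {
            if (prior != null && isLinked(prior)) {
                unlink(prior);
            }
            appendToTail(node);
            if (map.size() > capacity) { // evict the least recently used
                Node<K, V> victim = head;
                unlink(victim);
                map.remove(victim.key, victim);
            }
        } finally {
            listLock.unlock();
        }
    }

    // The helpers below assume listLock is held.
    private boolean isLinked(Node<K, V> n) {
        return (n.prev != null) || (n.next != null) || (head == n);
    }

    private void unlink(Node<K, V> n) {
        if (n.prev != null) { n.prev.next = n.next; } else { head = n.next; }
        if (n.next != null) { n.next.prev = n.prev; } else { tail = n.prev; }
        n.prev = n.next = null;
    }

    private void appendToTail(Node<K, V> n) {
        n.prev = tail;
        if (tail != null) { tail.next = n; } else { head = n; }
        tail = n;
    }

    private void moveToTail(Node<K, V> n) {
        if (n != tail) {
            unlink(n);
            appendToTail(n);
        }
    }
}

Note that only the list pointers are touched under the lock; the map itself never blocks readers.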
I will make the necessary changes tomorrow evening, post a JAR, and explain the code on a wiki page. I will keep the version in the repository as is, though, to allow further experimentation for when I have some time. The fully functional JAR should allow people to get the necessary performance, without that nasty surprise in production!
Best,
Ben
Original comment by Ben.Manes@gmail.com
on 1 Sep 2009 at 7:09
I see clhm-production.jar did get uploaded on the 4th. May I suggest that the right thing to do is make a tag or branch of the buggy version for experimentation, apply the fix to trunk, and close this issue so it's clear that the issue is resolved?
Original comment by jbel...@gmail.com
on 14 Sep 2009 at 6:49
Ben, you mentioned that this was a known issue and that there are unit tests that point this out.
As I'm currently investigating whether or not this bug affects my project, I would really appreciate it if you could point out where and when this race condition occurs. Which unit tests reproduce this problem?
Is the problem solved now? I'm a bit curious, since the bug is still open in the tracker.
Thanks for doing a great job.
/Lars Gråmark
Original comment by lars.gra...@gmail.com
on 19 Oct 2009 at 3:36
lars, the clhm-production.jar Ben posted does fix the bug in our (Cassandra) testing.
Original comment by jbel...@gmail.com
on 19 Oct 2009 at 3:37
Ok, good to know. The question still remains how this can be reproduced.
Thanks!
/Lars
Original comment by lars.gra...@gmail.com
on 20 Oct 2009 at 7:32
Hi Lars,
Sorry I haven't gotten a chance to get back to this bug; I have just been utterly swamped. However, I will try to become active again and follow jbellis' excellent suggestion regarding branching to clarify things. (Thanks, jbellis!)
I think I have figured out the scenario causing the bug, but I need to determine how best to resolve it. It can be reproduced if you download the code in /trunk and execute the unit tests. The jar contains a special build using code posted on the wiki, which passes the unit tests. Once I have applied jbellis' suggestion, that functional version will be in /branch.
The reason this bug occurs is very complicated to explain, but in short the state of the linked list becomes corrupt. Since no thread can then make progress, the result is a live-lock (all threads spinning). This will not happen in the "production" jar, which uses locks efficiently rather than a pure CAS-based algorithm. The "production" jar passes the load testing and should satisfy your performance needs.
Thanks!
Ben
Original comment by Ben.Manes@gmail.com
on 20 Oct 2009 at 7:50
No worries. I ran the unit tests and they do fail as you mentioned. I also ran similar tests to determine if the version tagged as 20080918 had the same problems. It appears that it does NOT have these concurrency problems.
Since we're using the 20080918 version and it seems to work fine, I guess we'll stay with that version a little longer.
/Lars
Original comment by lars.gra...@gmail.com
on 20 Oct 2009 at 12:44
Can you please use "clhm-production.jar", available in the download section, instead? Thanks!
Original comment by Ben.Manes@gmail.com
on 20 Oct 2009 at 4:45
Perhaps, but I'm a bit hesitant, since then I would be forced to run a very long test suite if I replace this core component. I'll just stick with the old version now that I know that it works fine and is not affected by this bug.
/Lars
Original comment by lars.gra...@gmail.com
on 21 Oct 2009 at 7:27
FYI,
I have a total rewrite that is passing all the tests, uses an LRU policy, and is very fast! With ConcurrentHashMap as the ideal case, I get 0.96x per read and 0.87x per write performance. It scales very well as more threads are added. Early benchmarks are:
# per read
ConcurrentLinkedHashMap: 0.85us
ConcurrentHashMap: 0.82us
LinkedHashMap: 19.31us
# per write
ConcurrentLinkedHashMap: 2.49us
ConcurrentHashMap: 2.17us
LinkedHashMap: 22.58us
I plan on closing this bug as "WontFix" once I release this implementation.
Cheers,
Ben
Original comment by Ben.Manes@gmail.com
on 27 Mar 2010 at 9:24
Great, I look forward to seeing it.
Original comment by jbel...@gmail.com
on 27 Mar 2010 at 1:00
Checked in. I'd appreciate a code review. ;)
Original comment by Ben.Manes@gmail.com
on 30 Mar 2010 at 5:39
[deleted comment]
what is LIRS, and how do the numbers in comment #12 compare with the code from the [now] old "production" jar?
Original comment by jbel...@gmail.com
on 13 Apr 2010 at 3:42
I haven't benchmarked against the old "production" jar. If in LRU mode, the new version is far better, as there is no lock contention. If in SecondChanceFifo mode, I would expect reads to be equivalent and writes to be better. The production version doesn't do much on reads but set a per-entry volatile flag. However, on writes it may need to flush the FIFO queue if so many entries have been read that it cannot find a suitable victim. That may occur on large caches where many entries were read, so the penalty could be quite high. On small caches it would be acceptable. The result is that writes may have a high penalty and the policy may degrade to a FIFO under certain workloads.
This new version is a true LRU, and the penalty for flushing the buffers is very cheap. It should have equal read performance, cheap writes, a better hit rate, and most importantly no degradation or performance-suffering scenarios. It's all-around better, imho. If Cassandra is using the production version for large caches, I would expect more reliable behavior with the new version.
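As a rough illustration of the buffering idea (this is not the actual CLHM internals; the names are hypothetical), reads enqueue a cheap task and the buffer is drained under a tryLock, so no reader ever blocks on the list:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Record reads in a concurrent buffer and replay them against the LRU
// chain under a tryLock, so the reordering cost is amortized.
final class ReadBuffer<E> {
    private final Queue<E> buffer = new ConcurrentLinkedQueue<E>();
    private final ReentrantLock drainLock = new ReentrantLock();

    void recordRead(E node) {
        buffer.add(node); // cheap and contention-friendly
        tryDrain();
    }

    private void tryDrain() {
        if (drainLock.tryLock()) { // losers skip; a later call will drain
            try {
                E node;
                while ((node = buffer.poll()) != null) {
                    applyToLruChain(node);
                }
            } finally {
                drainLock.unlock();
            }
        }
    }

    private void applyToLruChain(E node) {
        // hypothetical hook: move 'node' to the MRU end of the chain
    }
}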
LIRS is a page replacement policy that combines frequency with recency and beats LRU's hit rate. Most alternatives only work well under certain scenarios, but LIRS performs well under a variety of them. It can also be implemented in O(1) time like LRU, so it has little overhead, but it is similarly not concurrency-friendly. The design of the new CLHM should resolve the multi-threading issue for it, like it does for LRU. In my synthetic benchmark of a prototype it equals or beats LRU (5%+ with a shoddy workload). It performs about equal to IBM's ARC policy, but doesn't have the patent issues.
See:
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.437
- http://people.cs.vt.edu/~butta/docs/sigmetrics05_kernelPrefetch.pdf
Original comment by Ben.Manes@gmail.com
on 13 Apr 2010 at 4:03
Cool, we'll have a look at the new LRU code.
If trunk is at a stable point, can you tag a release so we don't have to track the svn revision we're testing manually? Because that's a pain. :)
Original comment by jbel...@gmail.com
on 13 Apr 2010 at 4:16
Sure, will do. I'm used to Perforce so give me a bit to figure SVN out. ;)
Original comment by Ben.Manes@gmail.com
on 13 Apr 2010 at 4:29
tag/release-1.0-LRU
Have fun! :)
Original comment by Ben.Manes@gmail.com
on 13 Apr 2010 at 4:39
thanks!
Original comment by jbel...@gmail.com
on 13 Apr 2010 at 4:44
Original issue reported on code.google.com by chrisgof...@gmail.com
on 1 Sep 2009 at 5:23