Nothing conclusive, but I used some specific flags, increased the number of iterations, and changed the retrieve line to print a single "." (so I can find things among the verbose output).
Findings:
- A number of the early timeouts correlate closely to classloader activity
- Later timeouts tend to come in groups, leading me to believe it's some external factor
- As reported, the timeouts don't correlate well to GC activity
- As the run continues, the timeouts tend to reach a steady state, at a lower rate than during the ramp-up of the run
It's looking a lot like this is more about external factors that keep the application from being scheduled.
I may be able to get some more data with some lower level tools.
Original comment by ingen...@gmail.com on 9 Mar 2010 at 9:23
Well, I should be sleeping, but I think I've solved this one.
A quick-ish dtrace script to count the number of times a spymemcached thread is kicked off the CPU for more than 4.5ms shows a really strong correlation with the count from within your program.
Across 1,000,000 iterations
From your program:
Timeout count is 218
From DTrace:
Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211} 197
The latter one there is the thread name and the count.
Interestingly, the threads either do their thing right away or they get starved, as you can see in the histogram of time off CPU below. By the way, I counted anything over 4.5ms here, which is why you see so many just under 5.
Memcached IO over {MemcachedConnection to localhost/127.0.0.1:11211}
value   ------------- Distribution ------------- count
< 5     |@@@@@@@@@@@@@@@@@                       85
5 - 99  |                                        0    (every bucket from 5 through 99 is empty)
>= 100  |@@@@@@@@@@@@@@@@@@@@@@@                 112
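The dtrace script itself is in the attachment below. For anyone without DTrace, a rough JVM-side approximation of the same measurement (a hiccup-meter style watchdog, not the script actually used here) is a thread that sleeps for a short fixed interval and counts how often it wakes up much later than requested. The 4.5ms threshold mirrors the one above; everything else in this sketch is an assumption.

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Minimal "hiccup meter": sleeps for 1ms in a loop and counts how often the
 * actual elapsed time exceeds a stall threshold (4.5ms here, to mirror the
 * DTrace threshold above). Large excursions mean this thread was kept off
 * the CPU, for whatever reason, much longer than the requested sleep.
 */
public class HiccupMeter implements Runnable {
    private static final long SLEEP_NANOS = TimeUnit.MILLISECONDS.toNanos(1);
    private static final long STALL_NANOS = 4_500_000L; // 4.5ms

    private final AtomicLong stalls = new AtomicLong();

    public long stallCount() {
        return stalls.get();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                TimeUnit.NANOSECONDS.sleep(SLEEP_NANOS);
            } catch (InterruptedException e) {
                return;
            }
            long elapsed = System.nanoTime() - start;
            if (elapsed - SLEEP_NANOS > STALL_NANOS) {
                stalls.incrementAndGet();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HiccupMeter meter = new HiccupMeter();
        Thread t = new Thread(meter, "hiccup-meter");
        t.setDaemon(true);
        t.start();
        Thread.sleep(10_000); // observe for 10 seconds
        System.out.println("Stalls > 4.5ms: " + meter.stallCount());
    }
}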
The solutions to this are:
- Rewrite spymemcached to use Java real-time; make everyone buy it or run Solaris to get it for free
- Tune the scheduler time quanta for your OS to ensure some of those threads are less starved; I know a bit about how to do it on Solaris/OpenSolaris, but many other platforms may require lots of surgery
- Buy more CPUs
- Turn up your timeout value (see the sketch after this comment)
I recommend going for one of the last two options. :)
Original comment by ingen...@gmail.com on 9 Mar 2010 at 11:03
Attachments:
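On the last option: spymemcached exposes the operation timeout through its connection factory. A minimal sketch, assuming the net.spy.memcached ConnectionFactoryBuilder API and a memcached instance on localhost:11211; the 10-second value is only an example.

import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.MemcachedClient;

public class LongerTimeoutExample {
    public static void main(String[] args) throws Exception {
        // Raise the operation timeout so that occasional scheduling stalls
        // don't surface as timeouts.
        MemcachedClient client = new MemcachedClient(
            new ConnectionFactoryBuilder()
                .setOpTimeout(10000)              // 10s, in milliseconds
                .build(),
            AddrUtil.getAddresses("localhost:11211"));

        client.set("greeting", 3600, "hello");    // key, expiry (s), value
        Object value = client.get("greeting");    // uses the 10s op timeout
        System.out.println(value);

        client.shutdown();
    }
}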
Well, after sleeping on this, one thing which could be done is limiting the number of outstanding operations. I would have to look into the details here to see what the best way to do this is. Still, if a thread is kept off the CPU for longer than the timeout interval, the operation will time out.
Original comment by ingen...@gmail.com on 9 Mar 2010 at 9:31
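One application-side way to cap outstanding work, without any change to spymemcached itself, is to keep a bounded window of in-flight futures and harvest the oldest one before issuing more. This is a hypothetical sketch, not how the client library does or would do it; the window size, key pattern, and per-operation timeout are all assumptions.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class BoundedOutstandingOps {
    private static final int MAX_IN_FLIGHT = 100;   // cap on outstanding gets
    private static final long OP_TIMEOUT_MS = 1000;

    public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(AddrUtil.getAddresses("localhost:11211"));

        Deque<Future<Object>> inFlight = new ArrayDeque<>();
        int timeouts = 0;

        for (int i = 0; i < 1_000_000; i++) {
            // If the window is full, harvest the oldest result before issuing
            // another request, so the queue behind the IO thread never grows
            // without bound.
            if (inFlight.size() >= MAX_IN_FLIGHT) {
                timeouts += harvest(inFlight.poll());
            }
            inFlight.add(client.asyncGet("key" + (i % 1000)));
        }
        // Drain whatever is still outstanding at the end of the run.
        while (!inFlight.isEmpty()) {
            timeouts += harvest(inFlight.poll());
        }

        System.out.println("Timeout count is " + timeouts);
        client.shutdown();
    }

    private static int harvest(Future<Object> f) throws InterruptedException {
        try {
            f.get(OP_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            return 0;
        } catch (TimeoutException e) {
            f.cancel(false);                        // give up on this operation
            return 1;
        } catch (ExecutionException e) {
            return 0;                               // failed for another reason
        }
    }
}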
Original comment by ingen...@gmail.com on 14 Mar 2010 at 11:45
Is there a way to set the priority of the spymemcached Java thread higher, so we are less likely to see starvation?
Original comment by S.Ille...@gmail.com on 15 Sep 2010 at 8:01
Probably, though that's relatively JVM/OS dependent. On Solaris, for instance, there are com.sun.* APIs to change the priorities of threads in that JVM. Other JVM builds may have similar capabilities, and Sun's JVM for Linux may support this.
The other thing that may be possible is setting up affinity for these threads to CPUs which are not processing other tasks on the system. Specifically, move interrupts to one set of cores and bind the JVM to dedicated CPU cores for this processing. These are all things that are "normally" done when trying to ensure low scheduling latency for tasks.
Note that none of this is a guarantee, though. You may be able to limit the rate of your requests so that the rest are more likely to get through. This is pretty similar to getting more CPUs or dedicating resources.
Which JVM/OS are you seeing issues on?
Original comment by ingen...@gmail.com on 15 Sep 2010 at 8:51
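For reference, the portable (if not guaranteed to do anything) version of this is plain java.lang.Thread priorities. A hypothetical helper that finds the client's IO thread by the name shown in the DTrace output above and raises its priority; whether the OS scheduler honors it depends on the JVM and platform, as noted above.

/** Hypothetical helper: raise the priority of spymemcached IO threads.
 *  On most JVM/OS combinations this is only a hint to the scheduler. */
public static void bumpMemcachedIoPriority() {
    for (Thread t : Thread.getAllStackTraces().keySet()) {
        // The IO thread is named "Memcached IO over {MemcachedConnection ...}"
        if (t.getName().startsWith("Memcached IO over")) {
            t.setPriority(Thread.MAX_PRIORITY);
        }
    }
}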
Closing this one, as there are a number of things to help from issue 136, and I think the rest of the possible issues can be addressed with documentation about what a timeout is and is not.
Basic summary: set a timeout if you would reasonably do something different after the timeout interval. If what you're trying to do is ensure things happen with as little latency as possible, setting a lower timeout won't help.
Original comment by ingen...@gmail.com on 20 Apr 2011 at 6:27
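In code terms, the summary above amounts to treating a timeout as a trigger for an alternative path rather than as a latency control. A hedged sketch against the asynchronous API; the 50ms budget and the fallback source are assumptions.

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import net.spy.memcached.MemcachedClient;

public class TimeoutAsFallback {
    private final MemcachedClient client;

    public TimeoutAsFallback(MemcachedClient client) {
        this.client = client;
    }

    /** Try the cache, but fall back to another source if it doesn't answer
     *  within 50ms. Only set a timeout this low if the fallback is genuinely
     *  what you want to do when the cache is slow. */
    public Object lookup(String key) {
        Future<Object> f = client.asyncGet(key);
        try {
            Object cached = f.get(50, TimeUnit.MILLISECONDS);
            if (cached != null) {
                return cached;
            }
        } catch (TimeoutException e) {
            f.cancel(false);          // stop waiting on the cache
        } catch (Exception e) {
            f.cancel(false);          // treat other failures the same way
        }
        return loadFromPrimaryStore(key);
    }

    /** Placeholder for the slower, authoritative source (assumption). */
    private Object loadFromPrimaryStore(String key) {
        return "value-for-" + key;
    }
}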
Original issue reported on code.google.com by ingen...@gmail.com on 9 Mar 2010 at 8:15