Closed GoogleCodeExporter closed 9 years ago
I am seeing the same problem, but with distcc 3.1.
Original comment by rustysawdust@gmail.com
on 18 Feb 2009 at 11:45
The whole process of the monitoring programs reading multiple state files while
those files get created and deleted is hokey. It is possible that, while a
snapshot of the entries in the .distcc/state directory is being taken, multiple
compiles on the same host/slot pair get triggered. Note that both the text-based
and the GUI-based monitors suffer from the same problem, since they use the same
underlying function, dcc_mon_poll.
A quick fix for this is to ignore multiple compiles on the same host/slot pair,
for which I have attached a patch below. This is not an ideal solution, but it
should do the job.
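For illustration only, the idea of ignoring repeated host/slot pairs can be
sketched as below. This is not the attached patch and not distcc code: the
TaskEntry record and the drop_duplicate_slots helper are hypothetical names, and
the real monitor works on C structures read from the state files.

```python
from typing import Iterable, List, NamedTuple


class TaskEntry(NamedTuple):
    # Hypothetical stand-in for one state-file record, not distcc's real struct.
    host: str
    slot: int
    file: str


def drop_duplicate_slots(entries: Iterable[TaskEntry]) -> List[TaskEntry]:
    """Keep only the first entry seen for each (host, slot) pair."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry.host, entry.slot)
        if key in seen:
            continue  # a second compile on the same host/slot pair: ignore it
        seen.add(key)
        unique.append(entry)
    return unique


snapshot = [
    TaskEntry("somehost", 0, "a.c"),
    TaskEntry("somehost", 0, "b.c"),  # same host/slot as the entry above
    TaskEntry("localhost", 1, "c.c"),
]
filtered = drop_duplicate_slots(snapshot)
```

Note that this only hides the duplicates in the display; it does not explain
why two state files claim the same slot.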
Original comment by pankaj.c...@gmail.com
on 26 Aug 2009 at 3:58
Attachments:
I'm pretty sure that I have found the actual cause of the bug. Will upload a
patch as soon as I solve a few issues in it.
Original comment by jeremy.w...@gmail.com
on 20 Jun 2010 at 2:45
OK. So, the bug appears to be in the client, in the handling of the my_state
variable. Version 3.0 introduced the possibility of a file being processed
partly locally and partly remotely, but the state information was still all
being stored in my_state, resulting in mixed-up data. So the lock files are
created correctly, but the state files are not.
This fix simply creates a second state variable, differentiating between local
and remote, and stores state data in the appropriate one. When
dcc_write_state() is called, it uses whichever state variable was last updated.
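A rough model of that design, written in Python rather than C and not taken
from the patch itself: TaskState, ClientState, note and write_state are
hypothetical names, and the fields are an assumed subset of what a state file
carries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskState:
    # Hypothetical subset of the fields a state file might carry.
    host: str = ""
    slot: int = -1
    phase: str = "startup"


class ClientState:
    """Keeps local and remote state apart and remembers the last one touched."""

    def __init__(self) -> None:
        self.local = TaskState()
        self.remote = TaskState()
        self.last_updated: Optional[TaskState] = None

    def note(self, where: str, **fields) -> None:
        if where not in ("local", "remote"):
            raise ValueError("where must be 'local' or 'remote'")
        state = self.local if where == "local" else self.remote
        for name, value in fields.items():
            setattr(state, name, value)
        self.last_updated = state

    def write_state(self) -> Optional[TaskState]:
        # Stand-in for writing a state file: return the record that would be written.
        return self.last_updated


client = ClientState()
client.note("remote", host="somehost", slot=6, phase="connect")
client.note("local", host="localhost", slot=2, phase="cpp")
during_cpp = client.write_state()      # the local record
client.note("remote", phase="compile")
during_compile = client.write_state()  # the remote record, slot still 6
```

Because each record keeps its own slot, noting the local preprocessing slot no
longer disturbs the remote one.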
It works for me, so I would love to hear if it works for you! Cheers.
PS. This is the first time that I have looked at the distcc codebase, so the
usual disclaimers of naivety apply: I may have inadvertently created a
monstrous new bug!
Original comment by jeremy.w...@gmail.com
on 21 Jun 2010 at 3:28
Attachments:
Observed the bug last night for the first time since I have been running the
patched client. Didn't see where it happened, so am not able to reproduce it.
Original comment by jeremy.w...@gmail.com
on 27 Jun 2010 at 6:40
I have been bugged by this particular bug for a long while. Finally I had the
time and energy to investigate, and found this to be the problem:
dcc_build_somewhere calls dcc_pick_host_from_list_and_lock_it, which calls
dcc_lock_one. This particular function sets my_state.slot to whatever slot is
available from whatever host is picked. However, when returning to
dcc_build_somewhere the code gets into dcc_lock_local_cpp, which again calls
dcc_lock_one for localhost. This, in turn, overwrites the previously stored
slot with the local slot chosen for preprocessing. Finally, in
dcc_lock_local_cpp, the state is committed to disk for the monitor to pick it
up, but with the remote host paired with the local slot value.
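The sequence can be mimicked with a toy model. This is a Python sketch of the
call order as described here, not the real C code: MyState, lock_one and
build_somewhere are simplified stand-ins, and recording the host at the moment
the remote host is picked is an assumption made for the illustration.

```python
class MyState:
    """Toy version of the single shared state record."""

    def __init__(self):
        self.host = None
        self.slot = -1  # unallocated


def lock_one(state, host, slot):
    """Pretend to lock host/slot; as in the description, only the slot is recorded."""
    state.slot = slot
    return (host, slot)


def build_somewhere(state, remote_host, remote_slot, local_cpp_slot):
    state.host = remote_host  # assumption: host is noted when the remote is picked
    remote_lock = lock_one(state, remote_host, remote_slot)
    # Second call, for local preprocessing: overwrites state.slot.
    cpp_lock = lock_one(state, "localhost", local_cpp_slot)
    return remote_lock, cpp_lock, (state.host, state.slot)


remote_lock, cpp_lock, reported = build_somewhere(MyState(), "somehost", 6, 2)
```

Here the locks held are somehost/6 and localhost/2, while the state that would
be reported to the monitor is somehost/2.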
I came to this conclusion after reading Jeremy's post, but my patch does less:
it lets my_state.slot be set just once, by the first call to
dcc_pick_host_from_list_and_lock_it. It appears to work.
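In sketch form, "set just once" amounts to a guard like the following. It is
Python for brevity; note_slot_once and the dict-based state are hypothetical and
only model the idea, not the patch.

```python
UNALLOCATED = -1


def note_slot_once(state: dict, slot: int) -> None:
    """Record the slot only if none has been recorded yet."""
    if state["slot"] == UNALLOCATED:
        state["slot"] = slot


state = {"host": "somehost", "slot": UNALLOCATED}
note_slot_once(state, 6)  # remote slot from the first lock
note_slot_once(state, 2)  # local preprocessing slot: ignored
```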
Judging by how communication between the distcc host/slot allocation and the
monitor state reading is done, I think getting duplicate host+slot entries is
possible if the monitor is reading the state files while one (or more) of the
processes are modifying the files, but this should occur rarely.
Hope this helps :)
Original comment by cheepe...@gmail.com
on 6 Jul 2010 at 1:11
Attachments:
Hi cheepeero, I just tested your patch on rev 728 and it actually made the
symptoms worse: lots of duplicate localhost entries appeared within seconds.
Have you tested it recently? Cheers.
Original comment by jeremy.w...@gmail.com
on 10 Sep 2010 at 8:17
I will have to check what is different with rev 728; as the name suggests, my
patch was made against distcc-3.1. I'll also investigate whether the gentoo
distro applies some other patches as well.
I use distcc for building my gentoo systems; with the patch all behave well
(one single-core, 1 dual core and one with 8 logical cores). Here is a
screenshot of how it looks with my patch on the dual-core while building glib.
Another thing that might cause a difference is that I have limited on each
system the localhost slots in /etc/distcc/hosts.
Original comment by cheepe...@gmail.com
on 10 Sep 2010 at 9:39
Attachments:
On the previous run I had -j8 left in the gentoo configuration. Here is another
screenshot on mysql with -j10
Original comment by cheepe...@gmail.com
on 10 Sep 2010 at 9:44
Attachments:
Again mysql with -j14. I have checked out your trunk, but it's quite late for
me so I'll continue tomorrow :)
Original comment by cheepe...@gmail.com
on 10 Sep 2010 at 9:50
Attachments:
Hey. I have tested my patch against r730, and found no problem with it. The
behavior of distccmon-gnome was correct. I have no explanation for why your test
went wrong, except the obvious ones: the patch was not actually applied, or your
sources were tainted with some other patch or other modifications.
To clarify: my patch refers to the pure distcc trunk / tag 3.1; I did not apply
any of the other patches in this thread but the one I submitted, namely
distcc-3.1-fix-slots.patch
I cannot do more except instruct you to try again. Perhaps open the patch file
with a text editor and read the long comment I placed there. Read the code and
validate my fix.
I am on the watcher list of this issue and I'll monitor it. I'm sure we'll be
able to make it work :)
Original comment by cheepe...@gmail.com
on 12 Sep 2010 at 3:03
cheepeero, could you attach your config.h? Thanks.
Original comment by jeremy.w...@gmail.com
on 16 Sep 2010 at 10:12
Here is the config.h.
However, in the meantime I have used distcc without the software limitation in
/etc/distcc/hosts and once the localhost/0 slot got multiplied three times.
localhost pink
I have used localhost/2 pink/8 with my patch for quite a while now (see date of
my post) without bumping into this problem, so I assume this has something to
do with slot allocation, not with transmitting the correct slot to the monitor
(which is what my patch fixes).
Original comment by cheepe...@gmail.com
on 17 Sep 2010 at 8:07
Attachments:
Interesting. I'll have a look when I'm feeling better. Our config.h files are
the same apart from avahi. Cheers.
Original comment by jeremy.w...@gmail.com
on 18 Sep 2010 at 6:09
Hi cheepeero. I just tested your patch again with and without the /LIMIT
option and duplicate localhost entries appeared each time. I tested by
compiling the Linux kernel with make -j6 and a hosts file of "localhost prison".
The monitor reads slot information from the state files saved by the client;
there is no separate channel for transmission. So whatever appears in the
monitor is whatever the clients are 'thinking'. :)
Original comment by jeremy.w...@gmail.com
on 24 Sep 2010 at 5:46
I understand that the slot info is inferred by the client, not reported by the
server. I was referring to distcc client - distcc monitor communication through
state files.
Even so, this does not change the fact that the same member,
dcc_task_state.slot, is used both for accounting the remote compiling slot and
the local preprocessor slot, and the latter is the last write in my_state.slot
(or my_state->slot within your patch).
Let me elaborate:
At line 550 in compile.c, dcc_build_somewhere sets my_state.slot to -1
(unallocated, I suppose). A few lines below it calls
dcc_pick_host_from_list_and_lock_it, which at line 104 calls dcc_lock_one. When
finding a free host/cpu, this function writes the found i_cpu (slot) variable
with dcc_note_state_slot.
When returning to dcc_build_somewhere in compile.c, the flow goes to
dcc_lock_local_cpp at line 567, which in where.c also calls dcc_lock_one, this
time with a fake "localhost" list. This time dcc_lock_one finds another i_cpu
relative to the fake "localhost" list, which overwrites the previous remote
slot selection. So the remote slot initially chosen for compiling the code is
lost in my_state at this phase, replaced by the local preprocessor slot.
I have found that these are the only two places where my_state.slot is
modified. The slot variable in my_state also seems to be used only by the
monitoring clients, not for host selection or such.
I have examined your patch, and I think it should solve this particular slot
accounting bug. I think mine would do so too, except yours would report
preprocessing on the localhost slot it is performed on, while mine would report
it on the remote slot where compilation is scheduled.
However, both our patches seem to have the same problem:
dcc_lock_local_cpp is looking at a different dcc_hostdef list (see line 194 in
where.c), namely dcc_hostdef_local_cpp, which has 8 slots by default, and is
only overridden by the option "--localslots_cpp".
Also, dcc_lock_local called at line 763 in compile.c examines a different
localhost list (dcc_hostdef_local set by "--localslots") than the normal
"localhost" host added by /etc/distcc/hosts or otherwise. This one has 4 slots
hardcoded as default.
See dcc_parse_hosts at line 468 in hosts.c or just grep the sources for the
variables I have pointed to.
Under these circumstances I feel it is normal for the monitor to display up to
4 or 8 slots for localhost. But with either of our patches, the lock system should
prevent having duplicate host+slot entries. Did you observe duplicates or just
an increased number of localhost slots?
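To make the distinction concrete, here is a small Python sketch that separates
the two symptoms in a list of (host, slot) pairs as a monitor might show them.
The classify helper and the sample data are hypothetical; nothing here reads
real state files.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, int]


def classify(entries: Iterable[Pair], expected: Dict[str, int]):
    """Return (duplicate pairs, hosts showing more distinct slots than expected)."""
    counts = Counter(entries)
    duplicates: List[Pair] = sorted(pair for pair, n in counts.items() if n > 1)
    slots_per_host: Dict[str, Set[int]] = {}
    for host, slot in counts:
        slots_per_host.setdefault(host, set()).add(slot)
    extra = {
        host: len(slots)
        for host, slots in slots_per_host.items()
        if host in expected and len(slots) > expected[host]
    }
    return duplicates, extra


# Four distinct localhost slots where two were expected: extra slots, no duplicates.
more_slots = classify([("localhost", s) for s in range(4)], {"localhost": 2})

# The same localhost slot listed twice: a true duplicate.
dupes = classify([("localhost", 0), ("localhost", 0), ("pink", 3)], {"localhost": 2})
```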
I apologize for the long post, but I felt that the intention of my patch was
not understood - basically it's a different fix (more trivial and less clean)
than yours.
Original comment by cheepe...@gmail.com
on 25 Sep 2010 at 11:42
Right, that explains why I occasionally see four slots used on localhost when I
expect the limit to be two.
However, it's not the problem. When I use your patch, I still see duplicate
entries: the same localhost-slot combination listed multiple times.
Doesn't your patch force the client to create the second lock on localhost with
the same slot as the first lock found on the remote host? Which is a reversal
of the original problem that the second lock clobbered the first slot value.
Now instead of duplicate remote hosts there are duplicate localhosts.
Original comment by jeremy.w...@gmail.com
on 26 Sep 2010 at 4:24
No, my changes only stop the local preprocessing slot from overwriting the
initial remote host/slot selection. As far as I understand the distcc code, the
lock files are different from state files.
Here's a scenario without patch: somehost/6 is selected in the first call of
dcc_lock_one, and gets recorded in my_state. Next dcc_lock_local_cpp locks slot
localhost/2 for preprocessing, but only records the slot, so from now on the
state file contains somehost/2 and remains so until it gets deleted. With my
patch the second write to my_state.slot is ignored, so both the preprocessing
and the compilation phase appear to happen on somehost/6.
As far as I understand your patch, the preprocessing phase would move the whole
my_state from remote_state to local_state, and record preprocessing on
localhost/2, but I am unsure that after this phase is finished the client
records the switch from local_state to remote_state (or back) while the file is
getting compiled remotely and the localhost/2 slot is unlocked for other tasks.
This might lead to duplicating localhost/2 if it gets reused in the meantime.
Unfortunately I had to move from my previous location so I won't be able to use
distcc for a few weeks - no more pink/8 to help my poor notebook :) Perhaps
I'll find another setup to test for duplicate localhost+host entries.
After this discussion I feel your implementation is better than my idea,
because it describes more accurately what actually happens (preprocessing on
localhost etc.) I will also switch to your patch and see how it works when my
setup becomes available again.
Original comment by cheepe...@gmail.com
on 26 Sep 2010 at 10:21
Sorry, I said "lock" but I meant the state file. Makes me wonder if it's
really necessary to have separate lock and state files?
Hmmm, see, I thought your patch would cause it to lock and "note" somehost/6,
then lock localhost/2 but note localhost/6. Most of the time it doesn't,
because preprocessing and compiling happen on the remote host. But if the
remote fails for some reason, then falling back to localhost does cause it to
note localhost with the remote slot, causing duplication.
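A toy model of that concern, in Python and purely illustrative: the Client
class is hypothetical, and it assumes that the host is re-noted on fallback
while the slot keeps its first value. Whether the real client behaves this way
is exactly what is in question.

```python
UNALLOCATED = -1


class Client:
    """Toy client with a set-once slot, tracking noted state and held lock separately."""

    def __init__(self):
        self.noted_host = None
        self.noted_slot = UNALLOCATED
        self.held_lock = None

    def lock(self, host, slot):
        self.held_lock = (host, slot)
        self.noted_host = host  # assumption: host is noted on every lock
        if self.noted_slot == UNALLOCATED:  # set-once rule for the slot
            self.noted_slot = slot

    def noted(self):
        return (self.noted_host, self.noted_slot)


client = Client()
client.lock("somehost", 6)   # remote host picked first
client.lock("localhost", 2)  # remote failed, fall back to localhost
```

In this model the lock held is localhost/2 but the noted state is localhost/6,
which could collide with another client that legitimately notes localhost/6.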
Anyway, if we agree that my patch design is the way to go, then we can work on
making it as good as can be and get this bug resolved. You can always run a
local distcc server and configure your client to use it. Maybe something like
"localhost/1 shadow/1" where "shadow" is the hostname of your localhost.
I'll look into the possible source of duplication using my patch soon. Cheers.
Original comment by jeremy.w...@gmail.com
on 27 Sep 2010 at 1:55
Here's a revised patch that tidies it up, but most importantly changes the
target of dcc_note_state() in dcc_wait_for_cpp(). Cheers.
Original comment by jeremy.w...@gmail.com
on 29 Sep 2010 at 2:26
Attachments:
I have tested your last patch during my last gentoo upgrade (48 packages) and
found no duplicate host+slot entries. The localhost host jumped up to 4 slots
even though my notebook has 2 logical CPUs, and I think this is due to the preset
preprocessor slots. I am unsure if some rare race conditions between the
monitor and distcc may still provoke duplicates, but it seems unlikely.
I believe the patch should be applied on the distcc trunk, since the situation
presented by the distcc monitor (text and gtk-based) with it is a lot better
than without it. It also corrects the mess created by the bug overwriting the
remote slot with the local preprocessor slot. What I like most is that I can
now see how busy my local host is with preprocessing.
I also hope distcc developers read and take into account my thoughts :)
Original comment by cheepe...@gmail.com
on 8 Oct 2010 at 6:05
Thanks heaps, everyone! Thanks particularly to Jeremy and Cheepeero for your
patches, review, and testing, and also to Pankaj for the earlier work-around
patch, and to jsjuni56 for first reporting the issue.
I have applied Jeremy's last patch ("distcc-r731_state.patch") to the distcc
svn repository; it is revision 732.
Original comment by fer...@google.com
on 8 Oct 2010 at 6:31
Original comment by fergus.h...@gmail.com
on 8 Oct 2010 at 6:31
Cool, I'm glad we could get it resolved.
Cheepeero, are you interested in working with me on issue #24? I've started to
play around with it and got some promising results but I could use some help.
Cheers.
Original comment by jeremy.w...@gmail.com
on 15 Oct 2010 at 8:51
Sure, if I can. I have starred #24 :)
Original comment by cheepe...@gmail.com
on 15 Oct 2010 at 10:52
This bug seems to have come back as of 3.1:
distcc 3.1 x86_64-pc-linux-gnu
(protocols 1, 2 and 3) (default port 3632)
built Feb 17 2012 13:04:11
Built from source via Gentoo sys-devel/distcc-3.1-r5
Original comment by whim...@gmail.com
on 6 May 2012 at 6:36
Attachments:
3.1 is too old to have this patch. It is included in 3.2.
Original comment by jeremy.w...@gmail.com
on 7 May 2012 at 12:15
Original issue reported on code.google.com by
jsjun...@gmail.com
on 25 Jan 2009 at 7:37
Attachments: