sid24ss / distcc

Automatically exported from code.google.com/p/distcc
GNU General Public License v2.0

distcc-mon-gnome displays multiple rows for same host/slot #36

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Answering the following questions is a big help:

1. What version of distcc are you using (e.g. "2.7.1")?  You can run
"distcc --version" to see.  If you got distcc from a distribution package
rather than building from source, please say which one.

distcc 3.0 i686-pc-linux-gnu
  (protocols 1, 2 and 3) (default port 3632)
  built Dec 18 2008 21:34:13

Built from source using Gentoo package sys-devel/distcc-3.0-r4.

2. What platform are you running on (e.g. "Red Hat 8.0", "HP-UX 11.11")? 
What compiler are you using ("gcc 3.3")?  Run "uname -a" and "cc
--version" to see.

Linux tidal 2.6.23-gentoo-r9 #2 SMP Tue Aug 26 20:20:08 PDT 2008 i686
Intel(R) Core(TM)2 Duo CPU E4600 @ 2.40GHz GenuineIntel GNU/Linux

gcc (GCC) 4.1.2 (Gentoo 4.1.2 p1.1)

3. What were you trying to do (e.g. "install distcc", "build Mozilla")?

Compile almost anything with distcc.

4. What went wrong?  Did you get an error message, did it hang, did it
build a program that didn't work, did it not distribute compilation to
machines that ought to get it?

distcc-mon-gnome displays multiple rows for the same host/slot combination.
This appears to occur only for remote hosts. See attached screenshot.

5. If you have an example of a compiler invocation that failed, quote it,
in full e.g.:
   distcc gcc -DHAVE_CONFIG_H -D_GNU_SOURCE -I./src \
     "-DSYSCONFDIR=\"/etc/\"" -I./lzo -g -O2 -W -Wall -W \
     -Wimplicit -Wshadow -Wpointer-arith -Wcast-align \
     -Wwrite-strings -Waggregate-return -Wstrict-prototypes \
     -Wmissing-prototypes -Wnested-externs -o src/clirpc.o \
     -c src/clirpc.c

Compiling works fine.

6. What error logging do you get?  Turn on client and server error logging.
On the client, set these environment variables, and try to reproduce the
problem: "export DISTCC_VERBOSE=1 DISTCC_LOG=/tmp/distcc.log".  Start the
server with the --verbose option. If the problem is intermittent, leave
logging enabled and then pull out the lines from the log file when the
problem recurs.

No error messages.

7. If you got an error message on stderr, quote that error exactly. Find
the lines in the log files pertaining to the compile, and include all of
them in your report, by looking at the process ID in square brackets. If
you can't work that out, quote the last few hundred lines leading up to the
failure. 

No error messages.

Original issue reported on code.google.com by jsjun...@gmail.com on 25 Jan 2009 at 7:37

GoogleCodeExporter commented 9 years ago
I am seeing the same problem, but with distcc 3.1.

Original comment by rustysawdust@gmail.com on 18 Feb 2009 at 11:45

GoogleCodeExporter commented 9 years ago
The way the monitoring programs read multiple state files that are constantly
being created and deleted is hokey.  It is possible that, while a snapshot of
the entries of the .distcc/state directory is being taken, multiple compiles
on the same host/slot pair get triggered.  Note that both the text-based and
the GUI-based monitors suffer from the same problem, since they use the same
underlying function, dcc_mon_poll.

A quick fix for this is to ignore multiple compiles on the same host/slot
pair, for which I have attached a patch below.  This is not an ideal solution,
but it should do the job.
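
To illustrate the idea of ignoring repeated host/slot pairs (this is only a sketch, not the attached patch), a monitor-side filter could look roughly like the following. The `mon_task` struct and the function name are hypothetical and far simpler than whatever dcc_mon_poll really returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one entry in the monitor's list;
 * not the real distcc structure. */
struct mon_task {
    char host[64];
    int  slot;
};

/* Keep only the first entry seen for each host/slot pair.
 * Compacts the array in place and returns the new length. */
size_t drop_duplicate_slots(struct mon_task *tasks, size_t n)
{
    size_t kept = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        for (size_t j = 0; j < kept; j++) {
            if (tasks[j].slot == tasks[i].slot &&
                strcmp(tasks[j].host, tasks[i].host) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            tasks[kept++] = tasks[i];
    }
    return kept;
}
```

A filter like this only hides the symptom in the display; whatever wrote the conflicting state files is left untouched.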

Original comment by pankaj.c...@gmail.com on 26 Aug 2009 at 3:58

GoogleCodeExporter commented 9 years ago
I'm pretty sure that I have found the actual cause of the bug.  Will upload a 
patch as soon as I solve a few issues in it.

Original comment by jeremy.w...@gmail.com on 20 Jun 2010 at 2:45

GoogleCodeExporter commented 9 years ago
OK.  So, the bug appears to be in the client, in the handling of the my_state 
variable.  Version 3.0 introduced the possibility of a file being processed 
partly locally and partly remotely, but all the state information was still 
being stored in my_state, resulting in mixed-up data.  So the lock files are 
created correctly, but the state files are not.

This fix simply creates a second state variable, differentiating between local 
and remote, and stores state data in the appropriate one.  When 
dcc_write_state() is called, it uses whichever state variable was last updated.
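
As a rough sketch of that design (not the attached patch; the struct, enum and function names below are made up for illustration), keeping two states and remembering which one was touched last might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down task state; the real struct has more fields. */
struct task_state {
    const char *host;
    int slot;               /* -1 means no slot noted yet */
};

enum state_target { TARGET_LOCAL, TARGET_REMOTE };

static struct task_state local_state  = { "localhost", -1 };
static struct task_state remote_state = { NULL, -1 };
static struct task_state *last_updated = &remote_state;

/* Record a host/slot in the state that matches where the work runs,
 * and remember which state was touched last. */
void note_slot(enum state_target target, const char *host, int slot)
{
    struct task_state *st =
        (target == TARGET_LOCAL) ? &local_state : &remote_state;

    assert(host != NULL && slot >= 0);
    st->host = host;
    st->slot = slot;
    last_updated = st;
}

/* Stand-in for choosing what a state writer would put on disk. */
const struct task_state *state_to_write(void)
{
    return last_updated;
}
```

With this split, noting a local preprocessing slot no longer disturbs the remote host/slot that was recorded first; the writer just reports whichever of the two changed most recently.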

It works for me, so I would love to hear if it works for you!  Cheers.

PS. This is the first time that I have looked at the distcc codebase, so the 
usual disclaimers of naivety apply: I may have inadvertently created a 
monstrous new bug!

Original comment by jeremy.w...@gmail.com on 21 Jun 2010 at 3:28

GoogleCodeExporter commented 9 years ago
Observed the bug last night for the first time since I have been running the 
patched client.  Didn't see where it happened, so am not able to reproduce it.

Original comment by jeremy.w...@gmail.com on 27 Jun 2010 at 6:40

GoogleCodeExporter commented 9 years ago
I have been bugged by this particular bug for a long while. Finally I had the 
time and energy to investigate, and found this to be the problem:

dcc_build_somewhere calls dcc_pick_host_from_list_and_lock_it, which calls 
dcc_lock_one. That function sets my_state.slot to whichever slot is available 
on whichever host is picked. However, after returning to dcc_build_somewhere 
the code goes on to dcc_lock_local_cpp, which calls dcc_lock_one again, this 
time for localhost. This, in turn, overwrites the previously stored slot with 
the local slot chosen for preprocessing. Finally, in dcc_lock_local_cpp, the 
state is committed to disk for the monitor to pick up, but with the remote 
host paired with the local slot.

I came to this conclusion after reading jeremy's post, but my patch does less: 
it lets my_state.slot be set just once, by the first call to 
dcc_pick_host_from_list_and_lock_it. It appears to work.
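
The difference between the two behaviours can be sketched as below. This is a simplified model rather than the patch itself: the struct and both function names are hypothetical, and using -1 as the "not yet set" marker is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down version of the client's single task state. */
struct task_state {
    int slot;               /* slot shown by the monitor; -1 = not yet set */
};

/* Unpatched behaviour as described above: every lock overwrites the slot. */
void note_slot_always(struct task_state *st, int slot)
{
    assert(st != NULL);
    st->slot = slot;
}

/* Set-once variant: only the first lock (the remote one) is recorded. */
void note_slot_once(struct task_state *st, int slot)
{
    assert(st != NULL);
    if (st->slot == -1)
        st->slot = slot;
}
```

If the remote lock lands on slot 6 and the local preprocessing lock on slot 2 (numbers chosen only for illustration), note_slot_always leaves 2 in the state while note_slot_once keeps 6.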

Judging by how communication between the distcc host/slot allocation and the 
monitor state reading is done, I think getting duplicate host+slot entries is 
possible if the monitor is reading the state files while one (or more) of the 
processes are modifying the files, but this should occur rarely.

Hope this helps :)

Original comment by cheepe...@gmail.com on 6 Jul 2010 at 1:11

GoogleCodeExporter commented 9 years ago
Hi cheepeero, I just tested your patch on rev 728 and it actually made the 
symptoms worse: lots of duplicate localhost entries appeared within seconds.  
Have you tested it recently?  Cheers.

Original comment by jeremy.w...@gmail.com on 10 Sep 2010 at 8:17

GoogleCodeExporter commented 9 years ago
I will have to check what is different with rev 728; as the name suggests, my 
patch was made against distcc-3.1. I'll also check whether the Gentoo distro 
applies any other patches.

I use distcc for building my Gentoo systems; with the patch they all behave 
well (one single-core, one dual-core and one with 8 logical cores). Here is a 
screenshot of how it looks with my patch on the dual-core while building glib.

Another thing that might cause a difference is that I have limited on each 
system the localhost slots in /etc/distcc/hosts.

Original comment by cheepe...@gmail.com on 10 Sep 2010 at 9:39

GoogleCodeExporter commented 9 years ago
On the previous run I had -j8 left in the gentoo configuration. Here is another 
screenshot on mysql with -j10

Original comment by cheepe...@gmail.com on 10 Sep 2010 at 9:44

GoogleCodeExporter commented 9 years ago
Again mysql with -j14. I have checked out your trunk, but it's quite late for 
me so I'll continue tomorrow :)

Original comment by cheepe...@gmail.com on 10 Sep 2010 at 9:50

GoogleCodeExporter commented 9 years ago
Hey. I have tested my patch against r730, and found no problem with it. The 
behavior of distccmon-gnome was correct. I have no explanation for why your 
test went wrong, except the obvious ones: the patch was not actually applied, 
or your sources were tainted by some other patch or modification.

To clarify: my patch refers to the pure distcc trunk / tag 3.1; I did not apply 
any of the other patches in this thread but the one I submitted, namely 
distcc-3.1-fix-slots.patch

I cannot do more except instruct you to try again. Perhaps open the patch file 
with a text editor and read the long comment I placed there. Read the code and 
validate my fix.

I am on the watcher list of this issue and I'll monitor it. I'm sure we'll be 
able to make it work :)

Original comment by cheepe...@gmail.com on 12 Sep 2010 at 3:03

GoogleCodeExporter commented 9 years ago
cheepeero, could you attach your config.h?  Thanks.

Original comment by jeremy.w...@gmail.com on 16 Sep 2010 at 10:12

GoogleCodeExporter commented 9 years ago
Here is the config.h.

However, in the meantime I have used distcc without the software limitation in 
/etc/distcc/hosts and once the localhost/0 slot got multiplied three times.

localhost pink

I have used localhost/2 pink/8 with my patch for quite a while now (see date of 
my post) without bumping into this problem, so I assume this has something to 
do with slot allocation, not with transmitting the correct slot to the monitor 
(which is what my patch fixes).

Original comment by cheepe...@gmail.com on 17 Sep 2010 at 8:07

GoogleCodeExporter commented 9 years ago
Interesting.  I'll have a look when I'm feeling better.  Our config.h files are 
the same apart from avahi.  Cheers.

Original comment by jeremy.w...@gmail.com on 18 Sep 2010 at 6:09

GoogleCodeExporter commented 9 years ago
Hi cheepeero.  I just tested your patch again with and without the /LIMIT 
option and duplicate localhost entries appeared each time.  I tested by 
compiling the Linux kernel with make -j6 and a hosts file of "localhost prison".

The monitor reads slot information from the state files saved by the client, 
there is no separate channel for transmission.  So whatever appears in the 
monitor is whatever the clients are 'thinking'.  :)
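
For anyone unfamiliar with this kind of design, polling a directory of per-process state files can be sketched roughly as follows. This is not the real dcc_mon_poll: the record layout, the function name and the assumption that each file holds one fixed-size binary record are all illustrative.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Hypothetical record layout; the real state file format is not shown here. */
struct state_record {
    char host[64];
    int  slot;
};

/* Read up to `max` records from the files in `dir`.
 * Returns the number of records read, or -1 if the directory cannot be opened. */
int read_state_dir(const char *dir, struct state_record *out, int max)
{
    DIR *d;
    struct dirent *ent;
    int n = 0;

    assert(dir != NULL && out != NULL);
    d = opendir(dir);
    if (d == NULL)
        return -1;

    while (n < max && (ent = readdir(d)) != NULL) {
        char path[4096];
        FILE *f;

        if (ent->d_name[0] == '.')
            continue;           /* skip ".", ".." and hidden files */
        snprintf(path, sizeof path, "%s/%s", dir, ent->d_name);
        f = fopen(path, "rb");
        if (f == NULL)
            continue;           /* file vanished between readdir and fopen */
        if (fread(&out[n], sizeof out[n], 1, f) == 1)
            n++;                /* short or partly written files are ignored */
        fclose(f);
    }
    closedir(d);
    return n;
}
```

Because clients create, rewrite and delete their files while the loop runs, a snapshot taken this way is never atomic, and it can only ever show what each client last wrote.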

Original comment by jeremy.w...@gmail.com on 24 Sep 2010 at 5:46

GoogleCodeExporter commented 9 years ago
I understand that the slot info is inferred by the client, not reported by the 
server. I was referring to distcc client - distcc monitor communication through 
state files.

Even so, this does not change the fact that the same member, 
dcc_task_state.slot, is used both for accounting the remote compiling slot and 
the local preprocessor slot, and the latter is the last write in my_state.slot 
(or my_state->slot within your patch).

Let me elaborate:

At line 550 in compile.c, dcc_build_somewhere sets my_state.slot to -1 
(unallocated, I suppose). A few lines below it calls 
dcc_pick_host_from_list_and_lock_it, which at line 104 calls dcc_lock_one. When 
it finds a free host/cpu, this function records the i_cpu (slot) it found 
with dcc_note_state_slot.

When returning to dcc_build_somewhere in compile.c, the flow goes to 
dcc_lock_local_cpp at line 567, which in where.c also calls dcc_lock_one, this 
time with a fake "localhost" list. This time dcc_lock_one finds another i_cpu 
relative to the fake "localhost" list, which overwrites the previous remote 
slot selection. So the remote slot initially chosen for compiling the code is 
lost in my_state at this phase, replaced by the local preprocessor slot.

I have found that these are the only two places where my_state.slot is 
modified. The slot variable in my_state also seems to be used only by the 
monitoring clients, not for host selection or the like.

I have examined your patch, and I think it should solve this particular slot 
accounting bug. I think mine would do so too, except yours would report 
preprocessing on the localhost slot it is performed on, while mine would report 
it on the remote slot where compilation is scheduled.

However, both our patches seem to have the same problem:

dcc_lock_local_cpp is looking at a different dcc_hostdef list (see line 194 in 
where.c), namely dcc_hostdef_local_cpp, which has 8 slots by default, and is 
only overridden by the option "--localslots_cpp". 

Also, dcc_lock_local called at line 763 in compile.c examines a different 
localhost list (dcc_hostdef_local set by "--localslots") than the normal 
"localhost" host added by /etc/distcc/hosts or otherwise. This one has 4 slots 
hardcoded as default. 

See dcc_parse_hosts at line 468 in hosts.c or just grep the sources for the 
variables I have pointed to.
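
A small sketch of why the separate lists matter for the monitor follows. It assumes the defaults quoted above (4 and 8) and uses a hypothetical cut-down `hostdef` struct and function name rather than the real definitions in hosts.c.

```c
#include <assert.h>

/* Hypothetical, cut-down host definition; only the fields needed here. */
struct hostdef {
    const char *hostname;
    int n_slots;
};

/* Defaults as quoted in the comment above, not checked against the source. */
static const struct hostdef local_compile = { "localhost", 4 }; /* "--localslots" */
static const struct hostdef local_cpp     = { "localhost", 8 }; /* "--localslots_cpp" */

/* Upper bound on the distinct localhost slot numbers a monitor could show
 * when rows are keyed on hostname/slot and all three lists hand out slots.
 * `hosts_file_slots` is the localhost limit from the hosts file. */
int max_localhost_slots(int hosts_file_slots)
{
    int max = hosts_file_slots;

    assert(hosts_file_slots >= 0);
    if (local_compile.n_slots > max)
        max = local_compile.n_slots;
    if (local_cpp.n_slots > max)
        max = local_cpp.n_slots;
    return max;
}
```

Under these assumptions, a hosts file entry of localhost/2 still allows localhost slot numbers up to 8 to appear, because all three lists share the same hostname.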

Under these circumstances I feel it is normal for the monitor to display up to 
4 or 8 slots for localhost. But with any of our patches, the lock system should 
prevent having duplicate host+slot entries. Did you observe duplicates or just 
an increased number of localhost slots?

I apologize for the long post, but I felt that the intention of my patch was 
not understood - basically it's a different fix (more trivial and less clean) 
than yours.

Original comment by cheepe...@gmail.com on 25 Sep 2010 at 11:42

GoogleCodeExporter commented 9 years ago
Right, that explains why I occasionally see four slots used on localhost when I 
expect the limit to be two.

However, it's not the problem.  When I use your patch, I still see duplicate 
entries: the same localhost-slot combination listed multiple times.

Doesn't your patch force the client to create the second lock on localhost with 
the same slot as the first lock found on the remote host?  Which is a reversal 
of the original problem that the second lock clobbered the first slot value.  
Now instead of duplicate remote hosts there are duplicate localhosts.

Original comment by jeremy.w...@gmail.com on 26 Sep 2010 at 4:24

GoogleCodeExporter commented 9 years ago
No, my changes only stop the local preprocessing slot from overwriting the 
initial remote host/slot selection. As far as I understand the distcc code, the 
lock files are different from the state files.

Here's a scenario without patch: somehost/6 is selected in the first call of 
dcc_lock_one, and gets recorded in my_state. Next dcc_lock_local_cpp locks slot 
localhost/2 for preprocessing, but only records the slot, so from now on the 
state file contains somehost/2 and remains so until it gets deleted. With my 
patch the second write to my_state.slot is ignored, so both the preprocessing 
and the compilation phase appear to happen on somehost/6.

As far as I understand your patch, the preprocessing phase would move the whole 
my_state from remote_state to local_state, and record preprocessing on 
localhost/2, but I am unsure that after this phase is finished the client 
records the switch from local_state to remote_state (or back) while the file is 
getting compiled remotely and the localhost/2 slot is unlocked for other tasks. 
This might lead to duplicating localhost/2 if it gets reused in the meantime.

Unfortunately I had to move from my previous location so I won't be able to use 
distcc for a few weeks - no more pink/8 to help my poor notebook :) Perhaps 
I'll find another setup to test for duplicate localhost+host entries. 

After this discussion I feel your implementation is better than my idea, 
because it describes more accurately what actually happens (preprocessing on 
localhost etc.) I will also switch to your patch and see how it works when my 
setup becomes available again.

Original comment by cheepe...@gmail.com on 26 Sep 2010 at 10:21

GoogleCodeExporter commented 9 years ago
Sorry, I said "lock" but I meant the state file.  Makes me wonder if it's 
really necessary to have separate lock and state files?

Hmmm, see, I thought your patch would cause it to lock and "note" somehost/6, 
then lock localhost/2 but note localhost/6.  Most of the time it doesn't, 
because preprocessing and compiling happen on the remote host.  But if the 
remote fails for some reason, then falling back to localhost does cause it to 
note localhost with the remote slot, causing duplication.

Anyway, if we agree that my patch design is the way to go, then we can work on 
making it as good as can be and get this bug resolved.  You can always run a 
local distcc server and configure your client to use it.  Maybe something like 
"localhost/1 shadow/1" where "shadow" is the hostname of your localhost.

I'll look into the possible source of duplication using my patch soon.  Cheers.

Original comment by jeremy.w...@gmail.com on 27 Sep 2010 at 1:55

GoogleCodeExporter commented 9 years ago
Here's a revised patch that tidies it up, but most importantly changes the 
target of dcc_note_state() in dcc_wait_for_cpp().  Cheers.

Original comment by jeremy.w...@gmail.com on 29 Sep 2010 at 2:26

GoogleCodeExporter commented 9 years ago
I have tested your last patch during my last gentoo upgrade (48 packages) and 
found no duplicate host+slot entries. The localhost host jumped up to 4 slots 
even if my notebook has 2 logical CPUs, and I think this is due to the preset 
preprocessor slots. I am unsure if some rare race conditions between the 
monitor and distcc may still provoke duplicates, but it seems unlikely.

I believe the patch should be applied on the distcc trunk, since the situation 
presented by the distcc monitor (text and gtk-based) with it is a lot better 
than without it. It also corrects the mess created by the bug overwriting the 
remote slot with the local preprocessor slot. What I like most is that I can 
now see how busy my local host is with preprocessing.

I also hope distcc developers read and take into account my thoughts :)

Original comment by cheepe...@gmail.com on 8 Oct 2010 at 6:05

GoogleCodeExporter commented 9 years ago
Thanks heaps, everyone!  Thanks particularly to Jeremy and Cheepeero for your 
patches, review, and testing, and also to Pankaj for the earlier work-around 
patch, and to jsjuni56 for first reporting the issue.

I have applied Jeremy's last patch ("distcc-r731_state.patch") to the distcc 
svn repository; it is revision 732.

Original comment by fer...@google.com on 8 Oct 2010 at 6:31

GoogleCodeExporter commented 9 years ago

Original comment by fergus.h...@gmail.com on 8 Oct 2010 at 6:31

GoogleCodeExporter commented 9 years ago
Cool, I'm glad we could get it resolved.

Cheepero, are you interested in working with me on issue #24?  I've started to 
play around with it and got some promising results but I could use some help.  
Cheers.

Original comment by jeremy.w...@gmail.com on 15 Oct 2010 at 8:51

GoogleCodeExporter commented 9 years ago
Sure, if I can. I have starred #24 :)

Original comment by cheepe...@gmail.com on 15 Oct 2010 at 10:52

GoogleCodeExporter commented 9 years ago
This bug seems to have come back as of 3.1:

distcc 3.1 x86_64-pc-linux-gnu
  (protocols 1, 2 and 3) (default port 3632)
  built Feb 17 2012 13:04:11

Built from source via Gentoo sys-devel/distcc-3.1-r5

Original comment by whim...@gmail.com on 6 May 2012 at 6:36

GoogleCodeExporter commented 9 years ago
3.1 is too old to have this patch.  It is included in 3.2.

Original comment by jeremy.w...@gmail.com on 7 May 2012 at 12:15