Closed e14f4152-4982-4ace-8c95-73a0599b109b closed 14 years ago
Attachment: singular-3.1.1.4-libfac.patch.gz
patch for libfac makefile
François has posted a patch at #9733 for the libfac problem. We could merge it into a new singular-3-1-1-4.p3.spkg here.
Patch attached with a slightly more descriptive name.
Description changed:
---
+++
@@ -38,3 +38,11 @@
make[2]: Leaving directory `/var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/libfac'
make[1]: *** [install] Error 1
+
+A workaround is to run
+
+$ cd SAGE_ROOT
+$ ./sage -f spkg/standard/singular-3-1-1-4.p2.spkg
+
+and to restart the build.
Patch again looks reasonable.
I only wonder how often we'll again have to create a new Singular spkg for almost the "same" reason... ;-)
Funny that the new race condition showed up with just two make jobs, after so much excessive testing by Dave and others.
Replying to @nexttime:
Patch again looks reasonable.
I only wonder how often we'll again have to create a new Singular spkg for almost the "same" reason... ;-)
Yes, it's a bit worrying.
Funny that the new race condition showed up with just two make jobs, after so much excessive testing by Dave and others.
I expect it depends very much on the time a compiler takes to compile code and the linker takes to link it. On any one system, if file foo.c takes longer to compile than file foobar.c, then that's likely to be the same irrespective of the system load.
I was wondering about the possibility of using a wrapper script for gcc, such that there was a random delay between 0 and 50 ms, before gcc actually started compiling. The nanosleep function should provide that possibility. Then implement something similar for the linker so there was a random delay before it started linking. Of course that would slow the build process, but would probably have a higher probability of inducing build failures.
One would need to think carefully about how to seed the random number generators then. Probably using /dev/random would be best.
It would be some effort to write the code for this, but once written it should make detection of race conditions much easier for any bit of Sage.
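A minimal sketch of the random-delay idea, in C since that is where nanosleep() lives. It is an illustration only and not the code attached to this ticket: the function names read_random_u32, delay_from_random and random_delay are made up for the example, and it reads /dev/urandom rather than /dev/random so that it never blocks waiting for entropy.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Read one 32-bit value from /dev/urandom. Returns 0 on success, -1 on failure. */
int read_random_u32(uint32_t *out)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(out, sizeof *out, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Map a raw random value to a delay of 0..max_ms milliseconds (inclusive). */
struct timespec delay_from_random(uint32_t raw, unsigned max_ms)
{
    assert(max_ms < UINT_MAX);  /* max_ms + 1 must not wrap around to 0 */
    unsigned ms = raw % (max_ms + 1u);
    struct timespec ts;
    ts.tv_sec = (time_t)(ms / 1000u);
    ts.tv_nsec = (long)(ms % 1000u) * 1000000L;
    return ts;
}

/* Sleep for a random time of at most max_ms milliseconds.
   Returns 0 on success, -1 if no random data could be read
   or the sleep was interrupted. */
int random_delay(unsigned max_ms)
{
    uint32_t raw;
    if (read_random_u32(&raw) != 0)
        return -1;
    struct timespec ts = delay_from_random(raw, max_ms);
    return nanosleep(&ts, NULL);
}
```

A wrapper would call random_delay(50) and then exec the real compiler with the original arguments. The modulo step is slightly biased towards small values, which should not matter for this purpose.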
Another idea, which may help, would be to randomly change the number of processors online for those of us with multi-processor machines. I'm less certain about whether that would be useful, though it would be very much easier to implement. Although this machine is only quad core, it's hyperthreaded, so I could have from one to eight threads active at any one time.
I suspect a search of Google would find other similar (but hopefully better) methods of inducing effects that are likely to uncover race conditions.
Dave
I was thinking of similar testing w.r.t. spkg/standard/deps, i.e. inter-spkg dependencies, where adding (not necessarily random) sleep would suffice, especially to the build (time) of packages that get built early and almost instantly.
For Makefiles, IMHO careful reading (and / or a "makefile-lint") would be better, which is obviously harder [to implement]. Unfortunately, the steps actually taken in a build also heavily depend on the system and its configuration, so I expect it to be undecidable in general. Nevertheless, one could catch at least some missing dependencies without "brute-force" trial and error.
P.S.: For deps, I'd prefer generating it from formal (dependency) specs given in (or for) the spkgs, like every proper package system / installer does. These usually also include specs of the required versions.
Sorry, a bit off-ticket-topic.
Is there already a spkg around? (Sorry, I was out of office and off.) I put one with fbissey's patch here: http://sage.math.washington.edu/home/dreyer/spkg/singular-3-1-1-4.p3.spkg
By all means someone review this, as I expect it fixes the bug.
Within the next 24 hours I will have implemented a random delay in front of gcc and will build this a few hundred times with random delays. That might help discover any more similar issues, if they exist.
Dave
The changelog lacks an entry for the p3.
François, perhaps you could ask the original reporter to test this, too?
(Such that we don't run into the next race condition after this ticket has been merged.)
His/her system appears to be a good test platform... ;-)
Replying to @nexttime:
The changelog lacks an entry for the p3.
François, perhaps you could ask the original reporter to test this, too?
(Such that we don't run into the next race condition after this ticket has been merged.)
His/her system appears to be a good test platform... ;-)
I posted the patch only after he reported back that it worked for him. You could say that he has a good test platform. Single core system, he has a 64bit base system and a 32bit system in a chroot. In this particular case the problem only appeared in the 32bit chroot.
He reported quite a number of issues for the sage-on-gentoo port over the years.
Replying to @kiwifb:
I posted the patch only after he reported back that it worked for him.
Sorry, you already said so, forgot that.
Dave, if you don't want to mess around with /dev/urandom (and hd?), you could use the pid of a subshell (rand_raw=`sh -c 'echo $$'`) as the source for a pseudo random number (perhaps "recursively", i.e. the pid of the nth subshell where n is itself a pseudo-random number), and feed that to
sleep `expr $base + $rand_raw % $modulus \* $scale`
Plain rand_raw is of course bad on an otherwise idle system, also depending on how long you sleep.
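For comparison, roughly the same idea in C: take the pid of a short-lived child as the raw value and apply the same arithmetic as the expr line. This is only a sketch. fresh_child_pid and delay_seconds are names invented here, and the parameters mirror the shell variables base, modulus and scale. As in expr, % and * bind tighter than + and are evaluated left to right, so the result is base + ((rand_raw % modulus) * scale).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits at once and return its pid (-1 if fork failed). */
long fresh_child_pid(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);               /* child: nothing to do */
    if (pid > 0)
        waitpid(pid, NULL, 0);  /* parent: reap the child */
    return (long)pid;
}

/* Same arithmetic as: expr $base + $rand_raw % $modulus \* $scale */
long delay_seconds(long base, long rand_raw, long modulus, long scale)
{
    assert(modulus > 0 && rand_raw >= 0);
    return base + rand_raw % modulus * scale;
}
```

With base 1, modulus 5 and scale 2, for example, delay_seconds can only yield 1, 3, 5, 7 or 9. Consecutive pids are far from random, so the caveat about an otherwise idle system applies here just as much.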
Minimalistic C program (not very robust) to implement nanosleep analogous to sleep(1).
Attachment: nanosleep.c.gz
Dave, I've attached a minimalistic C implementation of a command nanosleep <nanoseconds>.
Causes a random delay up to a maximum set by the environment variable MAX_COMPILER_DELAY_IN_MICRO_SECONDS.
Attachment: randomsleep.c.gz
I thought I'd commented on the file I added, but I forgot.
I'd already worked on this before Leif posted his file, though we both made use of nanosleep().
Rather than take an argument like Leif's, mine reads the value of the environment variable MAX_COMPILER_DELAY_IN_MICRO_SECONDS. It reads from /dev/urandom.
What I'm less sure about, is what is a sensible maximum delay to use. That's a difficult question to answer I think. Any ideas?
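A rough sketch of reading such a limit from the environment, for anyone who wants to experiment. It is not the attached randomsleep.c: parse_delay_us and max_delay_us_from_env are hypothetical helper names, and only the variable name MAX_COMPILER_DELAY_IN_MICRO_SECONDS is taken from the description above. The string returned by getenv() is used directly, so nothing has to be allocated or freed.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a plain decimal number of microseconds.
   Returns 0 and stores the value on success, -1 on missing or malformed input. */
int parse_delay_us(const char *text, unsigned long *out)
{
    assert(out != NULL);
    if (text == NULL || !isdigit((unsigned char)text[0]))
        return -1;              /* unset, empty, signed or non-numeric */
    char *end;
    errno = 0;
    unsigned long value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;              /* overflow or trailing junk */
    *out = value;
    return 0;
}

/* Return the limit from the environment, or fallback if it is unset or invalid. */
unsigned long max_delay_us_from_env(unsigned long fallback)
{
    unsigned long value;
    const char *text = getenv("MAX_COMPILER_DELAY_IN_MICRO_SECONDS");
    return parse_delay_us(text, &value) == 0 ? value : fallback;
}
```

Falling back to a default instead of exiting keeps a compiler wrapper usable when the variable is not set, which seems the friendlier choice for something that sits in front of every gcc call.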
Replying to @nexttime:
The changelog lacks an entry for the p3.
I fixed that (typical late-night copy&paste mistake)
http://sage.math.washington.edu/home/dreyer/spkg/singular-3-1-1-4.p3.spkg (same location as above)
if (sizeof(int) != 4) {
fprintf(stderr,"Your system is odd. On everying except a Cray X MP, an int is 4 bytes\n");
fprintf(stderr,"Exiting...\n");
exit(3);
}
LOL, ever heard of 16-bit systems (or compilers)? And even some old compilers on 32-bit systems had 2-byte ints IIRC (of course that's odd). I can't wait for sizeof(int)>=16 ... ;-)
The allocation of delay_as_string is useless (actually a space leak :) ) unless you str[n]cpy() the result of getenv(...) to it, but who cares.
Replying to @nexttime:
if (sizeof(int) != 4) {
fprintf(stderr,"Your system is odd. On everying except a Cray X MP, an int is 4 bytes\n");
fprintf(stderr,"Exiting...\n");
exit(3);
}
LOL, ever heard of 16-bit systems (or compilers)? And even some old compilers on 32-bit systems had 2-byte ints IIRC (of course that's odd). I can't wait for sizeof(int)>=16 ... ;-)
No, but when I was testing http://atlc.sourceforge.net/ for portability, I found on a Cray X MP running Unicos that:
- sizeof(long)=8
- sizeof(int)=8
- sizeof(short)=8
The latter was a real pain to me.
So nothing would totally surprise me.
I've now set this up compiling in 5 different directories on machine
I suspect that the first of these is too long, so will slow builds considerably and the last is too short, so will not be too different from no delay.
The allocation of delay_as_string is useless (actually a space leak :) ) unless you str[n]cpy() the result of getenv(...) to it, but who cares.
Yes, you are correct. I'll change that, but it's not a big deal for now. It's a start.
I don't know how this would fit in with Nathann's:
... though my way of seeing it would really be to avoid spending hours wondering "how it could fail" when it seems so easy to just test it and write patches when errors are reported.
Somehow I doubt it would fit in too well.
If I don't get any problems, and there are problems found by others, it would make me wonder how else one could try to find such problems. This was my best stab at an answer, but perhaps it is not ideal.
I think really there should be a delay on the linker too, which I have not implemented. I've only done this on the compilers gcc, g++ and gfortran.
Dave
Replying to @sagetrac-drkirkby:
[...] when I was testing http://atlc.sourceforge.net/ for portability, I found on a Cray X MP running Unicos that:
- sizeof(long)=8
- sizeof(int)=8
- sizeof(short)=8
The latter was a real pain to me.
IIRC the C standard requires
sizeof(short) <= sizeof(int) <= sizeof(long)
but also
sizeof(short) < sizeof(long)
so Cray would violate the latter.
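As far as I can tell from the standard's wording, the requirement is looser than that: it fixes minimum ranges in <limits.h> (short and int at least 16 bits, long at least 32 bits) and requires each range to be contained in the next, without demanding that any step is strict. Under that reading, three 8-byte types are allowed. A small sketch of those checks, assuming a C11 compiler for static_assert; ranges_are_nested and report_integer_sizes are names invented for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Minimum magnitudes every conforming implementation must provide. */
static_assert(SHRT_MAX >= 32767, "short must hold at least 16 bits");
static_assert(INT_MAX >= 32767, "int must hold at least 16 bits");
static_assert(LONG_MAX >= 2147483647L, "long must hold at least 32 bits");

/* Non-strict ordering: equal ranges are acceptable. */
int ranges_are_nested(long short_max, long int_max, long long_max)
{
    return short_max <= int_max && int_max <= long_max;
}

/* Print the sizes and whether this platform's ranges are nested. */
void report_integer_sizes(FILE *out)
{
    fprintf(out, "sizeof(short)=%zu sizeof(int)=%zu sizeof(long)=%zu nested=%d\n",
            sizeof(short), sizeof(int), sizeof(long),
            ranges_are_nested(SHRT_MAX, INT_MAX, LONG_MAX));
}
```

Checks like these are a gentler alternative to refusing to run when sizeof(int) != 4, since they only reject platforms that really cannot hold the values the program needs.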
I don't know how this would fit in with Nathann's:
... though my way of seeing it would really be to avoid spending hours wondering "how it could fail" when it seems so easy to just test it and write patches when errors are reported.
Somehow I doubt it would fit in too well.
If I don't get any problems, and there are problems found by others, it would make me wonder how else one could try to find such problems. This was my best stab at an answer, but perhaps it is not ideal.
You mean I should upload a reviewer patch? ;-)
I think really there should be a delay on the linker too, which I have not implemented. I've only done this on the compilers gcc, g++ and gfortran.
The linker isn't often used directly, and libtool calls gcc for linking:
ld -shared -o p_Procs_FieldIndep.so p_Procs_Lib_FieldIndep.dl_o
ld -shared -o p_Procs_FieldZp.so p_Procs_Lib_FieldZp.dl_o
ld -shared -o p_Procs_FieldQ.so p_Procs_Lib_FieldQ.dl_o
ld -shared -o p_Procs_FieldGeneral.so p_Procs_Lib_FieldGeneral.dl_o
ld -shared -o dbmsr.so ndbm.dl_o sing_dbm.dl_o -lc_nonshared
(That's all.)
Replying to @nexttime:
Replying to @sagetrac-drkirkby:
[...] when I was testing http://atlc.sourceforge.net/ for portability, I found on a Cray X MP running Unicos that:
- sizeof(long)=8
- sizeof(int)=8
- sizeof(short)=8
The latter was a real pain to me.
IIRC the C standard requires
sizeof(short) <= sizeof(int) <= sizeof(long)
but also
sizeof(short) < sizeof(long)
so Cray would violate the latter.
I was not aware that sizeof(short) < sizeof(long) was part of the C standard. I don't have a copy of the standard. If anyone does, please email me one!
I guess the $15,000,000 Cray X-MP introduced in 1982 pre-dated the latest C99 standard by a few years.
If you want to try a Cray, go to http://www.cray-cyber.org/access/index.php and get an account.
At the time I used the X-MP (at least I think it was an X-MP, but I see they only have a Y-MP now), it was painfully slow to do anything. You really needed to use the vector processors properly to get the performance, and I had no intention of doing that. But I reckon my code is quite portable, as it must be one of the only programs to have been run on a supercomputer and a Sony Playstation games console! I think Sage would need a good few patches to ever be that portable.
I don't know how this would fit in with Nathann's:
... though my way of seeing it would really be to avoid spending hours wondering "how it could fail" when it seems so easy to just test it and write patches when errors are reported.
Somehow I doubt it would fit in too well.
If I don't get any problems, and there are problems found by others, it would make me wonder how else one could try to find such problems. This was my best stab at an answer, but perhaps it is not ideal.
You mean I should upload a reviewer patch? ;-)
I think really there should be a delay on the linker too, which I have not implemented. I've only done this on the compilers gcc, g++ and gfortran.
The linker isn't often used directly, and libtool calls gcc for linking:
ld -shared -o p_Procs_FieldIndep.so p_Procs_Lib_FieldIndep.dl_o
ld -shared -o p_Procs_FieldZp.so p_Procs_Lib_FieldZp.dl_o
ld -shared -o p_Procs_FieldQ.so p_Procs_Lib_FieldQ.dl_o
ld -shared -o p_Procs_FieldGeneral.so p_Procs_Lib_FieldGeneral.dl_o
ld -shared -o dbmsr.so ndbm.dl_o sing_dbm.dl_o -lc_nonshared
(That's all.)
OK. So hopefully random delays on the compiler will work.
Things are looking ok so far. I've built this
If by around 0830 GMT tomorrow, this has not failed, I'll give it a positive review. By that time it should have built nearly 100 times with the longest delay and certainly 100 times with the shorter delays.
Dave
Reviewer: David Kirkby
I had a bit of a shock this morning. Instead of my computer running flat out building Singular, it was relatively idle. This suggested the Singular processes had died. But luckily it was not an error. The loop had a limit of 100, so they had all finished.
So I've now built this version of Singular 500 times, with random delays before calling gcc.
If there are any further problems found, then I don't know how we go about testing for them. At least random testing has not discovered them.
So positive review.
Dave
Author: Francois Bissey, Alexander Dreyer
Upstream: Fixed upstream, in a later stable release.
@Dave: Thanks a lot for the extensive tests! (Also in the name of the Singular-Team).
All the newly found dependencies will be added upstream. Some of them are in fact not only unexpected, but also they were not intended that way. (In fact these are bugs, which will be fixed.)
Changed upstream from Fixed upstream, in a later stable release. to Reported upstream. Developers acknowledge bug.
Replying to @alexanderdreyer:
@Dave: Thanks a lot for the extensive tests! (Also in the name of the Singular-Team).
All the newly found dependencies will be added upstream. Some of them are in fact not only unexpected, but also they were not intended that way. (In fact these are bugs, which will be fixed.)
You are welcome. I've changed the tag about "Report Upstream", since these changes are not yet in a stable release of Singular. Look at all the options on that, and determine what's the most appropriate.
Of course it's good I found no flaws, though I must admit it would have been nice to know whether adding random delays before calling gcc/g++ was useful or not. I guess I could try the previous version with random delays and see if I can uncover the same bug as the other person.
Trying to build the whole of Sage with random delays might be useful, but it would not be possible to build it 500 times in any reasonable amount of time.
But I better stop there, as I would only confirm Leif's belief I'm addicted to testing!
Dave
Merged: sage-4.6.alpha2
Changed author from Francois Bissey, Alexander Dreyer to François Bissey, Alexander Dreyer
Description changed:
---
+++
@@ -21,8 +21,8 @@
./bin/install-sh -c -m 644 factoryconf.h /var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/build/include/factoryconf.h
./bin/install-sh -c -m 644 ./ftmpl_inst.cc /var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/build/include/templates/ftmpl_inst.cc
for file in ftmpl_array.cc ftmpl_factor.cc ftmpl_functions.h ftmpl_list.cc ftmpl_matrix.cc ftmpl_array.h ftmpl_factor.h ftmpl_list.h ftmpl_matrix.h; do \
- ./bin/install-sh -c -m 644 ./templates/$file /var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/build/include/templates/$file; \
- done
+ ./bin/install-sh -c -m 644 ./templates/$file /var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/build/include/templates/$file; \
+ done
ranlib /var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/build/lib/libsingcf.a
make[2]: Leaving directory `/var/tmp/portage/sci-mathematics/singular-3.1.1.4-r1/work/Singular-3-1-1/factory'
make install in libfac
This is a follow-up to #9733. From Comment 57 by François Bissey at that ticket:
Would you believe it. I committed the latest patch to the sage-on-gentoo tree 3 hours ago and 1 hour ago about one of our users reported a new parallel make failure at -j2 on x86.... in libfac this time.
A workaround is to run
$ cd SAGE_ROOT
$ ./sage -f spkg/standard/singular-3-1-1-4.p2.spkg
and to restart the build.
Upstream: Reported upstream. Developers acknowledge bug.
CC: @alexanderdreyer @sagetrac-drkirkby @kiwifb @jhpalmieri @nexttime
Component: packages: standard
Author: François Bissey, Alexander Dreyer
Reviewer: David Kirkby
Merged: sage-4.6.alpha2
Issue created by migration from https://trac.sagemath.org/ticket/9946