@weinholt knows a lot about CI and says Circle and GitLab have more advanced features than Cirrus. But Cirrus has a lot of platforms (Linux/Docker, Windows, Mac, FreeBSD) and is fast and easy. We could start with it. All of these new CI hosts ought to be easy to set up, so if it doesn't work out, it's not hard to change.
I love the ultra-simple UI of Cirrus: https://cirrus-ci.com/github/weinholt/ It's very fast and has only the information you need.
I can make a pull request if you want
That would be great, but let's wait for @feeley to confirm. He probably has to register a Cirrus account and then enable it for the udem-dlteam organization from the GitHub settings.
One thing that would be good to investigate right away is how publishing of build artifacts works. It would be great to always be able to download gsc and gsi binaries of the latest successful CI build.
On Mon, Nov 4, 2019 at 09:38, Lassi Kortela notifications@github.com wrote:
One thing that would be good to investigate right away is how publishing of build artifacts works. It would be great to always be able to download gsc and gsi binaries of the latest successful CI build.
Here is the relevant documentation: https://cirrus-ci.org/guide/writing-tasks/#artifacts-instruction
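From a quick look at that page, artifacts are declared per task with a <name>_artifacts instruction, so something like this might do it (untested sketch; the paths to the gsi and gsc binaries are guesses):

task:
  # ... build steps ...
  gsi_artifacts:
    path: gsi/gsi
  gsc_artifacts:
    path: gsc/gsc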
This will require setting up Cirrus at gambit/gambit. Maybe it is easier for the time being to build Gambit at each CI run?
Related to code coverage: https://github.com/gambit/gambit/issues/94
We could also try to set it up for udem-dlteam/gambit first
Just realized I already have a Cirrus account, so I can add Cirrus to my personal fork of Gambit and see how it goes. That's the fastest way to find out. If it's good we can then add it to gambit/gambit and to this udem-dlteam repo. If there are problems, we save some account creation hassles and can try another provider.
I also realized I have one and just configured it for udem-dlteam/gambit to see how it goes.
CI often takes many tries to get right. Since CI services build all branches of the repo, it's usually easiest to do the build experimentation in a topic branch and merge to master once it's working.
BTW how much are the gsc and gsi executables affected by the ./configure --prefix=... argument? If the CI builds them with a particular prefix, can those binaries be copied onto another computer and still be usable from an arbitrary directory?
CI agent stopped responding! Container errored with 'OOMKilled'
Out of memory, and we're not even building with --enable-single-host :|
I have created a basic .cirrus.yml file and it is currently executing. For future reference: it is important to pull the tags from the repo first so that the build process knows it needs to bootstrap the compiler instead of immediately compiling the (stale) .c files.
I'm not sure why it is giving a Container errored with 'OOMKilled' error. Maybe the -j on the make is too much...
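For reference, the file is roughly along these lines (simplified sketch, not the exact contents; the image and step names are illustrative):

task:
  container:
    image: ubuntu:18.04
  tags_script: git fetch --tags    # pull the tags so the compiler bootstrap is triggered
  configure_script: ./configure
  build_script: make -j            # possibly the cause of the OOM
  test_script: make ut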
@amirouche @lassik Since you now have write access to both repositories, I will let you continue the experimenting with Cirrus CI. If it looks good then we can move the changes to gambit/gambit.
We need to verify that Cirrus CI allows checking the main platforms (linux, macOS and Windows), and with the different build settings that are currently done with Travis and Appveyor.
Concerning the --prefix=..., the build process burns the path into the executable as the default "Gambit installation directory". If you want to use the executable on another machine, you will need to move the bin, lib and include directories, and use an explicit -:~~=<path> runtime option to inform the runtime system where those directories are. Here is some relevant information from INSTALL.txt:
If the --prefix option is not used, the default is to install all
files in /usr/local/Gambit and its subdirectories, namely "bin",
"lib", "include", etc. The files that would normally go in these
subdirectories can be redirected to other directories using
the following configure options:
--bindir=DIR executables (gsi, gsc, ...)
--libdir=DIR libraries (libgambit.a, syntax-case.scm, ...)
--includedir=DIR C header files (gambit.h, ...)
--docdir=DIR documentation (gambit.pdf, gambit.html, ...)
--infodir=DIR info documentation (gambit.info, ...)
--datadir=DIR read-only architecture-independent data (gambit.el)
Note that the install target of the makefiles supports the DESTDIR
environment variable which allows staging an install to a specific
directory. The command:
% make install DESTDIR=/Users/feeley/stage
will install all the files in /Users/feeley/stage as though it was the
root of the filesystem.
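To make the relocation concrete, something along these lines should work (untested sketch; the paths are just examples):

% make install DESTDIR=/tmp/stage
% cp -r /tmp/stage/usr/local/Gambit /opt/gambit
% /opt/gambit/bin/gsi -:~~=/opt/gambit -e '(println (path-expand "~~"))'

The last command should print /opt/gambit/ (or similar), confirming that the runtime picked up the relocated installation directory.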
Thanks! I'm quite excited about this so I can get to it right away. I'll create a lassik/cirrus branch in udem-dlteam/gambit for experimenting (IMHO it's clearest if each topic branch name says who is working on it, otherwise old mystery branches tend to be left behind in repos).
Not sure how to deal with the prefix for CI builds. Maybe something like ./configure --prefix=/usr/local/gambit/$CIRRUS_BRANCH?
Does it matter? Is the filesystem persistent?
For the OOM error, try removing the -j on the make.
It doesn't matter during the build as the CI file system vanishes afterwards. But we can pick some files to save from each successful build as artifacts so people can download them. It would probably be useful to save the gsc and gsi binaries in this way. If those depend on the prefix, then that prefix would have to be something that is likely to work sensibly on the personal computers of people who download the binaries.
Ok I understand. I guess it could be the commit hash. I guess the $CIRRUS_BRANCH is related to that. Note that historically I have used /usr/local/Gambit (capital G).
If /usr/local/Gambit is already in use we could stick with it. But in case someone has installed a stable release of Gambit in there, we'd do well to think of some prefix that doesn't overwrite that one with a cutting-edge version. The git branch or hash would be good since one can then try multiple experimental branches on the same computer. Is the hash too precise? It would change with every new commit to the same branch. The git tag is probably not precise enough, as bleeding-edge builds are not tagged yet.
$CIRRUS_BRANCH is just the Git branch. It sets other similar environment variables too: https://cirrus-ci.org/guide/writing-tasks/#environment-variables
Here's a useful configure option:
--enable-multiple-versions    multiple installed versions are supported
It will put the installation in $prefix/<version> where version is something like v4.9.3 and will create the symbolic link current to point to the latest installed version. So using the commit hash as a "version" will not interfere.
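For example (sketch, assuming the default version string):

% ./configure --prefix=/usr/local/Gambit --enable-multiple-versions
% make
% make install
# result: /usr/local/Gambit/v4.9.3/{bin,lib,include,...}
#         plus the symbolic link /usr/local/Gambit/current -> v4.9.3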
Very nice! How is the <version> set and how does it determine which version is latest?
Without -j the make succeeds in Cirrus. But one unit test fails:
[223| 0] 94% ###############. 110.9s 13-modules/prim_port.scm
*** FAILED 13-modules/prim_port.scm WITH EXIT CODE HI=1 LO=0
(call-with-output-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #!void but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #<os-exception #3> but expected #<os-exception #4>
(call-with-output-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #5> but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #6> but expected #<os-exception #4>
Hey the tests passed except for 1 out of 237. Don't know what's happening with "echo".
I must say the build machine is underpowered... it took 110 seconds for the tests, and on my 2013 laptop it takes ~10 seconds for a make ut.
For the failing unit test, the result of (call-with-output-process "echo" ##newline) is compared to the result of (##call-with-output-process "echo" ##newline). It seems the first one is returning an OS exception and the second one is executing fine without an exception. This is very strange as these should be equivalent.
Note that echo is used because it is a program (or command) that exists on unix and windows; in this particular case a newline is sent to the echo program, which ignores it, so nothing bad should happen. It has never caused problems in the past on my development machines and on Travis. Could it be a transient problem? Could it be related to how slow the machine is? Maybe the build should be repeated to make sure...
Hmmm I misread the errors. It seems they all (except for one) fail with an os-exception... Could it be that there's a limit on subprocesses?
I must say the build machine is underpowered... it took 110 seconds for the tests, and on my 2013 laptop it takes ~10 seconds for a make ut.
Their test runners are rented on-demand from Google cloud. Since we are using the free service, they probably use some of the slower offerings. It is possible to plug your own build server into Cirrus for free (and to some other CI services, e.g. GitLab). I'm not familiar with the technical details. The paid accounts on these services are presumably faster/have more RAM.
Since some of the tests do I/O, there may be some additional factor causing slowdown - not sure. 10x slowdown is quite a lot.
The FreeBSD jobs also often take a lot of time to get going because there's limited capacity and a queue for the free service. The others don't have FreeBSD at all.
For the failing unit test, the result of (call-with-output-process "echo" ##newline) is compared to the result of (##call-with-output-process "echo" ##newline). It seems the first one is returning an OS exception and the second one is executing fine without an exception. This is very strange as these should be equivalent.
Note that echo is used because it is a program (or command) that exists on unix and windows; in this particular case a newline is sent to the echo program, which ignores it, so nothing bad should happen. It has never caused problems in the past on my development machines and on Travis. Could it be a transient problem? Could it be related to how slow the machine is? Maybe the build should be repeated to make sure...
It would be quite strange if speed problems caused an exception. I'll add the artifact settings, which will trigger another build. If the echo still fails, we could add a debug print to get details about the exception.
Very nice! How is the <version> set and how does it determine which version is latest?
The version is $PACKAGE_VERSION from the configure script. My point is that if the user's Gambit was configured with --enable-multiple-versions there will be no conflict if you use --prefix=/usr/local/Gambit/<commit_hash>.
Perhaps the configure script should have a mechanism to specify the version, other than using $PACKAGE_VERSION which would be the default. Maybe a --enable-version=<commit_hash>.
Would be nice to have an --enable-single-host build as well, but it will probably be extremely slow and possibly run out of RAM. We get what we pay for :)
https://cirrus-ci.org/guide/linux/ says:
Linux Community Cluster is a Kubernetes cluster running on Google Kubernetes Engine that is available free of charge for Open Source community. Paying customers can also use Community Cluster for personal private repositories or buy CPU time with compute credits for their private organization repositories.
Community Cluster is configured the same way as anyone can configure a personal GKE cluster as described here.
By default a container is given 2 CPUs and 4 GB of memory but it can be configured in .cirrus.yml:
Containers on Community Cluster can use maximum 8.0 CPUs and up to 24 GB of memory.
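Presumably something along these lines in .cirrus.yml (a sketch based on those docs, not quoted from them):

task:
  container:
    image: ubuntu:18.04
    cpu: 8
    memory: 24G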
I like the idea of running the CI locally. But we can explore that later if Cirrus gives the features we need.
You should try --enable-single-host in the next round. As long as there's no parallel build I'm confident it will pass.
Can you estimate how many gigs of RAM it needs?
We can try being greedy and using the maximum of 8 CPUs and 24 GB RAM. Maybe they have a fixed capacity, so if you request higher specs you'll have to wait longer in the queue for the build to start.
Can you estimate how many gigs of RAM it needs?
The memory usage of gcc is probably the limiting factor and it greatly depends on the gcc version. For gcc 8, which I use regularly, I haven't seen more than 1 GB RSS per gcc compilation.
It should be viewed as a bug if 4 GB RAM is not enough to build Gambit (for a non-parallel build).
Good to know. I tried using max CPU and RAM and make -j, and now it went really fast. From the log it looks like it tries to run configure and the compilation three times (???) and finally fails at something. https://cirrus-ci.com/task/6161948470673408
If we run the CI locally we would be using machines where a make -j is not a problem and is fast. Here are some build times on some of these machines (for a build including the git clone):
380 seconds gambit.iro.umontreal.ca recent mac mini with 6 cores
240 seconds arctic.iro.umontreal.ca liquid cooled Debian linux PC
That's a lot better than the ~1200 seconds with the Kubernetes and similar to the max CPU and RAM case (with no guarantees I assume).
For the build issue, could you try removing the -j? I suspect it is a concurrency issue with the makefiles (probably related to git).
Without -j the make succeeds in Cirrus. But one unit test fails:
[223| 0] 94% ###############. 110.9s 13-modules/prim_port.scm
*** FAILED 13-modules/prim_port.scm WITH EXIT CODE HI=1 LO=0
(call-with-output-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #!void but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #<os-exception #3> but expected #<os-exception #4>
(call-with-output-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #5> but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #6> but expected #<os-exception #4>
I suspect a problem with subprocesses... Is there an easy way to run tests on the Cirrus machine without having to rebuild from scratch? It really makes experimenting painful...
Anyway, please run this test:
;; File: process-issue.scm

(define (run n)
  (let loop ((i 0))
    (if (< i n)
        (begin
          (if (= 0 (modulo i 100)) (println "--------- " i))
          (with-exception-catcher
           (lambda (exc)
             (println "*** after " i " iterations got this exception:")
             (display-exception exc))
           (lambda ()
             (call-with-output-process "echo" ##newline)))
          (loop (+ i 1))))))

(run 10000)
Is there an easy way to run tests on the Cirrus machine without having to rebuild from scratch? It really makes experimenting painful...
Nope, none of the CI services have ever had that. Since it's just a Docker container, you can run the base image locally, but of course it's not exactly the same environment.
Actually, on my macOS machine I got this:
% gsi process-issue.scm
--------- 0
--------- 100
--------- 200
--------- 300
--------- 400
*** after 493 iterations got this exception:
Broken pipe
(force-output '#<input-output-port #2 (process "echo")>)
So this probably explains why the test is failing on Cirrus (maybe the probability of failure is higher because the echo process terminates faster there than on my test machine and the race is more often won by the echo process).
I think this should be declared a problem with the unit test. Otherwise, if Gambit's runtime system is modified to automatically silence the broken pipe error, there might be problematic situations where you actually care to know that the process died "prematurely" and you can't know.
Wow, that's weird.
I think this should be declared a problem with the unit test. Otherwise, if Gambit's runtime system is modified to automatically silence the broken pipe error, there might be problematic situations where you actually care to know that the process died "prematurely" and you can't know.
Agreed. Silencing errors is generally a recipe for worse problems later on.
I can disable that unit test and add a comment explaining why.
Instead of calling echo, call sort, which also exists on unix and windows... The sort process will wait for the input to be provided.
Please check out the latest build: https://cirrus-ci.com/task/5067557282775040
The build stage runs fine, but your process-issue.scm test gives tons of broken pipe errors with the echo command all the time. So that's what caused it for sure.
The end of the build stage says **** to compile the builtin modules you should: make modules. Is there a principled way to deal with module building in CI so that it would always work right, or are there going to be some bootstrap issues sometimes no matter how it is done?
Indeed, the echo dying fast(er) on the Cirrus machine is clearly the issue.
The modules aren't built automatically by the build steps because a make modules builds them as dynamically loadable .o1 files (OS "shared libraries") and, although this works fine on linux, macOS and Windows, it is possible some OS in the wild does not support this and a make would end with a build failure. Given that Gambit currently works fine without building the modules, it is an optional but recommended step (none of the modules currently call FFI code or use external OS libraries). For the CI builds a make modules should always work, except for the setting where --enable-ansi-c is configured... which precludes calling dlopen.
I added a make modules step to the CI. In the .yml file we always control which exact OS runs each build, and we know that the architecture is x86-64, so it shouldn't be a problem even to have very specific settings in there. The build commands can also vary by OS.
I'll try adding MacOS and FreeBSD next. Windows is also supported, but we have to figure out how to set up the compilers and run them. I've done Cygwin and Msys gcc builds on Cirrus. MSVC must be available too but I don't know how to run it.
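The added step is essentially just one more script instruction in the task, along the lines of (the step name is illustrative):

  modules_script: make modules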
The make without -j is really slow; I'll try adding it back.
There appears to be something about make -j that causes the build to try to do things more than once and then fail. https://cirrus-ci.com/task/4792435783237632 is an example.
This is not supposed to happen in normal CI, right?
**** checking if current gsc-boot must be built
**** building boot/gsc-boot
I suspect that the deselect-gen-for-commit is done too early... in other words there is a missing makefile dependency for deselect-gen-for-commit to cause it to happen after the files are generated.
This is not supposed to happen in normal CI, right?
**** checking if current gsc-boot must be built **** building boot/gsc-boot
Yes, but only once. It is because a full bootstrap must happen since we are starting with a fresh clone.
Good to know.
MacOS CI now works, except that 2 tests fail:
[161| 0] 67% ##########...... 6.6s 06-thread/thread_join.scm
*** FAILED 06-thread/thread_join.scm WITH EXIT CODE HI=1 LO=0
"unit-tests/06-thread/thread_join.scm"@30.1: FAILED (check-tail-exn join-timeout-exception? (lambda () (thread-join! t2 .001))) GOT 2
"unit-tests/06-thread/thread_join.scm"@32.1: FAILED (check-equal? (thread-join! t2 -1 123) 123) GOT 2
"unit-tests/06-thread/thread_join.scm"@33.1: FAILED (check-equal? (thread-join! t2 .001 123) 123) GOT 2
[165| 1] 70% ###########..... 7.6s 06-thread/thread_start.scm
*** FAILED 06-thread/thread_start.scm WITH EXIT CODE HI=1 LO=0
"unit-tests/06-thread/thread_start.scm"@32.1: FAILED (check-equal? var 1) GOT 0
Those are probably race conditions too...
That is strange... the behaviour makes me suspect that the heartbeat interrupts (for preemptive scheduling of the green threads) are somehow disabled. But the previous tests indicate that the heartbeat interrupts are at ~1000 Hz. So I don't understand how that could be the case when the tests run fine on my macOS development machine (using clang and gcc-8).
Can we set up continuous integration at, say, Cirrus CI? It seems like the most advanced CI system.
ref: https://mailman.iro.umontreal.ca/pipermail/gambit-list/2019-November/009222.html