udem-dlteam / libs

Repo to develop new libraries for Gambit

Setting up continuous integration #4

Closed: amirouche closed this issue 4 years ago

amirouche commented 4 years ago

Can we set up continuous integration at, say, Cirrus CI? It seems like the most advanced CI system.

ref: https://mailman.iro.umontreal.ca/pipermail/gambit-list/2019-November/009222.html

lassik commented 4 years ago

@weinholt knows a lot about CI and says Circle and GitLab have more advanced features than Cirrus. But Cirrus has a lot of platforms (Linux/Docker, Windows, Mac, FreeBSD) and is fast and easy. We could start with it. All of these new CI hosts ought to be easy to set up, so if it doesn't work out, it's not hard to change.

lassik commented 4 years ago

I love the ultra-simple UI of Cirrus (https://cirrus-ci.com/github/weinholt/). It's very fast and has only the information you need.

amirouche commented 4 years ago

I can make a pull request if you want

lassik commented 4 years ago

That would be great, but let's wait for @feeley to confirm. He probably has to register a Cirrus account and then enable it for the udem-dlteam organization from the GitHub settings.

One thing that would be good to investigate right away is how publishing of build artifacts works. It would be great to always be able to download gsc and gsi binaries of the latest successful CI build.

amirouche commented 4 years ago

On Mon, Nov 4, 2019 at 09:38, Lassi Kortela notifications@github.com wrote:

One thing that would be good to investigate right away is how publishing of build artifacts works. It would be great to always be able to download gsc and gsi binaries of the latest successful CI build.

This is the interesting documentation: https://cirrus-ci.org/guide/writing-tasks/#artifacts-instruction

This will require setting up Cirrus at gambit/gambit. Maybe it is easier for the time being to build Gambit on each CI run?
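
For reference, publishing the gsi and gsc binaries would look roughly like this in .cirrus.yml (a sketch based on that page; the task layout, artifact name and paths into the build tree are illustrative and not verified against Gambit):

task:
  container:
    image: ubuntu:latest        # placeholder image
  build_script:
    - ./configure
    - make
  # a <name>_artifacts instruction uploads the matched files so they can be
  # downloaded from the build page
  binaries_artifacts:
    path: "gsi/gsi"             # illustrative; gsc could be published the same way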

amirouche commented 4 years ago

Related to code coverage: https://github.com/gambit/gambit/issues/94

lassik commented 4 years ago

We could also try to set it up for udem-dlteam/gambit first.

lassik commented 4 years ago

Just realized I already have a Cirrus account, so I can add Cirrus to my personal fork of Gambit and find out how it goes. That's the fastest way to find out. If it's good we can then add it to gambit/gambit and this udem-dlteam. If there are problems, we save some account creation hassles and can try another provider.

feeley commented 4 years ago

I also realized I have one and just configured it for udem-dlteam/gambit to see how it goes.

lassik commented 4 years ago

CI often takes many tries to get right. Since CI services build all branches of the repo, it's usually easiest to do the build experimentation in a topic branch and merge to master once it's working.

lassik commented 4 years ago

BTW how much are the gsc and gsi executables affected by the ./configure --prefix=... argument? If the CI builds them with a particular prefix, can those binaries be copied onto another computer and still be usable from an arbitrary directory?

lassik commented 4 years ago

CI agent stopped responding! Container errored with 'OOMKilled'

Out of memory, and we're not even building with --enable-single-host :|

feeley commented 4 years ago

I have created a basic .cirrus.yml file and it is currently executing. For future reference: it is important to pull the tags from the repo first so that the build process knows we need to bootstrap the compiler first instead of immediately compiling the (stale) .c files.

I'm not sure why it is giving a Container errored with 'OOMKilled' error. Maybe the -j on the make is too much...
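
For the record, a minimal file along those lines could look something like this (just a sketch; the image, step names and test target are assumptions, not necessarily what the actual .cirrus.yml contains):

task:
  name: linux
  container:
    image: ubuntu:latest              # placeholder image
  # pull the tags so the build knows it must bootstrap the compiler
  # instead of compiling the stale .c files
  fetch_tags_script: git fetch --tags
  build_script:
    - ./configure
    - make                            # no -j for now, since -j is suspected of causing the OOM
  test_script: make ut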

@amirouche @lassik Since you now have write access to both repositories, I will let you continue the experimenting with Cirrus CI. If it looks good then we can move the changes to gambit/gambit.

We need to verify that Cirrus CI allows checking the main platforms (Linux, macOS and Windows), with the different build settings that are currently used with Travis and AppVeyor.

Concerning the --prefix=..., the build process burns the path into the executable as the default "Gambit installation directory". If you want to use the executable on another machine, you will need to move the bin, lib and include directories, and use an explicit -:~~=<path> runtime option to inform the runtime system where those directories are. Here is some relevant information from INSTALL.txt:

If the --prefix option is not used, the default is to install all
files in /usr/local/Gambit and its subdirectories, namely "bin",
"lib", "include", etc.  The files that would normally go in these
subdirectories can be redirected to other directories using
the following configure options:

  --bindir=DIR            executables (gsi, gsc, ...)
  --libdir=DIR            libraries (libgambit.a, syntax-case.scm, ...)
  --includedir=DIR        C header files (gambit.h, ...)
  --docdir=DIR            documentation (gambit.pdf, gambit.html, ...)
  --infodir=DIR           info documentation (gambit.info, ...)
  --datadir=DIR           read-only architecture-independent data (gambit.el)

Note that the install target of the makefiles supports the DESTDIR
environment variable which allows staging an install to a specific
directory.  The command:

  % make install DESTDIR=/Users/feeley/stage

will install all the files in /Users/feeley/stage as though it was the
root of the filesystem.
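
If the CI ever needs to exercise a relocated install, a task step could stage one with DESTDIR and point the runtime at it with -:~~=<path>, roughly like this (a sketch; the paths are purely illustrative, and (path-expand "~~") is used only to show which installation directory the runtime ended up with):

stage_install_script:
  - make install DESTDIR=$PWD/stage
  # with the default prefix the staged tree ends up in $PWD/stage/usr/local/Gambit
  - ./stage/usr/local/Gambit/bin/gsi -:~~=$PWD/stage/usr/local/Gambit -e '(pp (path-expand "~~"))'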

lassik commented 4 years ago

Thanks! I'm quite excited about this so I can get to it right away. I'll create a lassik/cirrus branch in udem-dlteam/gambit for experimenting (IMHO it's clearest if each topic branch name says who is working on it, otherwise old mystery branches tend to be left behind in repos).

Not sure how to deal with the prefix for CI builds. Maybe something like ./configure --prefix=/usr/local/gambit/$CIRRUS_BRANCH?

feeley commented 4 years ago

Does it matter? Is the filesystem persistent?

For the OOM error, try removing the -j on the make.

lassik commented 4 years ago

It doesn't matter during the build as the CI file system vanishes afterwards. But we can pick some files to save from each successful build as artifacts so people can download them. It would probably be useful to save the gsc and gsi binaries in this way. If those depend on the prefix, then that prefix would have to be something that is likely to work sensibly on the personal computers of people who download the binaries.

feeley commented 4 years ago

Ok I understand. I guess it could be the commit hash. I guess the $CIRRUS_BRANCH is related to that. Note that historically I have used /usr/local/Gambit (capital G).

lassik commented 4 years ago

If /usr/local/Gambit is already in use we could stick with it. But in case someone has installed a stable release of Gambit in there, we'd do well to think of some prefix that doesn't overwrite that one with a cutting-edge version. The git branch or hash would be good since one can then try multiple experimental branches on the same computer. Is the hash too precise? It would change with every new commit to the same branch. The git tag is probably not precise enough, as bleeding-edge builds are not tagged yet.

lassik commented 4 years ago

$CIRRUS_BRANCH is just the Git branch. It sets other similar envars also: https://cirrus-ci.org/guide/writing-tasks/#environment-variables
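
So the prefix line could be as simple as this (CIRRUS_CHANGE_IN_REPO is listed on that page as the commit SHA; treat the exact variable name as something to double-check):

configure_script: ./configure --prefix=/usr/local/Gambit/$CIRRUS_BRANCH
# or, to key the install on the commit instead of the branch:
#   ./configure --prefix=/usr/local/Gambit/$CIRRUS_CHANGE_IN_REPO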

feeley commented 4 years ago

Here's a useful configure option:

  --enable-multiple-versions
                          multiple installed versions are supported

It will put the installation in $prefix/<version> where version is something like v4.9.3 and will create the symbolic link current to point to the latest installed version. So using the commit hash as a "version" will not interfere.
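
In other words, the configure step could be something like this (a sketch; the layout is as described above):

configure_script: ./configure --enable-multiple-versions
# with the default /usr/local/Gambit prefix this installs into
# /usr/local/Gambit/v4.9.3 (i.e. $prefix/<version>) and updates the
# "current" symlink to point at the newest installed version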

lassik commented 4 years ago

Very nice! How is the <version> set and how does it determine which version is latest?

lassik commented 4 years ago

Without -j the make succeeds in Cirrus. But one unit test fails:

[223|  0]  94% ###############. 110.9s 13-modules/prim_port.scm
*** FAILED 13-modules/prim_port.scm WITH EXIT CODE HI=1 LO=0
(call-with-output-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #!void but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #<os-exception #3> but expected #<os-exception #4>
(call-with-output-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #5> but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #6> but expected #<os-exception #4>

feeley commented 4 years ago

Hey the tests passed except for 1 out of 237. Don't know what's happening with "echo".

I must say the build machine is underpowered... it took 110 seconds for the tests, and on my 2013 laptop it takes ~10 seconds for a make ut.

feeley commented 4 years ago

For the failing unit test, the result of (call-with-output-process "echo" ##newline) is compared to the result of (##call-with-output-process "echo" ##newline). It seems the first one is returning an OS exception and the second one is executing fine without an exception. This is very strange as these should be equivalent.

Note that echo is used because it is a program (or command) that exists on unix and windows, in this particular case a newline is sent to the echo program that ignores it, so nothing bad should happen. It has never caused problems in the past on my development machines, and Travis. Could it be a transient problem? Could it be related to how slow the machine is? Maybe the build should be repeated to make sure...

feeley commented 4 years ago

Hmmm I misread the errors. It seems they all (except for one) fail with an os-exception... Could it be that there's a limit on subprocesses?

lassik commented 4 years ago

I must say the build machine is underpowered... it took 110 seconds for the tests, and on my 2013 laptop it takes ~10 seconds for a make ut.

Their test runners are rented on-demand from Google cloud. Since we are using the free service, they probably use some of the slower offerings. It is possible to plug your own build server into Cirrus for free (and to some other CI services, e.g. GitLab). I'm not familiar with the technical details. The paid accounts on these services are presumably faster/have more RAM.

Since some of the tests do I/O, there may be some additional factor causing slowdown - not sure. 10x slowdown is quite a lot.

The FreeBSD jobs also often take a lot of time to get going because there's limited capacity and a queue for the free service. The others don't have FreeBSD at all.

For the failing unit test, the result of (call-with-output-process "echo" ##newline) is compared to the result of (##call-with-output-process "echo" ##newline). It seems the first one is returning an OS exception and the second one is executing fine without an exception. This is very strange as these should be equivalent.

Note that echo is used because it is a program (or command) that exists on unix and windows, in this particular case a newline is sent to the echo program that ignores it, so nothing bad should happen. It has never caused problems in the past on my development machines, and Travis. Could it be a transient problem? Could it be related to how slow the machine is? Maybe the build should be repeated to make sure...

It would be quite strange if speed problems caused an exception. I'll add the artifact settings, which will trigger another build. If the echo still fails, we could add a debug print to get details about the exception.

feeley commented 4 years ago

Very nice! How is the <version> set and how does it determine which version is latest?

The version is $PACKAGE_VERSION from the configure script. My point is that if the user's Gambit was configured with --enable-multiple-versions there will be no conflict if you use --prefix=/usr/local/Gambit/<commit_hash>.

Perhaps the configure script should have a mechanism to specify the version, other than using $PACKAGE_VERSION which would be the default. Maybe a --enable-version=<commit_hash>.

lassik commented 4 years ago

Would be nice to have an --enable-single-host build as well, but it will probably be extremely slow and possibly run out of RAM. We get what we pay for :)

lassik commented 4 years ago

https://cirrus-ci.org/guide/linux/ says:

Linux Community Cluster is a Kubernetes cluster running on Google Kubernetes Engine that is available free of charge for Open Source community. Paying customers can also use Community Cluster for personal private repositories or buy CPU time with compute credits for their private organization repositories.

Community Cluster is configured the same way as anyone can configure a personal GKE cluster as described here.

By default a container is given 2 CPUs and 4 GB of memory but it can be configured in .cirrus.yml:

Containers on Community Cluster can use maximum 8.0 CPUs and up to 24 GB of memory.
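
The snippet the quote refers to ("configured in .cirrus.yml:") is along these lines (the image and numbers here are placeholders):

container:
  image: ubuntu:latest
  cpu: 8
  memory: 24G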

feeley commented 4 years ago

I like the idea of running the CI locally. But we can explore that later if Cirrus gives the features we need.

You should try --enable-single-host in the next round. As long as there's no parallel build I'm confident it will pass.

lassik commented 4 years ago

Can you estimate how many gigs of RAM it needs?

lassik commented 4 years ago

We can try being greedy and using the maximum of 8 CPUs and 24 GB RAM. Maybe they have a fixed capacity, so if you request higher specs you'll have to wait longer in the queue for the build to start.

feeley commented 4 years ago

Can you estimate how many gigs of RAM it needs?

The memory usage of gcc is probably the limiting factor and it greatly depends on the gcc version. For gcc 8, which I use regularly, I haven't seen more than 1 GB RSS per gcc compilation.

It should be viewed as a bug if 4 GB RAM is not enough to build Gambit (for a non-parallel build).

lassik commented 4 years ago

Good to know. I tried using max CPU and RAM and make -j; now it went really fast. The log looks like it tries to run configure and the compilation three times (???) and finally fails at something. https://cirrus-ci.com/task/6161948470673408

feeley commented 4 years ago

If we run the CI locally we would be using machines where a make -j is not a problem and is fast. Here are some build times on some of these machines (for a build including the git clone):

380 seconds    gambit.iro.umontreal.ca     recent mac mini with 6 cores
240 seconds    arctic.iro.umontreal.ca     liquid cooled Debian linux PC

That's a lot better than the ~1200 seconds on the Kubernetes cluster, and similar to the max CPU and RAM case (with no guarantees, I assume).

For the build issue, could you try removing the -j? I suspect it is a concurrency issue with the makefiles (probably related to git).

feeley commented 4 years ago

Without -j the make succeeds in Cirrus. But one unit test fails:

[223|  0]  94% ###############. 110.9s 13-modules/prim_port.scm
*** FAILED 13-modules/prim_port.scm WITH EXIT CODE HI=1 LO=0
(call-with-output-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #!void but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (include "~~lib/_prim-port#.scm") returned #<os-exception #3> but expected #<os-exception #4>
(call-with-output-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #5> but expected #<os-exception #2>
(with-output-to-process "echo" ##newline) in (namespace ("##")) returned #<os-exception #6> but expected #<os-exception #4>

I suspect a problem with subprocesses... Is there an easy way to run tests on the Cirrus machine without having to rebuild from scratch? It really makes experimenting painful...

Anyway, please run this test:

;; File: process-issue.scm

(define (run n)
  (let loop ((i 0))
    (if (< i n)
        (begin
          (if (= 0 (modulo i 100)) (println "--------- " i))
          (with-exception-catcher
           (lambda (exc)
             (println "*** after " i " iterations got this exception:")
             (display-exception exc))
           (lambda ()
             (call-with-output-process "echo" ##newline)))
          (loop (+ i 1))))))

(run 10000)

lassik commented 4 years ago

Is there an easy way to run tests on the Cirrus machine without having to rebuild from scratch? It really makes experimenting painful...

Nope, none of the CI services have ever had that. Since it's just a Docker container, you can run the base image locally, but of course it's not exactly the same environment.

feeley commented 4 years ago

Actually, on my macOS machine I got this:

% gsi process-issue.scm 
--------- 0
--------- 100
--------- 200
--------- 300
--------- 400
*** after 493 iterations got this exception:
Broken pipe
(force-output '#<input-output-port #2 (process "echo")>)

So this probably explains why the test is failing on Cirrus (maybe the probability of failure is higher because the echo process terminates faster there than on my test machine and the race is more often won by the echo process).

I think this should be declared a problem with the unit test. Otherwise, if Gambit's runtime system is modified to automatically silence the broken pipe error, there might be problematic situations where you actually care to know that the process died "prematurely" and you can't know.

lassik commented 4 years ago

Wow, that's weird.

I think this should be declared a problem with the unit test. Otherwise, if Gambit's runtime system is modified to automatically silence the broken pipe error, there might be problematic situations where you actually care to know that the process died "prematurely" and you can't know.

Agreed. Silencing errors is generally a recipe for worse problems later on.

I can disable that unit test and add a comment explaining why.

feeley commented 4 years ago

Instead of calling echo, call sort which also exists on unix and windows... The sort process will wait for the input to be provided.

lassik commented 4 years ago

Please check out the latest build: https://cirrus-ci.com/task/5067557282775040

The build stage runs fine, but your process-issue.scm test gives tons of broken pipe errors with the echo command all the time. So that's what caused it for sure.

The end of the build stage says **** to compile the builtin modules you should: make modules. Is there a principled way to deal with module building in CI so that it would always work right, or are there going to be some bootstrap issues sometimes no matter how it is done?

feeley commented 4 years ago

Indeed the echo dying fast(er) on the Cirrus machine is clearly the issue.

The modules aren't built automatically by the build steps because a make modules builds them as dynamically loadable .o1 files (OS "shared libraries") and, although this will work fine on Linux, macOS and Windows, it is possible some OS in the wild does not support this and a make would end with a build failure. Given that Gambit currently works fine without building the modules, it is an optional but recommended step (none of the modules currently call FFI code or use external OS libraries). For the CI builds, a make modules should always work except for the setting where --enable-ansi-c is configured... which precludes calling dlopen.

lassik commented 4 years ago

I added a make modules step to the CI. In the .yml file we always control which exact OS runs each build, and we know that the architecture is x86-64, so it shouldn't be a problem even to have very specific settings in there. The build commands can also vary by OS.

I'll try adding MacOS and FreeBSD next. Windows is also supported, but we have to figure out how to set up the compilers and run them. I've done Cygwin and Msys gcc builds on Cirrus. MSVC must be available too but I don't know how to run it.
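
As a sketch, separate per-OS tasks with the make modules step might look roughly like this (the instance types, images and test target are placeholders to check against the Cirrus docs and the Gambit makefiles):

linux_task:
  container:
    image: ubuntu:latest              # placeholder
  build_script:
    - ./configure
    - make
    - make modules                    # optional but recommended, per the comment above
  test_script: make ut

macos_task:
  macos_instance:
    image: big-sur-base               # placeholder; check the Cirrus docs for current images
  build_script:
    - ./configure
    - make
    - make modules
  test_script: make ut

freebsd_task:
  freebsd_instance:
    image_family: freebsd-13-2        # placeholder
  build_script:
    - ./configure
    - make
    - make modules
  test_script: make ut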

The make without -j is really slow; I'll try adding it back.

lassik commented 4 years ago

There appears to be something about make -j that causes the build to try to do things more than once and then fail. https://cirrus-ci.com/task/4792435783237632 is an example.

lassik commented 4 years ago

This is not supposed to happen in normal CI, right?

**** checking if current gsc-boot must be built
**** building boot/gsc-boot

feeley commented 4 years ago

I suspect that the deselect-gen-for-commit is done too early... in other words there is a missing makefile dependency for deselect-gen-for-commit to cause it to happen after the files are generated.

feeley commented 4 years ago

This is not supposed to happen in normal CI, right?

**** checking if current gsc-boot must be built
**** building boot/gsc-boot

Yes, but only once. It is because a full bootstrap must happen since we are starting with a fresh clone.

lassik commented 4 years ago

Good to know.

MacOS CI now works, except that 2 tests fail:

[161|  0]  67% ##########...... 6.6s 06-thread/thread_join.scm
*** FAILED 06-thread/thread_join.scm WITH EXIT CODE HI=1 LO=0
"unit-tests/06-thread/thread_join.scm"@30.1: FAILED (check-tail-exn join-timeout-exception? (lambda () (thread-join! t2 .001))) GOT 2
"unit-tests/06-thread/thread_join.scm"@32.1: FAILED (check-equal? (thread-join! t2 -1 123) 123) GOT 2
"unit-tests/06-thread/thread_join.scm"@33.1: FAILED (check-equal? (thread-join! t2 .001 123) 123) GOT 2

[165|  1]  70% ###########..... 7.6s 06-thread/thread_start.scm
*** FAILED 06-thread/thread_start.scm WITH EXIT CODE HI=1 LO=0
"unit-tests/06-thread/thread_start.scm"@32.1: FAILED (check-equal? var 1) GOT 0

feeley commented 4 years ago

Those are probably race conditions too...

feeley commented 4 years ago

That is strange... the behaviour makes me suspect that the heartbeat interrupts (for preemptive scheduling of the green threads) are somehow disabled. But the previous tests indicate that the heartbeat interrupts are at ~1000 Hz. So I don't understand how that could be the case when the tests run fine on my macOS development machine (using clang and gcc-8).