IC cluster tests on concourse CI

miguelsimon / IC

0 stars 0 forks source link

IC cluster tests on concourse CI #3

Open miguelsimon opened 5 years ago

miguelsimon commented 5 years ago

@mmkekic and @jmbenlloch let's use this issue to track stuff related to running cluster tests on PRs as we currently lack a common communications channel.

I created the issue in the fork as opposed to the main repo because I'm unsure on whether you want me to spam all the other IC contributors.

It's in the spirit of @jmbenlloch's proposal in the issue describing the problem but more lightweight, which means we should be able to get it to work quickly. Quickly means in one day if we've got the scripts that do the pbs submit and result summarization.

Conclusions I got from our conversation last thursday, I'll write them down before I forget:

job duration has no upper bound eg they might take a day to run
@jmbenlloch will want to host this on ific infrastructure eventually, we're running on a trial version of gce at the moment hosted here https://ci.ific-invisible-cities.com/
the artifacts emitted by the comparison (some sort of report with statistics, maybe .hdf5 files) will probably be uploaded to an ftp server hosted at ific and linked from the PR for easy consumption
Initially @jmbenlloch and @mmkekic will PR to miguelsimon/IC while working on this feature, once we've got it working here I'll PR nextic/IC so other people can evaluate and if it's acceptable maintenance will be taken over by @jmbenlloch and @mmkekic
access control to the CI server will be mediated by membership in the github nextic organization using concourse github auth this is trivial to set up and it eliminates lots of tedious operational inconveniences and wanking around with passwords

@jmbenlloch there was a bug in my build pipeline we would have caught if your laptop hadn't run out of space when running the local build haha, that's fixed now.

I'll be (somewhat) available during the weekends to work on this.

miguelsimon commented 5 years ago

@jjgomezcadenas here's my view of what needs to happen to get a simple and useful mechanism to do the version testing you guys need.

We're actually very close; I'm pretty sure I can get it working in a day or two given the correct credentials and the scripts, the functionality from the point of view of the CI system is already implemented using dummy scripts to simulate the cluster tests.

Scripts that should run locally given correct credentials (mostly @mmkekic)

[x] script that rsyncs master and candidate directories to a known location on the cluster
[x] script that submits output jobs to pbs and polls until completion:
- one job generates output from master
- another job generates output from candidate
[x] script that submits comparison job to pbs and polls until completion
[x] script that collects comparison job output via rsync

Integration (mostly @miguelsimon)

[x] concourse CI server accessible on the internet for proof of concept
[x] PR triggers unit tests, disallow merge on unit test failure
[x] PR triggers fake cluster tests if unit tests pass
[x] PR allows manual triggering of real cluster tests if unit tests pass (as proposed by @jmbenlloch), to do this I'll need:
- [x] access to IC slack to easily communicate with @jmbenlloch, @mmkekic and other people involved
- [x] credentials for accessing the cluster from the concourse installation, ideally public/private keypair to ssh into the cluster
- [ ] the scripts that do the actual work, described above
[ ] comparison summaries hosted somewhere accessible for proof of concept, linked from PR
[ ] desired workflow successfully tested with PRs on miguelsimon/IC by multiple participants

Maintenance (mostly @jmbenlloch and @mmkekic I guess)

[ ] feature integrated via PR from miguelsimon/IC to nextic/IC
[ ] branch protection rules set up in nextic/IC
[ ] concourse CI server hosted on IFIC infrastructure
[ ] comparison summaries hosted on IFIC infrastructure
[ ] access to concourse CI server mediated by membership in the github nextic organization

miguelsimon commented 5 years ago

Quick update, this mainly concerns @mmkekic, the others can safely ignore this issue for now:

I have access to the cluster and have successfully submitted a trivial job via qsub, yay!

The cluster-tests.sh script is slightly less trivial now, given the correct credentials:

it sshs to the cluster to check connectivity
it rsyncs the IC and IC_master inputs to (currently hardcoded) paths in the cluster
the jobs that do the actual work are left as an exercise to the reader ;)

There's rudimentary docs describing how to use fly execute to test out the script from local content.

miguelsimon commented 5 years ago

The CI pipeline now does real work, check it out at https://ci.ific-invisible-cities.com/:

if unit tests pass the cluster test step can be manually triggered per @jmbenlloch's proposed workflow, you can test it out by pushing the "+" in the concourse web interface
it submits real city jobs (testing irene and dorothea now) on what I assume are small datasets found in /data_extra2/mmkekic/example_inputs
it places the outputs in a known location on the cluster
it waits for the jobs to finish
it does not yet compare the .h5 outputs, I'll ask about how to do this next week.

Jobs are specified in python via a dsl, it's a heavyweight approach but life is too short to NOT automate this haha. For example, here's the current specification for the testing jobs, found in the miguel_jobs.py script:

config = Config(
    bashrc="/home/icdev/.bashrc",
    conda_sh="/home/icdev/miniconda/etc/profile.d/conda.sh",
    conda_activate="IC-3.7-2018-11-14",
    conda_lib="/home/icdev/miniconda/lib",
    remote_dir="/data_extra2/icdev/miguel_scratch",
)

specs = [
    CitySpec(
        city="irene",
        input_path="/data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
        ic_version="master",
    ),
    CitySpec(
        city="dorothea",
        input_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_kdst.h5",
        ic_version="master",
    ),

    CitySpec(
        city="irene",
        input_path="/data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
        ic_version="pr",
    ),
    CitySpec(
        city="dorothea",
        input_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_kdst.h5",
        ic_version="pr",
    ),
]

This specification is compiled into a local artifact containing the versions and bash scripts that chain the jobs together (right now I'm just making them serially dependent, in future I'll use the implicit dependency graph based on their outputs to parallelize the jobs). It can be locally inspected and then rsynced to the cluster.

The runner submits the (now serially dependent) jobs and waits for the last job to complete, periodically printing the current job queue to screen. The last job writes a file to signal successful completion. Log output is noisy but easy to understand eg.

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
126614.majorana1           job_0.sh         icdev                  0 Q short          
126615.majorana1           job_1.sh         icdev                  0 H short          
126616.majorana1           job_2.sh         icdev                  0 H short          
126617.majorana1           job_3.sh         icdev                  0 H short          
126618.majorana1           all_ok.sh        icdev                  0 H short

I'm limited by my understanding of what we want to test (eg. you'll notice there's a single conda installation for both branches, it's pretty likely we'll want a way to compare different conda environments) so that will evolve as you guys tell me what the tests should be doing.

There's quite a lot of cargo culting on my side, especially with the conda environment, I'm sure it can be slimmed down and simplified considerably.

miguelsimon commented 5 years ago

I've added functionality to express comparison jobs in addition to city jobs, see the updated job specification:

ComparePmapSpec(
    master_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
    pr_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
    output_path="/data_extra2/icdev/miguel_scratch/outputs/compare_run_6971_0009_trigger1_pmaps.txt",
    ic_version="master",
)

Right now this only runs h5diff, reporting the output on nonzero exit status, but can easily be extended to accomodate arbitrary comparison functions and outputs. I'll ask you guys about that next week and can write it next weekend.

I understand these jobs should run on all the files, or a large subset of the files, at majorana1.ific.uv.es:/analysis/6971/hdf5/data. The current approach can easily be made to emit the optimally parallel job sequencing but I need to check if I'm going to hit job scheduling limits, as a comparison on all those files will mean submitting thousands of jobs.

jjgomezcadenas commented 5 years ago

amazing work, Miguel! Thanks!

miguelsimon commented 5 years ago

Job control

Question for @jmbenlloch:

To get a ballpark number for number of jobs, I counted the .h5 files in /analysis/6971/hdf5/data

[icdev@majorana1 data]$ ls *.h5 | wc
   5944    5944  213984

and multiplied by 5 for reasons.

My plan is to emit all jobs and their dependencies to PBS; for simplicity I make them serially dependent now but I can easily build a minicompiler to get the dependency graph between jobs.

I'd like to emit them all and let PBS sort it out; basically I'd emit a long series of commands like:

job_0=$(qsub jobs/job_0.sh)
job_1=$(qsub -W depend=afterok:$job_0 jobs/job_1.sh)
job_2=$(qsub -W depend=afterok:$job_1 jobs/job_2.sh)
job_3=$(qsub -W depend=afterok:$job_2  depend=afterok:$job_1 jobs/job_3.sh)
job_4=$(qsub jobs/job_4.sh)
...

I'd rather not build my ad-hoc job control system to emit jobs in smaller batches if PBS will do the job; are there any fundamental limitations with this approach or do you think it'll work into the thousands of jobs?

Comparisons & report generation

I'm following the convention that each comparison job emits a directory as output; these are collected and can be emitted in a variety of output formats (eg. .txt for now, html coming).

I'm using h5diff to prototype this, and when the time comes I can just take whatever jobs you guys are running, emit them to a directory, and assemble them into fancy reports. For example, I can use nbconvert to emit html from ipython notebooks if you'd like to generate comparison reports using .ipynb files.

So I'll defer asking you about the comparison part until I'm ready to plug it in :)

jmbenlloch commented 5 years ago

Hi @miguelsimon, the work you are doing is awesome, looks very good :)

Regarding your question:

are there any fundamental limitations with this approach or do you think it'll work into the thousands of jobs?

I'm not aware of any limitation in that sense, but it is also true that I have never tried to launch that amount of dependent jobs... I'd say you try out, if everything works, it is probably the easiest solution.

miguelsimon commented 5 years ago

Haha @jmbenlloch I like the empirical approach I'll test it out when nobody else is using the cluster.

The first version of html report generation is up; I'm not the worlds finest interface designer but it fits my desiderata:

easily navigate the results of potentially thousands of comparison jobs
easily identify failing jobs by sorting them at the top and coloring them red
toggle detailed output display with a button, so we can easily switch from overviewing all comparison jobs to detailed examination of the comparison

Here's an example using h5diff on part of the /analysis/6971/hdf5/data set; the first screenshot illustrates the overview:

And the second one illustrates the toggling of detailed output for a job:

Next step is to upload those reports somewhere. It should be very easy to plug in your real comparison scripts in place of h5diff output so I'm focusing on other stuff.

miguelsimon commented 5 years ago

@mmkekic will be at ific the week of the 10th of september and says that's a good time, so I'll visit you guys then and see if we can get the first version working that week.

The last remaining chunk of functionality we need is a public (static) http server to upload & browse comparison output results.

Once we have that, the whole system will be working end-to-end for PRs opened on the miguelsimon/IC repo and we can test it out; remaining stuff like implementing realistic comparison jobs instead of h5diff and compiling the job dependencies for optimum parallelism are very important but can be done incrementally, requiring less synchronization.

Given that this will all end up running on IFIC infrastructure anyway it'd be great if you guys could provide the static http server for this @jmbenlloch. I'd need:

a static http server running on some named host eg. ci-data.ific.uv.es that's publicly accessible
a username and a private key (or password) with rsync permissions to the static html folder, so the CI can upload the comparison results
the path of the static html folder

Is that easy for you guys to set up in the next week or so, so I can get all my ducks in a row for the 10th?

jjgomezcadenas commented 5 years ago

Yes, BY ALL MEANS. Thanks Miguel!

On 27 Aug 2019, at 17:54, miguelsimon notifications@github.com wrote:

@mmkekic https://github.com/mmkekic will be at ific the week of the 10th of september and says that's a good time, so I'll visit you guys then and see if we can get the first version working that week.

The last remaining chunk of functionality we need is a public (static) http server to upload & browse comparison output results.

Once we have that, the whole system will be working end-to-end for PRs opened on the miguelsimon/IC https://github.com/miguelsimon/IC repo and we can test it out; remaining stuff like implementing realistic comparison jobs instead of h5diff and compiling the job dependencies for optimum parallelism are very important but can be done incrementally, requiring less synchronization.

Given that this will all end up running on IFIC infrastructure anyway it'd be great if you guys could provide the static http server for this @jmbenlloch https://github.com/jmbenlloch. I'd need:

a static http server running on some named host eg. ci-data.ific.uv.es that's publicly accessible a username and a private key (or password) with rsync permissions to the static html folder, so the CI can upload the comparison results the path of the static html folder Is that easy for you guys to set up in the next week or so, so I can get all my ducks in a row for the 10th?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/miguelsimon/IC/issues/3?email_source=notifications&email_token=AB5SIDY7ETTHQLAQERMYHW3QGVE23A5CNFSM4IFN6OE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5IHDZI#issuecomment-525365733, or mute the thread https://github.com/notifications/unsubscribe-auth/AB5SIDZ4JR3WI77QHCOMWDDQGVE23ANCNFSM4IFN6OEQ.

miguelsimon commented 5 years ago

@jocarbur is experimenting with the deploy to see how it should fit into the ific infrastructure. The documentation needs an update: the role of .gitignored credentials folder needs to be explained explicitly by @miguelsimon.

Next steps might be:

[x] @jocarbur gives @miguelsimon appropriate credentials to rsync html summaries to an ific http server
[ ] @miguelsimon updates the simple-cluster-tests.sh script so html summaries are uploaded to the ific http server
[ ] @miguelsimon verifies PR workflow including http serving of summaries works correctly
[ ] first meaningful histogram comparison is implemented and called from simple-cluster-tests.sh, this requires the following subgoals:
- [ ] we choose the first dataset we're going to compare, currently it's /data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5 chosen by @miguelsimon at random
- [ ] we choose the first city we're going to run, currently it's irene chosen by @miguelsimon at random
- [ ] histogram comparison for this dataset and city output is implemented by @mmkekic

miguelsimon commented 5 years ago

Our current objective is to implement one sensible histogram comparison for one city, as discussed on thursday.

I've refactored the code in line with that goal:

I've removed the complex code generation for now as it just adds needless complexity for such a simple task
I've removed the html report generation until we have something sensible to report haha

As described in the README the pipeline now generates the output h5 pmap files for PR and master and retrieves them from the cluster.

This should make it easy for @mmkekic to develop the first simple histogram-comparing script, which takes those output h5 pmaps as inputs.

Once the histogram-comparing script is ready I can easily take its output and format it as html, and get the pipeline to upload it to the ific http server once that's online. The task list on the post above has been updated to reflect this.

miguelsimon commented 5 years ago

@jocarbur has successfully deployed concourse to https://gpu1next.ific.uv.es/.

We've decided to serve the html files from within the same docker-compose installation that runs concourse; I'll extend the docker-compose.yml file @jocarbur is using for https://gpu1next.ific.uv.es/ with this functionality.

Once @jmbenlloch is back and it's convenient for @jocarbur as well I'll head over to ific to set it up and the three of us can talk about it.

miguelsimon commented 5 years ago

@jocarbur has set up the http server for static html files in the same docker-compose install that houses concourse and nginx.

static files are uploaded to icdev@html_repo:/downloads/ (this is only visible within the docker networking context)
the icdev private key is the same key that's used for accessing the majorana cluster
results are served at https://gpu1next.ific.uv.es:4443/downloads/

The simple-cluster-tests.sh script now writes its .h5 outputs to the static file server and they're visible at https://gpu1next.ific.uv.es:4443/downloads/

miguelsimon commented 5 years ago

After discussing it with @jocarbur and @jmbenlloch today, the next step is getting @jocarbur's current gpu1next configuration into source control so that it's trivial to redeploy if the machine crashes. To that end, we've decided that @jmbenlloch can:

fork this repo
excise the hardcoded credentials from the gpu1next docker-compose.yml file and save it as IC/concourse-ci/gpu1next-docker-compose.yml
add a launch_gpu1next_concourse rule to the IC/concourse-ci/Makefile following the example of the launch-prod-concourse rule
document needed credentials in IC/concourse-ci/README.md

There's one upside to the fact that nobody seems to care about using this haha: we can tear down the current gpu1next setup and deploy by cloning the repo and following the instructions. If that works we've validated that we can redeploy after machine failure and take it from there.

The required functionality is provided by the current setup, possible improvements:

hardcode a trivial credential for the sshd user given it's not exposed outside the docker networking context; this avoids the complexity of parameterizing in terms of that credential
add a docker registry to the docker-compose file that's only visible within the docker networking context so we can store intermediate images (eg. the conda installation) and avoid rebuilding it every time a PR occurs.