
IC cluster tests on concourse CI #3

Open miguelsimon opened 5 years ago

miguelsimon commented 5 years ago

@mmkekic and @jmbenlloch let's use this issue to track stuff related to running cluster tests on PRs as we currently lack a common communications channel.

I created the issue in the fork, as opposed to the main repo, because I'm unsure whether you want me to spam all the other IC contributors.

It's in the spirit of @jmbenlloch's proposal in the issue describing the problem, but more lightweight, which means we should be able to get it working quickly. Quickly means in one day, provided we've got the scripts that do the PBS submit and result summarization.

Conclusions I took away from our conversation last Thursday; I'll write them down before I forget:

@jmbenlloch there was a bug in my build pipeline that we would have caught if your laptop hadn't run out of space when running the local build haha; that's fixed now.

I'll be (somewhat) available during the weekends to work on this.

miguelsimon commented 5 years ago

@jjgomezcadenas here's my view of what needs to happen to get a simple and useful mechanism to do the version testing you guys need.

We're actually very close; I'm pretty sure I can get it working in a day or two given the correct credentials and the scripts. The functionality, from the point of view of the CI system, is already implemented using dummy scripts that simulate the cluster tests.

Scripts that should run locally given correct credentials (mostly @mmkekic)

Integration (mostly @miguelsimon)

Maintenance (mostly @jmbenlloch and @mmkekic I guess)

miguelsimon commented 5 years ago

Quick update; this mainly concerns @mmkekic, so the others can safely ignore this issue for now:

I have access to the cluster and have successfully submitted a trivial job via qsub, yay!
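In case it's useful for reproducing this: by "trivial job" I mean something on the order of the script below, submitted with qsub hello.sh (the file name and the queue directive are just illustrative, not what I actually ran):

#!/bin/bash
#PBS -q short
# A deliberately trivial PBS job: print where and when it ran.
hostname
date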

The cluster-tests.sh script is slightly less trivial now. Given the correct credentials (a rough sketch follows the list):

  1. it SSHes to the cluster to check connectivity
  2. it rsyncs the IC and IC_master inputs to (currently hardcoded) paths on the cluster
  3. the jobs that do the actual work are left as an exercise to the reader ;)
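To make that concrete, here's a minimal sketch of the shape of the script; the host, user and remote paths are placeholders, not the real (hardcoded) values:

#!/usr/bin/env bash
# cluster-tests.sh (sketch): check connectivity, then sync inputs to the cluster.
# CLUSTER and REMOTE_DIR below are hypothetical placeholders.
set -euo pipefail

CLUSTER="icdev@cluster.example"
REMOTE_DIR="/some/scratch/dir"

# 1. Check that we can reach the cluster at all.
ssh "$CLUSTER" true

# 2. Sync the IC and IC_master inputs to the (currently hardcoded) remote paths.
rsync -az IC/        "$CLUSTER:$REMOTE_DIR/IC/"
rsync -az IC_master/ "$CLUSTER:$REMOTE_DIR/IC_master/"

# 3. Submitting the jobs that do the actual work would go here.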

There are rudimentary docs describing how to use fly execute to test out the script with local content.

miguelsimon commented 5 years ago

The CI pipeline now does real work, check it out at https://ci.ific-invisible-cities.com/:

Jobs are specified in Python via a DSL; it's a heavyweight approach, but life is too short to NOT automate this haha. For example, here's the current specification for the testing jobs, found in the miguel_jobs.py script:

# Shared cluster environment used by every job.
config = Config(
    bashrc="/home/icdev/.bashrc",
    conda_sh="/home/icdev/miniconda/etc/profile.d/conda.sh",
    conda_activate="IC-3.7-2018-11-14",
    conda_lib="/home/icdev/miniconda/lib",
    remote_dir="/data_extra2/icdev/miguel_scratch",
)

# One CitySpec per city run; each city is run for both master and the PR
# on the same input so the two outputs can be compared.
specs = [
    CitySpec(
        city="irene",
        input_path="/data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
        ic_version="master",
    ),
    CitySpec(
        city="dorothea",
        input_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_kdst.h5",
        ic_version="master",
    ),

    CitySpec(
        city="irene",
        input_path="/data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
        ic_version="pr",
    ),
    CitySpec(
        city="dorothea",
        input_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_kdst.h5",
        ic_version="pr",
    ),
]

This specification is compiled into a local artifact containing the versions and the bash scripts that chain the jobs together (right now I'm just making them serially dependent; in the future I'll use the implicit dependency graph based on their outputs to parallelize the jobs). The artifact can be inspected locally and then rsynced to the cluster.

The runner submits the (now serially dependent) jobs and waits for the last job to complete, periodically printing the current job queue to the screen. The last job writes a file to signal successful completion. Log output is noisy but easy to understand, e.g. (a sketch of the wait loop follows the sample output):

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
126614.majorana1           job_0.sh         icdev                  0 Q short          
126615.majorana1           job_1.sh         icdev                  0 H short          
126616.majorana1           job_2.sh         icdev                  0 H short          
126617.majorana1           job_3.sh         icdev                  0 H short          
126618.majorana1           all_ok.sh        icdev                  0 H short
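The wait loop itself is roughly this shape; treat it as a sketch rather than the actual runner, and note that the host and sentinel file path are made up for illustration:

# Sketch of the runner's wait loop: poll the queue until the sentinel file
# written by the final all_ok.sh job appears. Host and paths are placeholders.
CLUSTER="icdev@cluster.example"
DONE_FILE="/some/scratch/dir/all_ok"

while ! ssh "$CLUSTER" test -e "$DONE_FILE"; do
    ssh "$CLUSTER" qstat   # print the current job queue
    sleep 60
done
echo "all jobs completed successfully"
# (the real runner also needs to handle the failure case, e.g. an empty queue
# with no sentinel file; omitted here)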

I'm limited by my understanding of what we want to test (e.g. you'll notice there's a single conda installation for both branches; it's pretty likely we'll want a way to compare different conda environments), so that will evolve as you guys tell me what the tests should be doing.

There's quite a lot of cargo culting on my side, especially around the conda environment; I'm sure it can be slimmed down and simplified considerably.

miguelsimon commented 5 years ago

I've added functionality to express comparison jobs in addition to city jobs; see the updated job specification:

ComparePmapSpec(
    master_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
    pr_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
    output_path="/data_extra2/icdev/miguel_scratch/outputs/compare_run_6971_0009_trigger1_pmaps.txt",
    ic_version="master",
)

Right now this only runs h5diff, reporting the output on nonzero exit status, but it can easily be extended to accommodate arbitrary comparison functions and outputs. I'll ask you guys about that next week and can write it next weekend.
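For illustration, a comparison job of this kind boils down to something like the sketch below (the paths are the ones from the spec above; the script itself is a sketch, not the real job):

#!/usr/bin/env bash
# Sketch of an h5diff comparison job: compare the master and PR pmaps and
# keep h5diff's output only when the files differ (nonzero exit status).
MASTER=/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5
PR=/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5
OUT=/data_extra2/icdev/miguel_scratch/outputs/compare_run_6971_0009_trigger1_pmaps.txt

if h5diff "$MASTER" "$PR" > h5diff.log 2>&1; then
    echo "no differences" > "$OUT"
else
    cp h5diff.log "$OUT"   # report the diff output only on nonzero exit status
fi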

I understand these jobs should run on all the files, or a large subset of the files, at majorana1.ific.uv.es:/analysis/6971/hdf5/data. The current approach can easily be made to emit the optimally parallel job sequencing, but I need to check whether I'm going to hit job scheduling limits, as a comparison on all those files will mean submitting thousands of jobs.

jjgomezcadenas commented 5 years ago

amazing work, Miguel! Thanks!

miguelsimon commented 5 years ago

Job control

Question for @jmbenlloch:

To get a ballpark figure for the number of jobs, I counted the .h5 files in /analysis/6971/hdf5/data

[icdev@majorana1 data]$ ls *.h5 | wc
   5944    5944  213984

and multiplied by 5 for reasons, which puts us on the order of 30,000 jobs.

My plan is to emit all jobs and their dependencies to PBS; for simplicity I make them serially dependent now but I can easily build a minicompiler to get the dependency graph between jobs.

I'd like to emit them all and let PBS sort it out; basically I'd emit a long series of commands like:

job_0=$(qsub jobs/job_0.sh)
job_1=$(qsub -W depend=afterok:$job_0 jobs/job_1.sh)
job_2=$(qsub -W depend=afterok:$job_1 jobs/job_2.sh)
job_3=$(qsub -W depend=afterok:$job_2:$job_1 jobs/job_3.sh)
job_4=$(qsub jobs/job_4.sh)
...

I'd rather not build my own ad-hoc job control system to emit jobs in smaller batches if PBS will do the job; are there any fundamental limitations with this approach, or do you think it'll work into the thousands of jobs?

Comparisons & report generation

I'm following the convention that each comparison job emits a directory as its output; these are collected and can be rendered in a variety of output formats (e.g. .txt for now, html coming).

I'm using h5diff to prototype this, and when the time comes I can just take whatever jobs you guys are running, emit them to a directory, and assemble them into fancy reports. For example, I can use nbconvert to emit html from ipython notebooks if you'd like to generate comparison reports using .ipynb files.
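For reference, generating html from a notebook is a one-liner; the notebook and output directory names here are just placeholders:

# Execute a comparison notebook and render it to html in report_dir/.
jupyter nbconvert --to html --execute comparison_report.ipynb --output-dir report_dir/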

So I'll defer asking you about the comparison part until I'm ready to plug it in :)

jmbenlloch commented 5 years ago

Hi @miguelsimon, the work you are doing is awesome, looks very good :)

Regarding your question:

are there any fundamental limitations with this approach or do you think it'll work into the thousands of jobs?

I'm not aware of any limitation in that sense, but it's also true that I've never tried to launch that many dependent jobs... I'd say try it out; if everything works, it's probably the easiest solution.

miguelsimon commented 5 years ago

Haha @jmbenlloch, I like the empirical approach; I'll test it out when nobody else is using the cluster.

The first version of html report generation is up; I'm not the world's finest interface designer, but it fits my desiderata.

Here's an example using h5diff on part of the /analysis/6971/hdf5/data set; the first screenshot illustrates the overview:

[screenshot: report overview, 2019-08-03]

And the second one illustrates the toggling of detailed output for a job:

[screenshot: detailed output toggled for one job, 2019-08-03]

The next step is to upload those reports somewhere. It should be very easy to plug in your real comparison scripts in place of the h5diff output, so I'm focusing on other stuff.

miguelsimon commented 5 years ago

@mmkekic will be at IFIC the week of the 10th of September and says that's a good time, so I'll visit you guys then and see if we can get the first version working that week.

The last remaining chunk of functionality we need is a public (static) http server to upload & browse comparison output results.

Once we have that, the whole system will be working end-to-end for PRs opened on the miguelsimon/IC repo and we can test it out; the remaining work, like implementing realistic comparison jobs instead of h5diff and compiling the job dependencies for optimum parallelism, is very important but can be done incrementally, requiring less synchronization.

Given that this will all end up running on IFIC infrastructure anyway, it'd be great if you guys could provide the static http server for this @jmbenlloch. I'd need:

  1. a static http server running on some named host, e.g. ci-data.ific.uv.es, that's publicly accessible
  2. a username and a private key (or password) with rsync permissions to the static html folder, so the CI can upload the comparison results
  3. the path of the static html folder

Is that easy for you guys to set up in the next week or so, so I can get all my ducks in a row for the 10th?

jjgomezcadenas commented 5 years ago

Yes, BY ALL MEANS. Thanks Miguel!


miguelsimon commented 5 years ago

@jocarbur is experimenting with the deploy to see how it should fit into the IFIC infrastructure. The documentation needs an update: the role of the .gitignored credentials folder needs to be explained explicitly by @miguelsimon.

Next steps might be:

miguelsimon commented 5 years ago

Our current objective is to implement one sensible histogram comparison for one city, as discussed on Thursday.

I've refactored the code in line with that goal:

As described in the README, the pipeline now generates the output h5 pmap files for the PR and for master and retrieves them from the cluster.

This should make it easy for @mmkekic to develop the first simple histogram-comparing script, which takes those output h5 pmaps as inputs.

Once the histogram-comparing script is ready I can easily take its output, format it as html, and get the pipeline to upload it to the IFIC http server once that's online. The task list in the post above has been updated to reflect this.

miguelsimon commented 5 years ago

@jocarbur has successfully deployed concourse to https://gpu1next.ific.uv.es/.

We've decided to serve the html files from within the same docker-compose installation that runs concourse; I'll extend the docker-compose.yml file @jocarbur is using for https://gpu1next.ific.uv.es/ with this functionality.

Once @jmbenlloch is back and it's convenient for @jocarbur as well I'll head over to ific to set it up and the three of us can talk about it.

miguelsimon commented 5 years ago

@jocarbur has set up the http server for static html files in the same docker-compose install that houses concourse and nginx.

The simple-cluster-tests.sh script now writes its .h5 outputs to the static file server and they're visible at https://gpu1next.ific.uv.es:4443/downloads/
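For the record, the upload itself is nothing fancy; something along these lines, where the user and the destination path are placeholders for the values @jocarbur actually configured:

# Sketch: push job outputs to the static file server so they appear under /downloads/.
# The user and destination path are placeholders.
rsync -az outputs/ ci-upload@gpu1next.ific.uv.es:/srv/static/downloads/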

miguelsimon commented 5 years ago

After discussing it with @jocarbur and @jmbenlloch today, the next step is getting @jocarbur's current gpu1next configuration into source control so that it's trivial to redeploy if the machine crashes. To that end, we've decided that @jmbenlloch can:

There's one upside to the fact that nobody seems to care about using this haha: we can tear down the current gpu1next setup and redeploy by cloning the repo and following the instructions. If that works, we've validated that we can redeploy after a machine failure and take it from there.

The required functionality is provided by the current setup; possible improvements: