@jjgomezcadenas here's my view of what needs to happen to get a simple and useful mechanism to do the version testing you guys need.
We're actually very close; I'm pretty sure I can get it working in a day or two given the correct credentials and the scripts. The functionality, from the point of view of the CI system, is already implemented using dummy scripts to simulate the cluster tests.
Quick update, this mainly concerns @mmkekic, the others can safely ignore this issue for now:
I have access to the cluster and have successfully submitted a trivial job via qsub, yay!
The cluster-tests.sh script is slightly less trivial now, given the correct credentials:
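Roughly, it now does something like this (the helper names, flags and `run_all.sh` entry point below are illustrative, not the actual contents):

```bash
#!/usr/bin/env bash
# rough sketch of cluster-tests.sh: build the job artifact, ship it to the
# cluster, run the jobs there; names and paths are illustrative.
set -euo pipefail

CLUSTER=icdev@majorana1.ific.uv.es
KEY=credentials/id_rsa   # .gitignored private key made available to the CI task

python miguel_jobs.py                # compile the specs into build/jobs/
rsync -az -e "ssh -i $KEY" build/jobs "$CLUSTER":/data_extra2/icdev/miguel_scratch/
ssh -i "$KEY" "$CLUSTER" "/data_extra2/icdev/miguel_scratch/jobs/run_all.sh"
```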
There are rudimentary docs describing how to use `fly execute` to test out the script from local content.
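In short, a local run of the task looks something like this (the target name, task file path and input mapping are placeholders):

```bash
fly -t ci execute --config concourse-ci/cluster-tests-task.yml --input IC=.
```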
The CI pipeline now does real work, check it out at https://ci.ific-invisible-cities.com/. The test inputs come from /data_extra2/mmkekic/example_inputs.
Jobs are specified in Python via a DSL; it's a heavyweight approach but life is too short to NOT automate this haha. For example, here's the current specification for the testing jobs, found in the miguel_jobs.py script:
```python
config = Config(
    bashrc="/home/icdev/.bashrc",
    conda_sh="/home/icdev/miniconda/etc/profile.d/conda.sh",
    conda_activate="IC-3.7-2018-11-14",
    conda_lib="/home/icdev/miniconda/lib",
    remote_dir="/data_extra2/icdev/miguel_scratch",
)

specs = [
    CitySpec(
        city="irene",
        input_path="/data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
        ic_version="master",
    ),
    CitySpec(
        city="dorothea",
        input_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_kdst.h5",
        ic_version="master",
    ),
    CitySpec(
        city="irene",
        input_path="/data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
        ic_version="pr",
    ),
    CitySpec(
        city="dorothea",
        input_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
        output_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_kdst.h5",
        ic_version="pr",
    ),
]
```
This specification is compiled into a local artifact containing the versions and bash scripts that chain the jobs together (right now I'm just making them serially dependent; in the future I'll use the implicit dependency graph based on their outputs to parallelize the jobs). It can be locally inspected and then rsynced to the cluster.
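To give a feel for the compiled artifact, a generated job script for the first CitySpec above looks roughly like this (the wrapper details, checkout layout and exact city invocation are a sketch, not the precise output of the DSL):

```bash
#!/usr/bin/env bash
# Sketch of a generated job script for the first CitySpec above; details illustrative.
set -euo pipefail

source /home/icdev/.bashrc
source /home/icdev/miniconda/etc/profile.d/conda.sh
conda activate IC-3.7-2018-11-14
export LD_LIBRARY_PATH="/home/icdev/miniconda/lib:${LD_LIBRARY_PATH:-}"  # conda_lib

# checkout of the requested ic_version lives under remote_dir
cd /data_extra2/icdev/miguel_scratch/master
city irene irene_ci.conf  # config sets files_in/file_out to the spec's input/output paths
```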
The runner submits the (now serially dependent) jobs and waits for the last job to complete, periodically printing the current job queue to the screen. The last job writes a file to signal successful completion. Log output is noisy but easy to understand, e.g.:
```
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
126614.majorana1          job_0.sh         icdev                  0 Q short
126615.majorana1          job_1.sh         icdev                  0 H short
126616.majorana1          job_2.sh         icdev                  0 H short
126617.majorana1          job_3.sh         icdev                  0 H short
126618.majorana1          all_ok.sh        icdev                  0 H short
```
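The wait-and-poll logic in the runner is nothing fancy; a minimal sketch of the idea (function and file names are illustrative, not the actual implementation):

```python
import os
import subprocess
import time


def wait_for_jobs(done_file, poll_seconds=60):
    """Poll the PBS queue until the final all_ok job writes its marker file."""
    while not os.path.exists(done_file):
        # print the current queue so the CI log shows progress
        print(subprocess.run(["qstat"], capture_output=True, text=True).stdout)
        time.sleep(poll_seconds)
    print("all jobs completed")
```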
I'm limited by my understanding of what we want to test (e.g. you'll notice there's a single conda installation for both branches; it's pretty likely we'll want a way to compare different conda environments), so that will evolve as you guys tell me what the tests should be doing.
There's quite a lot of cargo culting on my side, especially with the conda environment; I'm sure it can be slimmed down and simplified considerably.
I've added functionality to express comparison jobs in addition to city jobs, see the updated job specification:
```python
ComparePmapSpec(
    master_path="/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5",
    pr_path="/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5",
    output_path="/data_extra2/icdev/miguel_scratch/outputs/compare_run_6971_0009_trigger1_pmaps.txt",
    ic_version="master",
)
```
Right now this only runs h5diff, reporting the output on nonzero exit status, but it can easily be extended to accommodate arbitrary comparison functions and outputs. I'll ask you guys about that next week and can write it next weekend.
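Concretely, a comparison job currently boils down to something like this (paths taken from the spec above; the report wording is illustrative):

```bash
#!/usr/bin/env bash
# run h5diff on the master and pr pmaps; keep its output only if they differ
master=/data_extra2/icdev/miguel_scratch/outputs/master_run_6971_0009_trigger1_pmaps.h5
pr=/data_extra2/icdev/miguel_scratch/outputs/pr_run_6971_0009_trigger1_pmaps.h5
report=/data_extra2/icdev/miguel_scratch/outputs/compare_run_6971_0009_trigger1_pmaps.txt

if h5diff "$master" "$pr" > "$report"; then
    echo "OK: no differences found" > "$report"
else
    echo "DIFFERENCES FOUND, h5diff output above" >> "$report"
fi
```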
I understand these jobs should run on all the files, or a large subset of the files, at majorana1.ific.uv.es:/analysis/6971/hdf5/data. The current approach can easily be made to emit the optimally parallel job sequencing, but I need to check if I'm going to hit job scheduling limits, as a comparison on all those files will mean submitting thousands of jobs.
amazing work, Miguel! Thanks!
Question for @jmbenlloch:
To get a ballpark figure for the number of jobs, I counted the .h5 files in /analysis/6971/hdf5/data:
```
[icdev@majorana1 data]$ ls *.h5 | wc
   5944    5944  213984
```
and multiplied by 5, since each input file generates roughly five jobs (irene and dorothea for both master and the PR, plus a comparison).
My plan is to emit all jobs and their dependencies to PBS; for simplicity I make them serially dependent now but I can easily build a minicompiler to get the dependency graph between jobs.
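The minicompiler really is mini: a sketch of how the dependency graph falls out of the specs' input and output paths (the attribute names below are illustrative, not necessarily those of the real spec classes):

```python
def dependency_graph(specs):
    """Map each job index to the indices of the jobs that produce its inputs."""
    producers = {spec.output_path: i for i, spec in enumerate(specs)}
    deps = {}
    for i, spec in enumerate(specs):
        # assumed: every spec exposes its input path(s) one way or another
        inputs = getattr(spec, "input_paths", [getattr(spec, "input_path", None)])
        deps[i] = sorted(producers[p] for p in inputs if p in producers)
    return deps


def emit_qsub(specs):
    """Emit qsub lines with -W depend=afterok:... for the derived graph."""
    lines = []
    for i, parents in dependency_graph(specs).items():
        dep = ""
        if parents:
            dep = "-W depend=afterok:" + ":".join(f"$job_{p}" for p in parents) + " "
        lines.append(f"job_{i}=$(qsub {dep}jobs/job_{i}.sh)")
    return "\n".join(lines)
```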
I'd like to emit them all and let PBS sort it out; basically I'd emit a long series of commands like:
```bash
job_0=$(qsub jobs/job_0.sh)
job_1=$(qsub -W depend=afterok:$job_0 jobs/job_1.sh)
job_2=$(qsub -W depend=afterok:$job_1 jobs/job_2.sh)
job_3=$(qsub -W depend=afterok:$job_2:$job_1 jobs/job_3.sh)
job_4=$(qsub jobs/job_4.sh)
...
```
I'd rather not build my own ad-hoc job control system to emit jobs in smaller batches if PBS will do the job; are there any fundamental limitations with this approach, or do you think it'll work into the thousands of jobs?
I'm following the convention that each comparison job emits a directory as output; these are collected and can be emitted in a variety of output formats (e.g. .txt for now, html coming).
I'm using h5diff to prototype this, and when the time comes I can just take whatever jobs you guys are running, emit them to a directory, and assemble them into fancy reports. For example, I can use nbconvert to emit html from ipython notebooks if you'd like to generate comparison reports using .ipynb files.
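For reference, that last option is a one-liner with jupyter's CLI (the notebook name is a placeholder):

```bash
jupyter nbconvert --execute --to html comparison_report.ipynb
```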
So I'll defer asking you about the comparison part until I'm ready to plug it in :)
Hi @miguelsimon, the work you are doing is awesome, looks very good :)
Regarding your question:
are there any fundamental limitations with this approach or do you think it'll work into the thousands of jobs?
I'm not aware of any limitation in that sense, but it is also true that I have never tried to launch that number of dependent jobs... I'd say try it out; if everything works, it is probably the easiest solution.
Haha @jmbenlloch, I like the empirical approach; I'll test it out when nobody else is using the cluster.
The first version of html report generation is up; I'm not the world's finest interface designer, but it fits my desiderata:
Here's an example using h5diff on part of the /analysis/6971/hdf5/data set; the first screenshot illustrates the overview:
And the second one illustrates the toggling of detailed output for a job:
Next step is to upload those reports somewhere. It should be very easy to plug in your real comparison scripts in place of h5diff output so I'm focusing on other stuff.
@mmkekic will be at ific the week of the 10th of September and says that's a good time, so I'll visit you guys then and see if we can get the first version working that week.
The last remaining chunk of functionality we need is a public (static) http server to upload & browse comparison output results.
Once we have that, the whole system will be working end-to-end for PRs opened on the miguelsimon/IC repo and we can test it out; remaining stuff like implementing realistic comparison jobs instead of h5diff and compiling the job dependencies for optimum parallelism are very important but can be done incrementally, requiring less synchronization.
Given that this will all end up running on IFIC infrastructure anyway it'd be great if you guys could provide the static http server for this @jmbenlloch. I'd need:
- a static http server running on some named host, e.g. ci-data.ific.uv.es, that's publicly accessible
- a username and a private key (or password) with rsync permissions to the static html folder, so the CI can upload the comparison results
- the path of the static html folder

Is that easy for you guys to set up in the next week or so, so I can get all my ducks in a row for the 10th?
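With those three things in place, the upload step from the CI would be roughly the following (user, host and folder are placeholders until they exist):

```bash
rsync -az -e "ssh -i credentials/ci_upload_key" reports/ ci@ci-data.ific.uv.es:/var/www/ci-data/reports/
```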
Yes, BY ALL MEANS. Thanks Miguel!
@jocarbur is experimenting with the deploy to see how it should fit into the ific infrastructure. The documentation needs an update: the role of the .gitignored credentials folder needs to be explained explicitly by @miguelsimon.
Next steps might be:
- update the simple-cluster-tests.sh script so html summaries are uploaded to the ific http server
- implement realistic comparison jobs in simple-cluster-tests.sh, this requires the following subgoals:
  - decide what input files to run on (currently /data_extra2/mmkekic/example_inputs/run_6971_0009_trigger1_waveforms.h5, chosen by @miguelsimon at random)

Our current objective is to implement one sensible histogram comparison for one city, as discussed on Thursday.
I've refactored the code in line with that goal:
As described in the README, the pipeline now generates the output h5 pmap files for PR and master and retrieves them from the cluster.
This should make it easy for @mmkekic to develop the first simple histogram-comparing script, which takes those output h5 pmaps as inputs.
Once the histogram-comparing script is ready I can easily take its output and format it as html, and get the pipeline to upload it to the ific http server once that's online. The task list on the post above has been updated to reflect this.
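To make the interface concrete, here's a minimal sketch of what such a script could look like; the table path and column name inside the pmap files are assumptions rather than the real IC schema, and the discrepancy measure is a placeholder for whatever @mmkekic decides is sensible:

```python
import sys

import numpy as np
import tables as tb


def load_s2_energies(path):
    """Read the S2 energy column from a pmap file (assumed table location)."""
    with tb.open_file(path) as f:
        return f.root.PMAPS.S2.col("ene")


def compare(master_path, pr_path, n_bins=50):
    master = load_s2_energies(master_path)
    pr = load_s2_energies(pr_path)
    lo = min(master.min(), pr.min())
    hi = max(master.max(), pr.max())
    h_master, _ = np.histogram(master, bins=n_bins, range=(lo, hi))
    h_pr, _ = np.histogram(pr, bins=n_bins, range=(lo, hi))
    return int(np.abs(h_master - h_pr).sum())  # placeholder discrepancy measure


if __name__ == "__main__":
    discrepancy = compare(sys.argv[1], sys.argv[2])
    print(f"total bin-count discrepancy: {discrepancy}")
    sys.exit(0 if discrepancy == 0 else 1)
```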
@jocarbur has successfully deployed concourse to https://gpu1next.ific.uv.es/.
We've decided to serve the html files from within the same docker-compose installation that runs concourse; I'll extend the docker-compose.yml file @jocarbur is using for https://gpu1next.ific.uv.es/ with this functionality.
Once @jmbenlloch is back and it's convenient for @jocarbur as well I'll head over to ific to set it up and the three of us can talk about it.
@jocarbur has set up the http server for static html files in the same docker-compose install that houses concourse and nginx.
Uploads go to icdev@html_repo:/downloads/ (this is only visible within the docker networking context). The simple-cluster-tests.sh script now writes its .h5 outputs to the static file server and they're visible at https://gpu1next.ific.uv.es:4443/downloads/.
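Within the pipeline that upload is just a copy over the docker network, roughly (key handling and exact destination are illustrative):

```bash
scp -i credentials/html_repo_key outputs/*.h5 icdev@html_repo:/downloads/
```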
After discussing it with @jocarbur and @jmbenlloch today, the next step is getting @jocarbur's current gpu1next configuration into source control so that it's trivial to redeploy if the machine crashes. To that end, we've decided that @jmbenlloch can:
- take @jocarbur's current docker-compose.yml file and save it as IC/concourse-ci/gpu1next-docker-compose.yml
- add a launch_gpu1next_concourse rule to the IC/concourse-ci/Makefile, following the example of the launch-prod-concourse rule (see the sketch below)
- update IC/concourse-ci/README.md accordingly
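A minimal sketch of what that Makefile rule would run (flags are an assumption; the real launch-prod-concourse rule may do more):

```bash
# body of the hypothetical launch_gpu1next_concourse rule
docker-compose -f gpu1next-docker-compose.yml up -d
```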
There's one upside to the fact that nobody seems to care about using this haha: we can tear down the current gpu1next setup and deploy by cloning the repo and following the instructions. If that works, we've validated that we can redeploy after machine failure and take it from there.
The required functionality is provided by the current setup; possible improvements:
@mmkekic and @jmbenlloch let's use this issue to track stuff related to running cluster tests on PRs as we currently lack a common communications channel.
I created the issue in the fork as opposed to the main repo because I'm unsure whether you want me to spam all the other IC contributors.
It's in the spirit of @jmbenlloch's proposal in the issue describing the problem, but more lightweight, which means we should be able to get it to work quickly. Quickly means in one day, if we've got the scripts that do the PBS submit and result summarization.
Conclusions I got from our conversation last Thursday; I'll write them down before I forget:

- we work on miguelsimon/IC while working on this feature; once we've got it working here I'll PR nextic/IC so other people can evaluate it, and if it's acceptable, maintenance will be taken over by @jmbenlloch and @mmkekic

@jmbenlloch there was a bug in my build pipeline we would have caught if your laptop hadn't run out of space when running the local build haha, that's fixed now.
I'll be (somewhat) available during the weekends to work on this.