next-exp / IC


Version testing #661

Open · Aretno opened this issue 5 years ago

Aretno commented 5 years ago

Hi everybody,

This topic is to discuss the software version tests (in addition to the usual test functions) that we want to implement to make sure we do not end up in a situation similar to the one we are currently in.

After scratching our heads for a while, @mmkekic and I would like to propose this scheme:

  1. After each PR is approved and before merging, an automatic script will run on the majorana cluster. The script will process a big dataset (1000k files of Kr and a full run of RunV high-energy data) starting from rwf (after decoding). It will be run twice: once with the current HEAD of IC and once with the changes of the PR.

  2. The script will compare all the values of the output by checking the tables themselves (see the sketch after this list). If they match, we can all be happy. If the output is not the same, a series of plots and numbers comparing the data, at the level where the mismatch occurs (and at subsequent ones), will be created.

  3. A report will be written with the result of the comparison. If the comparison fails, the report will include a path to the plots in PDF and the relevant numbers (survival rates and the like). The report and plots should be uploaded automatically to git, and we should keep them in storage through git-lfs.

  4. The reviewer shall check the report; if it fails, the differences should be understood and explained (by the developer) before merging the PR.
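To make point 2 concrete, something like this minimal sketch is what we have in mind for the table-by-table comparison (the node traversal, tolerance and missing-table handling here are only illustrative, not the final tool):

```python
# Sketch only: compare every table that appears in two IC output files.
import numpy as np
import pandas as pd
import tables as tb

def compare_outputs(file_head, file_pr, rtol=1e-7):
    """Return a dict {table_path: bool} saying whether each table matches."""
    results = {}
    with tb.open_file(file_head) as fh, tb.open_file(file_pr) as fp:
        for node in fh.walk_nodes("/", classname="Table"):
            path = node._v_pathname
            ref  = pd.DataFrame.from_records(node.read())
            try:
                new = pd.DataFrame.from_records(fp.get_node(path).read())
            except tb.NoSuchNodeError:
                results[path] = False           # table missing in the PR output
                continue
            same_shape = ref.shape == new.shape
            same_vals  = same_shape and all(
                np.allclose(ref[c], new[c], rtol=rtol, equal_nan=True)
                if np.issubdtype(ref[c].dtype, np.number)
                else (ref[c] == new[c]).all()
                for c in ref.columns)
            results[path] = bool(same_vals)
    return results
```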

In addition to this, we would also like to add a new label on GitHub, something like 'data change', to be attached to any PR that is known to change the data. This way we can quickly identify the PRs where the processing output changed in case, for some reason, something smells fishy again.

We already talked with @jmbenlloch about whether this kind of scheme is possible and he is pretty sure we can do it. Not easy, but doable; however, not in a two-week period, as this would be major work.

I understand that this will be a more general testing framework and that the current situation will be handled by Gonzalo and Alberto, so we don't strictly need this full machinery for the current situation, but rather for future reassurance.

Details will be worked out as we work on this but, as a general idea, this is it. Suggestions, doubts, etc.?

msorel commented 5 years ago

Hi, sounds like a very complete and ambitious plan to me! My only comment: let's not test only on Kr and trigger2 data, but also on the corresponding MC files (detsim output, probably). So, running on four datasets.

jjgomezcadenas commented 5 years ago

Ambitious indeed. I see the risk that we are aiming too high and it takes too long. I would like you guys to propose a simpler prototype as a first step, one that can be quickly implemented and tested (e.g. in 2 weeks); THEN we can go for the bigger thing.

Aretno commented 5 years ago

Yeah, it's not two weeks' work, that's for sure. But I'm somehow missing the point (or missed it yesterday): I thought the evaluation of the current mismatch was in Gonzalo's hands and the evaluation of the mismatch (if it's there) on MC was in Alberto's. On the other hand, @mmkekic and I were supposed to develop a set of tools so this does not happen again.

I understand that you would want this set of tools to be tested on the current situation, but I fail to see why it's mandatory to have it in two weeks. I mean, if this works, then the study by Gonzalo and Alberto is not really needed (and they may help in this work or do other things), unless we feel that we need two independent cross-checks; is that the case?

My biggest doubt is: what are @mmkekic and I supposed to find, a solution for the current mismatch (urgent, but already being looked into by others) or a way to avoid getting into the same situation again? If it's the second one, which is what I understood, I do not think we need to rush this that much, especially considering August is a quiet month.

@msorel yeah, I agree, we should also take MC into account, of course.

jjgomezcadenas commented 5 years ago

> Yeah, it's not two weeks' work, that's for sure. But I'm somehow missing the point (or missed it yesterday): I thought the evaluation of the current mismatch was in Gonzalo's hands and the evaluation of the mismatch (if it's there) on MC was in Alberto's. On the other hand, @mmkekic and I were supposed to develop a set of tools so this does not happen again.

Correct

> I understand that you would want this set of tools to be tested on the current situation, but I fail to see why it's mandatory to have it in two weeks. I mean, if this works, then the study by Gonzalo and Alberto is not really needed (and they may help in this work or do other things), unless we feel that we need two independent cross-checks; is that the case?

I simply want something simple, independent of Alberto and Gonzalo's effort, that tests the production. Actually, what you propose makes all the sense in the world, but simplify: give me a prototype first (no need to push things to GitHub, simplify the report, limit the bell ringing).

Why in two weeks? Because it is a prototype and I want you to focus on making it work, rather than setting off on long-term work that takes 6 months to complete. We have been there before...

> My biggest doubt is: what are @mmkekic and I supposed to find, a solution for the current mismatch (urgent, but already being looked into by others) or a way to avoid getting into the same situation again?

You are supposed to come up with a TEST system that, when run on the current versions, loudly complains, and when run on the fixed versions (which will be available in two weeks), gives OK. To repeat: exactly what you want to do, but as simple as it gets (almost a yes-or-no answer).

> If it's the second one, which is what I understood, I do not think we need to rush this that much, especially considering August is a quiet month.

Yes, you need to rush the prototype, because I don't want to approve a new IC release until it passes your test. Your test may be as simple as you can get in two weeks, but I want it there to approve the release.

Then, during August, improve it as much as you want, give me the whole thing at your leisure. But give me something that works in the scheduled time.

In other words: I don't want to repeat the experience of developing a piece of code for 6 months before the first release, as happened with Esmeralda. Let's do it step by step.

Aretno commented 5 years ago

Ok, got it, crystal clear.

So we'll remove all the git automation; the simplest approach will be something like this:

  1. José Vicente makes a common account on majorana with IC installed at HEAD.
  2. A script takes as input two different IC installations (the HEAD and the one to be tested), as well as two config folders, one containing the config files for each installation.
  3. The dev logs into the common account, checks that the IC installation is at HEAD and installs their PR version. Then they run the script (see the sketch after this list) and produce the report (it will be more or less detailed depending on how far we can go; I guess we can use Olivia for pmap plotting, try to add some general penthesilea and esmeralda distribution plots, and add some survival rates).
  4. The dev uploads the report to the PR.
  5. The reviewer shall check the report; if it fails, the differences should be understood and explained (by the developer) before merging the PR.
  6. If approved, the produced data should be erased from majorana by the dev.
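As a rough idea of what the script in point 2 could look like (the `city` invocations, environment setup scripts and directory names are placeholders for whatever we end up using on the common account):

```python
# Sketch only: run the same city chain with two IC installations and then
# compare the outputs.  All paths and commands below are placeholders.
import subprocess
from pathlib import Path

CITIES = ["irene", "penthesilea", "esmeralda"]   # chain to exercise; adjust as needed

def run_chain(ic_setup_script, config_dir, out_dir):
    """Run every city of the chain with one IC installation."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for city in CITIES:
        cmd = f"source {ic_setup_script} && city {city} {config_dir}/{city}.conf"
        subprocess.run(cmd, shell=True, check=True, executable="/bin/bash")

def main(head_setup, pr_setup, head_confs, pr_confs):
    run_chain(head_setup, head_confs, "prod_head")
    run_chain(pr_setup,   pr_confs,   "prod_pr")
    # ... then run the table comparison on matching files and write the report
```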

Probably this can be done, with more or less detail in the report, within these two weeks, so we'll start with that.

jjgomezcadenas commented 5 years ago

Exactly. THANKS.

To repeat: I think the idea of producing a fully automated test suite is GREAT. But let's split the work into atoms...

jmbenlloch commented 5 years ago

These are my two cents:

@aretno consulted me yesterday on how to implement a system such as the one described in his first message. As you know, I don't really have time to write all the code needed, but I think it is a very interesting problem and I couldn't resist spending a couple of days mocking up a prototype with the most basic functionality, so the people involved can understand my proposal and have an MVP (Minimum Viable Product) from which to continue working if the architecture is accepted.

TL;DR

Have a look at these PRs:

- One not breaking anything: https://github.com/nextic/testprod/pull/2
- One breaking penthesilea: https://github.com/nextic/testprod/pull/3

Once the PR is approved, the system will start to work:

[screenshot: the PR status check marked as pending]

On that repository, status reports do not appear on the PR, but if you go to each branch you will see that the first one is marked as good (report):

[screenshot 2019-07-05 17-20-02: commit status marked as success]

While the second one is broken with this report:

[screenshot 2019-07-05 17-20-15: commit status marked as failure]

Extended version

I have created a GitHub repo with all the code I've written; no documentation yet, but we can work on that.

This is a diagram of the system:

[diagram: github_scheme (overview of the system architecture)]

The main ideas are the following:

GitHub can send a JSON request to any REST API server (basically a POST request) anytime an event happens in a particular repository. In our case, we are interested in the event of a PR being approved.

I have created a small web server using Python + Flask to receive the requests from GitHub. Each time a new PR review event is received, a message is sent to a queue managed by RabbitMQ with the information on the two versions of IC to be compared (the current master and the approved-but-not-merged PR). This also creates a new "queued" status on GitHub for the latest commit in the PR.
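A stripped-down sketch of that receiver (the host, queue name and message fields are simplified with respect to the actual code in the repo mentioned above):

```python
# Sketch only: receive GitHub "pull_request_review" webhooks and enqueue a comparison job.
import json
import pika
from flask import Flask, request

app = Flask(__name__)

def enqueue(message, queue="pull_request"):
    conn    = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(exchange="", routing_key=queue, body=json.dumps(message))
    conn.close()

@app.route("/webhook", methods=["POST"])
def webhook():
    event   = request.headers.get("X-GitHub-Event", "")
    payload = request.get_json()
    # Only react to an approving review on a pull request
    if event == "pull_request_review" and payload["review"]["state"] == "approved":
        pr = payload["pull_request"]
        enqueue({"pr_sha"  : pr["head"]["sha"],   # commit to test
                 "base_sha": pr["base"]["sha"],   # current master
                 "number"  : pr["number"]})
    return "", 204
```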

The consumer(s) for the "pull_request" queue will start a docker build process to create an image of the IC version for the PR commit and another one for the current master. After finishing, it will push the images to Docker Hub to have them archived and tagged (the very same installation that will be used for the test production). If either of the two versions already exists (as is likely for master), it will be pulled from Docker Hub and not rebuilt. Once finished, it will send a message to the "production" queue with the two versions to be compared.
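In sketch form (simplified with respect to the actual code; the `nextic/ic` Docker Hub repository name and the build arguments are placeholders):

```python
# Sketch only: consume the "pull_request" queue and make sure a docker image exists
# for each IC version, building and pushing it only when it is not on Docker Hub yet.
import json
import subprocess
import pika

IMAGE = "nextic/ic"          # placeholder Docker Hub repository

def ensure_image(sha):
    tag = f"{IMAGE}:{sha}"
    if subprocess.run(["docker", "pull", tag]).returncode != 0:   # not archived yet
        subprocess.run(["docker", "build", "-t", tag,
                        "--build-arg", f"IC_COMMIT={sha}", "docker/"], check=True)
        subprocess.run(["docker", "push", tag], check=True)
    return tag

def on_message(channel, method, properties, body):
    msg    = json.loads(body)
    images = {key: ensure_image(msg[key]) for key in ("base_sha", "pr_sha")}
    # hand the two tagged images over to the production stage
    channel.basic_publish(exchange="", routing_key="production",
                          body=json.dumps({**msg, **images}))
    channel.basic_ack(delivery_tag=method.delivery_tag)

conn    = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="pull_request", durable=True)
channel.queue_declare(queue="production",   durable=True)
channel.basic_consume(queue="pull_request", on_message_callback=on_message)
channel.start_consuming()
```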

The consumer for the "production" queue will create all the jobs needed and send them to the PBS queue using qsub. Caveats:

The consumer for the "comparison" queue will run the jobs needed to compare the production results of both versions. Currently this is simply a check on the energy of the hits written in the hdst files. An HTML report with a histogram is produced too and copied to a public directory on the web server. When the job has finished, a message is sent to the "finished" queue.
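A bare-bones version of that check could look like this (the `RECO/Events` node and the `E` column are my assumptions about the hdst layout):

```python
# Sketch only: compare the hit energies written by the two productions and dump a
# single-histogram HTML report.
import base64, io
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
import tables as tb

def compare_hit_energy(hdst_master, hdst_pr, report_path="report.html"):
    with tb.open_file(hdst_master) as fm, tb.open_file(hdst_pr) as fp:
        e_master = fm.root.RECO.Events.col("E")
        e_pr     = fp.root.RECO.Events.col("E")
    ok = e_master.shape == e_pr.shape and np.allclose(e_master, e_pr)

    fig, ax = plt.subplots()
    ax.hist(e_master, bins=100, histtype="step", label="master")
    ax.hist(e_pr,     bins=100, histtype="step", label="PR")
    ax.set_xlabel("hit energy"); ax.legend()
    buf = io.BytesIO(); fig.savefig(buf, format="png"); plt.close(fig)
    img = base64.b64encode(buf.getvalue()).decode()

    with open(report_path, "w") as out:
        out.write(f"<h1>hdst comparison: {'OK' if ok else 'FAILED'}</h1>"
                  f'<img src="data:image/png;base64,{img}">')
    return ok
```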

The consumer for the "finished" queue will update the PR status on GitHub with the final result (success or failure), including a link to the report.
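That amounts to a single call to GitHub's commit status API, roughly like this (the token handling and context string are placeholders):

```python
# Sketch only: set the commit status on GitHub once the comparison has finished.
import os
import requests

def set_status(repo, sha, success, report_url):
    """repo like 'nextic/testprod'; sha is the head commit of the PR."""
    requests.post(
        f"https://api.github.com/repos/{repo}/statuses/{sha}",
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={"state"      : "success" if success else "failure",
              "target_url" : report_url,              # link to the HTML report
              "description": "IC production comparison",
              "context"    : "testprod/comparison"})
```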

General comments

I have set up a virtual machine on the Google Cloud Platform with all that software as an example. I've also created a copy of the IC repo and a fork to make PRs against it. You can see some sample PRs in the links above; they are already approved and the system has run, as you can see in the status of the last commit of each PR.

If you want to play a little to see the system in action, feel free to fork the repo and make a PR. I don't guarantee everything will work ok; it is very preliminary and I won't keep the virtual machine running forever as it costs money (right now I'm on free credit, so I can keep it there for some time while we discuss).

Having said all that, there are many things that have to be improved, some related to the tests we want to run (config file management, running on data and MC, how to compare the productions, report generation…) and others technical (log management, creating daemons so the web server and the message consumers run automatically, devising a way to remove old docker images from Majorana, etc.).

And now the sad part: I won't be leading the development of this system; there are plenty of things to do and, as all of you know, I'm already busy with some mundane tasks such as finishing a PhD. In any case, I think you can use this code and these ideas as a blueprint if you like them :)