Open exalate-issue-sync[bot] opened 1 year ago
Neeraja Madabhushi commented: From Brandon's detailed description
1) Regression
This is rather straightforward. For this, each test can compare against a benchmark score. Given that H2O currently ships, let us take the current scores as the benchmarks. Over time, the benchmark values may change in different cases: the score can be the highest value seen from H2Ov3 so far, or another target score chosen at a later date.
While some may want to see this as a pass/fail test, we need more out of it. For most algorithm changes, we will see some tests improve and others decline. It may be that the declines are all extremely small and the improvements large enough to outweigh the losses. It is really the role of the director in charge of releases to make the call on whether the change is acceptable. In this sense, we need to record the amount of regression and display it in a way that is constructive for that director.
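As a rough sketch of what "record the amount of regression" could look like: compare each test's current score against its benchmark score and report the deltas rather than a bare pass/fail. The test-case ids and numbers below are placeholders, and a higher-is-better metric (e.g. AUC) is assumed.

```python
# Minimal sketch: record the size of every change instead of pass/fail.
# Test-case ids and scores are placeholder values, not real results.

benchmark_scores = {"example_gbm_auc": 0.780, "example_glm_auc": 0.695}
current_scores = {"example_gbm_auc": 0.778, "example_glm_auc": 0.702}

def score_deltas(current, benchmark):
    """Return (test_case_id, benchmark, current, delta) tuples, largest declines
    first, so the release director can weigh small losses against larger gains."""
    rows = []
    for test_id, bench in benchmark.items():
        if test_id in current:
            rows.append((test_id, bench, current[test_id], current[test_id] - bench))
    return sorted(rows, key=lambda r: r[3])

for test_id, bench, cur, delta in score_deltas(current_scores, benchmark_scores):
    print(f"{test_id}: benchmark={bench:.4f} current={cur:.4f} delta={delta:+.4f}")
```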
2) Performance comparison
This goal helps us understand our strengths and weaknesses relative to our competitors. Again, we want to understand how our current scores compare to chosen competitors. In this case, the competitors' algorithms don't need to be run daily; their code can be updated periodically, run, and the scores collected. There are two aspects to this: mechanics and data science.
For the mechanics, we need to make it simple to add and change the competitor runs. We also need a system that can grab these results, display how H2O currently fares in each case, and summarize larger trends. The first task is some scripting that I think still needs to be figured out. For the display, I suspect a rather simple script or tool will solve it.
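One possible shape for that scripting is a small registry of runner functions that all return results in the same form, so adding or changing a competitor run is just adding a function. The runner name and body below are hypothetical placeholders, not an existing harness.

```python
# Sketch of a registry that keeps competitor runs easy to add or swap out.
# Each runner returns a uniform result dict; names and bodies are placeholders.

COMPETITOR_RUNNERS = {}

def register(source_name):
    """Decorator that adds a runner function to the registry under `source_name`."""
    def wrap(fn):
        COMPETITOR_RUNNERS[source_name] = fn
        return fn
    return wrap

@register("glmnet")
def run_glmnet(test_case):
    # Placeholder: a real runner would shell out to an R script and parse its output.
    return {"test_case_id": test_case, "source": "glmnet",
            "metric_type": "AUC", "result": None}

def run_all(test_case):
    """Run every registered competitor on one test case and collect the results."""
    return [runner(test_case) for runner in COMPETITOR_RUNNERS.values()]
```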
For the data science, there are a lot of hard questions. We can proceed by answering the easiest first to get some initial results. So we should mark all the cases where we have apples-to-apples comparisons and get those results. At the very least, this can be done with H2Ov2 defaults vs. H2Ov3 defaults. For many more cases, this will be an incomplete look: different algorithm implementations will benefit from different parameters. Grid searches will need to be done to discover what our best parameters are and what the best parameters are for the competing algorithms. These parameters can be recorded and used from run to run; periodic updates will probably be necessary. This gives us two scores per algorithm implementation: a score with default parameters and a score with tuned parameters. We can get these two scores for H2O and for other algorithms where it makes sense to do so. Again, building proper infrastructure (mechanics) should help keep this process repeatable even when different cases may be comparing against different algorithms.
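A minimal sketch of collecting the two scores per implementation, using a plain parameter grid rather than any particular library's grid search; `train_and_score` is a caller-supplied, hypothetical hook into whichever algorithm is being measured.

```python
import itertools

def default_and_tuned_scores(train_and_score, default_params, grid):
    """Return (default_score, best_tuned_score, best_params).

    `train_and_score` is a caller-supplied function(params) -> score (e.g. AUC).
    `grid` maps parameter names to lists of candidate values.
    Assumes higher scores are better; invert the comparison for MSE-style metrics.
    """
    default_score = train_and_score(default_params)
    best_score, best_params = default_score, dict(default_params)
    keys = list(grid)
    for combo in itertools.product(*(grid[k] for k in keys)):
        params = dict(default_params, **dict(zip(keys, combo)))
        score = train_and_score(params)
        if score > best_score:
            best_score, best_params = score, params
    return default_score, best_score, best_params
```

The best parameters returned here are what would be recorded and reused from run to run, with occasional refreshes as the implementations change.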
The more we've looked at this, the more I believe a database is useful here. For now that can be a flat file. Having a history allows us to visualize where we've been and how we are improving. What about this as a basic schema? Note that it is fine to mix H2O and other results.
test-case-id;
training frame id;
validation frame id;
metric_type (e.g. AUC, MSE);
result;
date;
source (e.g. H2Ov2, glmnet, H2Ov3);
parameter list/run command;
git_hash/version number;
tuned/defaults
Can you think of other items that should be in the list? With all results being stored, we can then generate different views as desired. Each results view is just an SQL query away.
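As one concrete interpretation of "flat file for now", the schema above could live in a single-table SQLite file. The column names below are the schema items normalized to valid identifiers, and the types are guesses rather than a settled design.

```python
import sqlite3

# Minimal sketch: the "flat file" as a single-table SQLite database.
# Columns follow the schema list above; types are assumptions.
conn = sqlite3.connect("accuracy_results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        test_case_id        TEXT,
        training_frame_id   TEXT,
        validation_frame_id TEXT,
        metric_type         TEXT,    -- e.g. AUC, MSE
        result              REAL,
        run_date            TEXT,    -- ISO-8601 date string
        source              TEXT,    -- e.g. H2Ov2, glmnet, H2Ov3
        parameters          TEXT,    -- parameter list / run command
        git_hash            TEXT,    -- git hash / version number
        tuned               INTEGER  -- 0 = defaults, 1 = tuned
    )
""")
conn.commit()
```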
This design allows the system to provide views for several different consumers:
- Sales engineers need to know where we stand compared to other algorithms.
- Developers want to see the effects of a change to a specific algorithm across a spectrum of datasets (see the query sketch below).
- Release managers want to see what the pros and cons of current changes are.
- Upper management wants to see how we are improving over time.
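For example, the developer view could be a single query over the sketch table above, comparing results between two git hashes; the hash values here are placeholders.

```python
import sqlite3

# Sketch of the "developer" view: effect of a change across datasets,
# comparing two git hashes (placeholder values) in the table sketched above.
conn = sqlite3.connect("accuracy_results.db")
query = """
    SELECT a.test_case_id, a.metric_type,
           a.result AS before, b.result AS after,
           b.result - a.result AS delta
    FROM results a
    JOIN results b
      ON a.test_case_id = b.test_case_id
     AND a.metric_type = b.metric_type
     AND a.tuned = b.tuned
    WHERE a.source = 'H2Ov3' AND b.source = 'H2Ov3'
      AND a.git_hash = :before_hash AND b.git_hash = :after_hash
    ORDER BY delta
"""
rows = conn.execute(query, {"before_hash": "abc123", "after_hash": "def456"}).fetchall()
```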
Getting There
We currently have the code to read in our main parameters spreadsheet and run tests. With a small change to these tests we can store the results in a simple database (or a flat file for now). I may be able to have Jenkins collect this for now. Can we run only subsets easily? This would be useful for developers. As time progresses, larger tests will be added, and re-running all tests all the time doesn't make good use of resources.
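The "small change to these tests" could be a helper each test calls after it scores, appending one row to the SQLite file sketched above; the helper name and argument shape are hypothetical.

```python
import sqlite3
from datetime import date

def record_result(db_path, row):
    """Append one result row (a dict keyed by the schema's column names)
    to the SQLite flat file. Hypothetical helper the tests would call."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """INSERT INTO results (test_case_id, training_frame_id, validation_frame_id,
                                metric_type, result, run_date, source, parameters,
                                git_hash, tuned)
           VALUES (:test_case_id, :training_frame_id, :validation_frame_id,
                   :metric_type, :result, :run_date, :source, :parameters,
                   :git_hash, :tuned)""",
        {"run_date": date.today().isoformat(), **row},
    )
    conn.commit()
    conn.close()

# Example call with placeholder values:
# record_result("accuracy_results.db", {
#     "test_case_id": "example_gbm_auc", "training_frame_id": "example_train",
#     "validation_frame_id": "example_valid", "metric_type": "AUC", "result": 0.78,
#     "source": "H2Ov3", "parameters": "defaults", "git_hash": "abc123", "tuned": 0})
```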
From this alone, we can generate a script to find regressions in the database or flat file.
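That script could be as simple as scanning the stored history per test case and flagging any case whose newest H2Ov3 result falls below the best previously seen. This sketch assumes higher-is-better metrics; MSE-style metrics would need the comparison flipped.

```python
import sqlite3
from collections import defaultdict

# Sketch: find H2Ov3 test cases whose newest result is below the best result
# previously recorded for that case in the table sketched above.
conn = sqlite3.connect("accuracy_results.db")
rows = conn.execute(
    """SELECT test_case_id, metric_type, result, run_date
       FROM results WHERE source = 'H2Ov3'
       ORDER BY run_date"""
).fetchall()

history = defaultdict(list)
for test_id, metric, result, run_date in rows:
    history[(test_id, metric)].append(result)

for (test_id, metric), results in history.items():
    latest, best_before = results[-1], max(results[:-1], default=None)
    if best_before is not None and latest < best_before:
        print(f"REGRESSION {test_id} ({metric}): best={best_before} latest={latest} "
              f"delta={latest - best_before:+.4f}")
```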
For the performance comparison, we need a few things:
- Identify the low-hanging fruit and get immediate results.
- Figure out a system to allow re-running each competitor test and capturing results.
- Grid searches to find optimal parameters to build up "our best" vs. "their best" comparisons.
Brandon Hill commented: Here are the schema items, with a few additions from the schema Spencer had for the performance DB in H2Ov2.
test-case-id;
training frame id;
validation frame id;
metric_type (e.g. AUC, MSE);
result;
date;
interpreter_version (e.g. JVM, Python, or R version);
machine_name (e.g. mr-0xb4);
total_hosts;
cpus_per_hosts;
total_nodes;
source (e.g. H2Ov2, glmnet, H2Ov3);
parameter list/run command;
git_hash/version number;
tuned/defaults;
JIRA Issue Migration Info
Jira Issue: PUBDEV-1913
Assignee: DisabledN
Reporter: Neeraja Madabhushi
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
Jira issue to track TestNG improvements:
1) Functionality to run single tests or a range of tests.
2) Param data to be accessed from the test cases document based on column headers.
3) From Brandon (detailed description in comments):
   1) knowing whether accuracy has changed, how much, and in which cases
   2) knowing our accuracy compared to competitors