mixOmicsTeam / mixOmics

Development repository for the Bioconductor package 'mixOmics'
http://mixomics.org/

Automatic Ground Truth Generation for tests #219

Closed Max-Bladen closed 2 years ago

Max-Bladen commented 2 years ago

As part of PR #206, a more efficient automatic pipeline needs to be established for generating the ground truth files. Rather than storing them as .rda, they will be converted to JSON so that any changes can be viewed via Git.

The system will be built such that if an environment variable (e.g. GT_UPDATE) is set to TRUE, any tests that aren't already represented in the existing set of test ground truths will be run and their output saved. If a representation of that GT exists but differs, the new version will replace the old.

If GT_UPDATE is set to FALSE, all tests will be run, with any messages, warnings, errors or incongruence between GT and test values reported. This will temporarily suspend work on PR #206, but it will make that work much easier going forward.
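The switch described above might be sketched roughly as below. This is a hedged illustration only: the function names, the `tests/gt` layout and the use of `saveRDS()`/`readRDS()` (rather than `save()`/`load()` for .rda) are assumptions for a self-contained example, not the actual implementation.

```r
# Illustrative sketch of the GT_UPDATE switch (names and layout hypothetical)
save_gt <- function(value, path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(value, path)  # using .rds here for a simple self-contained demo
}

check_or_update_gt <- function(test.name, test.value, gt.dir = "tests/gt") {
  gt.path <- file.path(gt.dir, paste0(test.name, ".rds"))

  if (identical(Sys.getenv("GT_UPDATE"), "TRUE")) {
    # No GT yet, or the GT differs: (re)generate it from this run's output
    if (!file.exists(gt.path) || !identical(readRDS(gt.path), test.value)) {
      save_gt(test.value, gt.path)
    }
    return(invisible(TRUE))
  }

  # Normal run: flag any incongruence between the stored GT and test value
  stored <- readRDS(gt.path)
  if (!isTRUE(all.equal(stored, test.value))) {
    warning(sprintf("'%s' differs from its stored ground truth", test.name))
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Usage: first populate the GT, then run in comparison mode
Sys.setenv(GT_UPDATE = "TRUE")
check_or_update_gt("toy_sum", sum(1:10), gt.dir = tempdir())
Sys.setenv(GT_UPDATE = "FALSE")
ok <- check_or_update_gt("toy_sum", sum(1:10), gt.dir = tempdir())
```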

Max-Bladen commented 2 years ago

Brief update and question @aljabadi

I've got conversion between .rda and .json working pretty much properly now. Unfortunately, the toJSON() function (from rjson), while useful, caused many formatting and data-loss issues. Hence, I've needed to build an intermediary format (I'm calling it JRR, JSON R Representation) which allows for the perfect reconstruction of an .rda from .json. I believe that, apart from a few edge cases (working on them now), we can freely swap between these two data types via JRR.
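The kind of data loss at issue can be shown with a tiny example: a plain rjson round trip does not preserve R types or structure, so an intermediary that records metadata alongside the values is needed. The `jrr_encode`/`jrr_decode` wrapper below is purely illustrative of the idea, not the actual JRR format, and only handles simple named vectors.

```r
library(rjson)

x <- c(a = 1L, b = 2L)    # a named integer vector
y <- fromJSON(toJSON(x))  # comes back with the integer type (and, for more
                          # complex objects, dim/class attributes) lost
identical(x, y)           # FALSE

# JRR-style idea (illustrative only): store values together with the
# metadata needed to rebuild the original R object exactly
jrr_encode <- function(v) {
  toJSON(list(type = typeof(v), values = unname(v), names = names(v)))
}
jrr_decode <- function(json) {
  obj <- fromJSON(json)
  out <- as.vector(unlist(obj$values), mode = obj$type)
  names(out) <- unlist(obj$names)
  out
}

identical(jrr_decode(jrr_encode(x)), x)  # TRUE for this simple case
```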

Here's the problem: .json is much less space efficient than .rda. Based on all the test GT files I've generated, converting an .rda to .json prior to any of my processing roughly quadruples the file size. Additionally, the .json in JRR format can be 300% to 700% the size of the associated .rda.

This means that staying below 500kB for all GT files may no longer be as easy as before. I initially thought this would mean that visualising the diff when updating GT files wouldn't be possible. I had an idea though. For both the old and new versions of the GTs, generate a .json at run time (takes on average 0.01 seconds on the largest .rda). From this, a document (eg. gt.update.log) outlining the diff could be generated by comparing the two .json files.

The only change from the system we spoke about is that the diff would be inspected in this document rather than on GitHub - though including gt.update.log in the package would allow it to be viewed via git diffs.
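The runtime-diff idea might look something like the sketch below. Everything here is an assumption for illustration: `write_gt_diff`, the `gt.update.log` layout, and the crude line-set comparison (a real implementation would use a proper line diff, e.g. via diffobj or `tools::Rdiff`).

```r
library(rjson)

# Hypothetical sketch: render old and new GT objects as pretty-printed
# JSON (one field per line) and log which lines were removed/added
write_gt_diff <- function(old.gt, new.gt, log.path = "gt.update.log") {
  old.lines <- strsplit(toJSON(old.gt, indent = 2), "\n")[[1]]
  new.lines <- strsplit(toJSON(new.gt, indent = 2), "\n")[[1]]
  # Crude set comparison, not a true ordered diff - illustration only
  removed <- setdiff(old.lines, new.lines)
  added   <- setdiff(new.lines, old.lines)
  writeLines(c(paste("-", removed), paste("+", added)), log.path)
  invisible(log.path)
}

# Usage: only the changed field shows up in the log
log <- write_gt_diff(list(ncomp = 2, mode = "regression"),
                     list(ncomp = 3, mode = "regression"),
                     log.path = tempfile())
```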

Let me know what your thoughts are. Storing the .json files seems prohibitively expensive, but I will need some more time to implement this new system.

aljabadi commented 2 years ago

Thanks @Max-Bladen for looking into this in such great detail. Very useful learnings. I do agree that the file size would be an issue in the long run. It looks to me like we are facing a choice between:

a) increasing the breadth of the output components being tested, but adding more effort in i) inspecting why a test breaks and ii) updating the ground truth data upon model improvements; and

b) reducing the breadth but also decreasing the complexity of development and thus improving developer experience.

In my opinion the codebase is already complex, so I would really love it if we could avoid binary files in the unit tests. Maybe we can keep these changes (those associated with testing the full specification of the model) in an open PR and switch our focus to testing 90%+ of the code with simple checks (or checks against light JSONs of, say, just the values of the loadings or some other key metrics/statistics). This way the efforts are preserved and we can always revisit them if user feedback warrants it. What are your thoughts? Useful resource: https://blog.dominodatalab.com/unit-testing-data-science
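The "light JSON" option might look like the toy sketch below: store only a few key numbers (here, pretend loadings) in a small JSON ground truth and compare with a numeric tolerance. The values, file name and field names are all made up for illustration.

```r
library(rjson)

# Toy "model output": stand-ins for loadings from a fitted model
loadings <- c(comp1 = 0.71, comp2 = -0.42)

# One-off GT generation (e.g. under GT_UPDATE = TRUE): a small,
# human-readable, git-diffable JSON file
gt.file <- tempfile(fileext = ".json")
writeLines(toJSON(as.list(loadings)), gt.file)

# Later test run: read the light GT back and compare with tolerance
gt <- unlist(fromJSON(file = gt.file))
stopifnot(isTRUE(all.equal(loadings, gt, tolerance = 1e-8)))
```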

Max-Bladen commented 2 years ago

I certainly think developing this kind of system will not only increase the rate at which I can achieve 90% coverage, but will also make any future developer's life much easier. Out of curiosity, what is the issue with the .rda files? I'm far from an expert, but they seem to present the easiest (and most compact) way to assess our methods thoroughly.

Here are a few options and my opinions. If we don't want to use .rda files, we can contain the ground truths within the test files themselves. This was the first method I explored, and it was a massive headache to work with - plus it requires manual adjustment of ground truths. The next option is using wide-breadth JSONs, but these will be prohibitively massive. The third option (as you outline) is using simple tests (e.g. just the class of the output, or occasionally a small vector of values) stored as smaller JSONs or within the tests.

The third option, as I see it, suffers from the drawbacks of both of the former options: it will require manual adjustments to the test GTs whenever we change error messages, returned classes, error-class functionality, component names, etc. Additionally, for the same file size, we can store significantly more information in an .rda than in a JSON.

I understand that I'm a newbie and there are probably pros, cons and options I'm missing, but it does seem like binary files are the best choice. Seeing as I've done most of the work, implementing a way to visualise any changes in the GT .rda files shouldn't require much more effort. Personally, for all those reasons, I think it's best to continue work on it, but I also understand if you want me to ditch the binary files.

Let me know what you think.

Max-Bladen commented 2 years ago

@aljabadi, could I please get your input when you have a chance? I'll work on some other bug fixes until I hear your reply, but this is the task I'd like to get rolled out ASAP.