vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning
MIT License
1.39k stars 181 forks

Is there an e2e integration test on toy data? #303

Closed · turian closed 1 year ago

turian commented 1 year ago

Describe the bug

In doing a major refactor (e.g. switching to OmegaConf or Hydra), it's not clear to me that there is a full e2e integration test. Which main script(s) would be best to test this against?

Additional comments

I might be mistaken, but tests/ only contains unit tests. A full e2e test on the most common main method(s), on a toy dataset, could exercise many code paths and make sure a refactor behaves as intended. (This came up because I wanted to try a Hydra port but had no quick way to check for breakage or a wildly off downstream score versus the expected one.)

An unintended side effect is that codecov will increase :)

vturrisi commented 1 year ago

Indeed, having end-to-end tests on the methods themselves is something that we need. However, a toy dataset cannot correctly evaluate all methods. A decent middle ground would probably be training on a subset of imagenet100 (say 10%) for a couple of epochs and checking whether the obtained results (accuracy and loss values) fall into a predefined range that we'd need to compute beforehand. What do you think @DonkeyShot21?
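For concreteness, a minimal sketch of what such a range check could look like; the method names, numeric bounds, and helper below are hypothetical placeholders (the real bounds would be computed beforehand on the imagenet100 subset), not part of solo-learn:

```python
# Hypothetical sketch of a performance-regression check. The method names,
# numeric bounds, and this helper are placeholders, not solo-learn APIs;
# real bounds would be computed beforehand on the 10% imagenet100 subset.
EXPECTED_RANGES = {
    # method: (min_top1, max_top1, max_final_loss)
    "simclr": (35.0, 45.0, 6.5),
    "byol": (38.0, 48.0, 1.0),
}

def check_method(method: str, top1: float, final_loss: float) -> None:
    min_top1, max_top1, max_loss = EXPECTED_RANGES[method]
    assert min_top1 <= top1 <= max_top1, (
        f"{method}: top-1 {top1} outside [{min_top1}, {max_top1}]"
    )
    assert final_loss <= max_loss, (
        f"{method}: final loss {final_loss} above {max_loss}"
    )
```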

turian commented 1 year ago

@vturrisi Yeah, toy didn't mean synthetic necessarily. A tiny imagenet100 or tiny MNIST. (I suggest MNIST just because there are so few labels that fewer instances might make more sense.)

I googled quickly but couldn't find any tiny image datasets. But maybe you are familiar with some.

According to your reported timings, 10% of a 4m55s epoch is about 30 seconds on a GPU. You might consider, for this e2e test, using a smaller model than a big ResNet so it can run on CPU.

BTW, if you can decide upon a simple spec (which data set, which main functions you want to try, etc), I'm happy to contribute to the development.

COOL NOTE: Lightning Ecosystem CI allows you to "automate issue discovery for your projects against Lightning nightly and releases. You get CPUs, Multi-GPUs testing for free, and Slack notification alerts if issues arise!" Since you have over 500 stars, they will allow you to include solo-learn in their nightly CI and get access to their multi-GPU machines. (Is solo-learn multi-GPU enabled? I haven't poked around yet.)

I would suggest starting with simple e2e tests in your repo and later adding them to the Lightning nightly CI.

vturrisi commented 1 year ago

@turian I think the most important tests would be to validate the performance of the methods (as the other features are easily tested by the unit tests). Linear evaluation is also very decoupled from the methods, so I'm not so concerned with it. If 15 epochs is enough, we would need around 2 hours to run the tests for all the methods (assuming a 10% subset of imagenet100).

I'm also not sure how to manage data with GitHub Actions such that we can upload this imagenet100 subset (is this even possible?). The first step would be to check if we can upload datasets, and then run all the current methods in that specific setting to gather a range of loss and top-1 accuracy values to write the tests against.

About Lightning CI, they reached out to us some time ago and we are already part of that. I haven't had time to look into it, so we're probably not taking advantage of it yet, but if we can use it for these new tests, that would be cool.

turian commented 1 year ago

@vturrisi

I think we're kinda talking about two separate things. I'm more interested in an e2e test that runs quickly and just makes sure nothing breaks. (Unit tests are cool but don't always test the handoff points between different units.)

You are interested in doing hardcore regression testing to make sure scores don't drop on a known dataset.

A few opinions on my e2e proposal:

FYI, Travis will offer free credits to academic / open source projects, but these get exhausted very quickly if you use huge testing matrices (every Python x every PyTorch x every OS), so I'd use that judiciously and only as a periodic supplement to GitHub Actions. (Maybe every time something is merged to main, not on every single push.)

Regarding your suggestion:

Overall, my suggestion is to get the simple, dumb, fast e2e test working first (as I described above). Once that works, we can figure out how to do a proper e2e regression test on a "real" dataset.

turian commented 1 year ago

For GPU/TPU testing, you might also consider asking CircleCI for a grant. I think that's what lightning uses, but I'm not sure. (They are commercial so of course they pay.)


But given the number of stars and citations for your paper, it seems worth asking.

vturrisi commented 1 year ago

@turian been quite busy this week, but I'll try to get back to this as soon as possible. Regardless, the end-to-end tests that you mentioned can easily be done with CIFAR-10, even without a GPU, via GitHub Actions. It's just a matter of defining the scripts in a similar way to what I did in tests/args/test_args.py, e.g. define the scripts as strings, save them, and call a subprocess to execute them.
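A minimal sketch of that pattern, assuming a hypothetical main_pretrain.py entry point and CLI flags (solo-learn's actual script names and arguments may differ):

```python
# Hypothetical sketch of a subprocess-driven e2e smoke test; the entry point
# and CLI flags are illustrative, not solo-learn's actual interface.
import subprocess
import sys

def test_simclr_cifar10_smoke(tmp_path):
    cmd = [
        sys.executable, "main_pretrain.py",  # assumed entry point
        "--dataset", "cifar10",
        "--backbone", "resnet18",
        "--max_epochs", "1",
        "--accelerator", "cpu",              # keep the run CPU-only for CI
        "--checkpoint_dir", str(tmp_path),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
```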

For the tests that I mentioned, I don't think we need anything fancy or automatic, just a set of scripts that we could manually run every couple of versions (or before any major release) to properly assess that nothing broke, performance-wise.

vturrisi commented 1 year ago

The latest commit has tests for all scripts in tests/scripts. I think they are sufficient for checking whether there's something wrong with any script or method. As for performance tests, I'll try to address those in the near future.