vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning
MIT License
1.39k stars 181 forks

Is there an e2e integration test on toy data? #303

Closed · turian closed 1 year ago

turian commented 1 year ago

Describe the bug

In doing a major refactor (e.g. switching to OmegaConf or Hydra), it's not clear to me that there is a full e2e integration test. Which main script(s) would be best to test this against?

Additional comments

I might be mistaken, but tests/ only contains unit tests. A full e2e test on the most common main method(s), on a toy dataset, could exercise many code paths and make sure a refactor behaves as intended. (This came up because I wanted to try a Hydra port but had no quick way to check for breakage or a wildly off downstream score versus the expected one.)

An unintended side effect is that codecov will increase :)

vturrisi commented 1 year ago

Indeed, having end-to-end tests on the methods themselves is something that we need. However, a toy dataset cannot correctly evaluate all methods. A decent middle ground would probably be training on a subset of imagenet100 (say 10%) for a couple of epochs and checking whether the obtained results (accuracy and loss values) fall into a predefined range that we'd need to compute beforehand. What do you think @DonkeyShot21?
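For concreteness, a minimal sketch of what such a range check could look like; the method names, numeric bounds, and helper below are hypothetical placeholders (the real bounds would be computed beforehand on the imagenet100 subset), not part of solo-learn:

```python
# Hypothetical sketch of a performance-regression check. The method names,
# numeric bounds, and this helper are placeholders, not solo-learn APIs;
# real bounds would be computed beforehand on the 10% imagenet100 subset.
EXPECTED_RANGES = {
    # method: (min_top1, max_top1, max_final_loss)
    "simclr": (35.0, 45.0, 6.5),
    "byol": (38.0, 48.0, 1.0),
}

def check_method(method: str, top1: float, final_loss: float) -> None:
    min_top1, max_top1, max_loss = EXPECTED_RANGES[method]
    assert min_top1 <= top1 <= max_top1, (
        f"{method}: top-1 {top1} outside [{min_top1}, {max_top1}]"
    )
    assert final_loss <= max_loss, (
        f"{method}: final loss {final_loss} above {max_loss}"
    )
```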

turian commented 1 year ago

@vturrisi Yeah, toy didn't mean synthetic necessarily. A tiny imagenet100 or tiny MNIST. (I suggest MNIST just because there are so few labels that fewer instances might make more sense.)

I googled quickly but couldn't find any tiny image datasets. But maybe you are familiar with some.

According to your reported timings, 10% of a 4m55s epoch is about 30 seconds on a GPU. You might consider, for this e2e test, using a smaller model than a big ResNet so it can run on CPU.

BTW, if you can decide upon a simple spec (which data set, which main functions you want to try, etc), I'm happy to contribute to the development.

COOL NOTE: Lightning Ecosystem CI allows you to "automate issue discovery for your projects against Lightning nightly and releases. You get CPUs, Multi-GPUs testing for free, and Slack notification alerts if issues arise!" Since you have over 500 stars, they will allow you to include solo-learn in their nightly CI and get access to their multi-GPU machines. (Is solo-learn multi-GPU enabled? I haven't poked around yet.)

I would suggest starting with simple e2e tests in your repo and later adding them to the Lightning nightly CI.

vturrisi commented 1 year ago

@turian I think the most important tests would be to validate the performance of the methods (as the other features are easily tested by the unit tests). Linear evaluation is also very decoupled from the methods, so I'm not so concerned with it. If 15 epochs is enough, we would need around 2 hours to run the tests for all the methods (assuming a 10% subset of imagenet100).

I'm also not sure how to manage data with GitHub Actions such that we can upload this imagenet100 subset (is this even possible?). The first step would be to check if we can upload datasets, and then run all the current methods in that specific setting to gather a range of loss and top-1 accuracy values to write the tests against.

About Lightning CI, they reached out to us some time ago and we are already part of that. I haven't had time to look into it, so we're probably not taking advantage of it yet, but if we can use it for these new tests, that would be cool.

turian commented 1 year ago

@vturrisi

I think we're kinda talking about two separate things. I'm more interested in an e2e test that runs quickly and just makes sure nothing breaks. (Unit tests are cool but don't always test the handoff points between different units.)

You are interested in doing hardcore regression testing to make sure scores don't drop on a known dataset.

A few opinions on my e2e proposal:

FYI, Travis will offer free credits to academic / open source projects, but these get exhausted very quickly if you use huge testing matrices (every Python x every PyTorch x every OS), so I'd use that judiciously and only as a periodic supplement to GitHub Actions. (Maybe every time something is merged to main, not on every single push.)

Regarding your suggestion:

Overall, my suggestion is to get the simple, dumb, fast e2e test working first (as I described above). Once that works, we can figure out how to do a proper e2e regression test on a "real" dataset.

turian commented 1 year ago

For GPU/TPU testing, you might also consider asking CircleCI for a grant. I think that's what lightning uses, but I'm not sure. (They are commercial so of course they pay.)


But given the number of stars and citations for your paper, it seems worth asking.

vturrisi commented 1 year ago

@turian been quite busy this week, but I'll try to get back to this as soon as possible. Regardless, the end-to-end tests that you mentioned can easily be done with CIFAR-10, even without a GPU, via GitHub Actions. It's just a matter of defining the scripts in a similar way to what I did in tests/args/test_args.py, e.g. define the scripts as strings, save them, and call a subprocess to execute them.
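A minimal sketch of that pattern, assuming a hypothetical main_pretrain.py entry point and CLI flags (solo-learn's actual script names and arguments may differ):

```python
# Hypothetical sketch of a subprocess-driven e2e smoke test; the entry point
# and CLI flags are illustrative, not solo-learn's actual interface.
import subprocess
import sys

def test_simclr_cifar10_smoke(tmp_path):
    cmd = [
        sys.executable, "main_pretrain.py",  # assumed entry point
        "--dataset", "cifar10",
        "--backbone", "resnet18",
        "--max_epochs", "1",
        "--accelerator", "cpu",              # keep the run CPU-only for CI
        "--checkpoint_dir", str(tmp_path),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
```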

For the tests that I mentioned, I don't think we need anything fancy or automatic, just a set of scripts that we could manually run every couple of versions (or before any major release) to properly assess that nothing broke, performance-wise.

vturrisi commented 1 year ago

The latest commit has tests for all scripts in tests/scripts. I think they are sufficient for checking whether there's something wrong with any script or method. As for performance tests, I'll try to address those in the near future.