integrate with Lightning ecosystem CI

Borda commented 2 years ago

Hello, and so happy to see you use PyTorch Lightning! :tada: Just wondering if you have already heard about the quite new PyTorch Lightning (PL) ecosystem CI, which we would like to invite you to... You can check out our blog post about it: Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI :zap: As you use the PL framework for your cool project, we would like to enhance your experience and offer you safe updates to our future releases. At the moment, you run tests with a particular PL version, but the next version may accidentally turn out to be incompatible with your project... :confused: We do not intend to change anything on our project side, but we still have a solution here: with the ecosystem CI testing both your latest development head and ours, we can find such problems very early and prevent releasing a bad version... :+1:

What needs to be done?

have some tests, including PL integration
add config to ecosystem CI - https://github.com/PyTorchLightning/ecosystem-ci

What will you get?

scheduled nightly testing configured for development/stable versions
Slack notification if something goes wrong, so you can investigate
testing also on a multi-GPU machine as our gift to you :rabbit:

Borda commented 2 years ago

In fact, I already started this integration for you, so if you could have a look: https://github.com/PyTorchLightning/ecosystem-ci/pull/27

vturrisi commented 2 years ago

Hi @Borda. Thank you for including us :) I'll take a look at the ecosystem. What should we add to the PR?

Borda commented 2 years ago

> Hi @Borda. Thank you for including us :) I'll take a look at the ecosystem.

@vturrisi you are very welcome!

> What should we add to the PR?

if you could check it and:

approve if it looks good to you
if you are on the PL Slack, provide your user name so I can add you as the contact person for notifications

Also, I was checking compatibility with the latest master, and it seems that one of your tests does not expect loops in the checkpoint, so if you don't mind, please consider updating this test so that we are also aligned with the latest development state... https://github.com/PyTorchLightning/ecosystem-ci/runs/4934800624?check_suite_focus=true

vturrisi commented 2 years ago

Hi @Borda, there's a typo in the file name (solo-learn_pl-devlop.yaml -> solo-learn_pl-develop.yaml). Apart from that, my email is vt.turrisi@gmail.com, if you could add it. I'm not part of the slack channel, but I would be happy to join (also think @DonkeyShot21 would like to join as well).

I've just fixed the tests, it should now allow new keys to be there. Let me know if it worked. Also, we would like to integrate dali tests, but those are a bit more work, as installing it with our setup is not working sometimes.
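A minimal sketch of the kind of check described above, assuming the test previously compared the checkpoint keys for exact equality. The key names and the `missing_checkpoint_keys` helper are hypothetical and not taken from the solo-learn test suite; the idea is only to require a known subset of keys and tolerate extras such as `loops`:

```python
def missing_checkpoint_keys(checkpoint, required_keys):
    """Return the required keys that are absent from the checkpoint dict.

    Extra keys (for example a newly added "loops" entry) are ignored,
    so the check keeps passing when a newer PL version adds fields.
    """
    return sorted(set(required_keys) - set(checkpoint))


# Hypothetical key names, chosen only for illustration.
REQUIRED_KEYS = ["epoch", "global_step", "state_dict"]

# Stand-in for a loaded checkpoint that carries an extra "loops" entry.
checkpoint = {"epoch": 1, "global_step": 10, "state_dict": {}, "loops": {}}

# Passes even though "loops" is not in the required list.
assert missing_checkpoint_keys(checkpoint, REQUIRED_KEYS) == []
```

A subset check like this still fails if a required key disappears, which is the regression the test is meant to catch.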

Currently, our tests are not actually training the model itself so they are a bit limited in that regard. Eventually, we would want to run a couple of epochs for each method/dataset and catch potential errors (e.g. loss not going down and so on). Would this be possible?
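One way such a "loss not going down" check could look, as a rough sketch only. The `loss_went_down` helper and its `window` parameter are hypothetical; it assumes the short training run exposes a list of per-epoch (or per-step) loss values, which is not something the current tests provide:

```python
import math


def loss_went_down(losses, window=2):
    """Compare the mean of the first `window` losses with the mean of the last.

    Averaging a few values on each end makes the check less sensitive
    to noise in a single step or epoch.
    """
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    # A NaN or infinite loss counts as a failed run.
    if not all(math.isfinite(value) for value in losses):
        return False
    start = sum(losses[:window]) / window
    end = sum(losses[-window:]) / window
    return end < start


# Illustrative values only, not measured results.
assert loss_went_down([2.3, 2.1, 1.8, 1.7])
assert not loss_went_down([2.3, 2.4, 2.5, 2.6])
```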

Borda commented 2 years ago

> I've just fixed the tests, it should now allow new keys to be there. Let me know if it worked.

just re-running the tests :rabbit:

> Also, we would like to integrate dali tests, but those are a bit more work, as installing it with our setup is not working sometimes.

that would be cool... we can make this addition in a follow-up PR (well, if you add it to your codebase it will be run automatically; we may just need to add/build DALI). Btw, it only makes sense to test DALI on GPU, right?

> Currently, our tests are not actually training the model itself so they are a bit limited in that regard. Eventually, we would want to run a couple of epochs for each method/dataset and catch potential errors (e.g. loss not going down and so on). Would this be possible?

that sounds reasonable, for that we should rather use the GPU machine as it is more powerful so the short training would not take too much time :+1:

vturrisi commented 2 years ago

> that would be cool... we can make this addition in a follow-up PR (well, if you add it to your codebase it will be run automatically; we may just need to add/build DALI). Btw, it only makes sense to test DALI on GPU, right?

Yes, DALI is only practical on GPU (it works on CPU but has bad performance). We already have tests for DALI in tests/dali and those work, it's just the installation that might not work straight from solo (I would say it's better if we can build from DALI directly).

> that sounds reasonable, for that, we should rather use the GPU machine as it is more powerful so the short training would not take too much time

Agreed. I can start to develop some new tests which directly call the bash scripts with a very small number of epochs and then we can decide on that. Does it sound good to you?
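A rough sketch of what such a test helper might look like, under stated assumptions: the script path is a placeholder, and the `--max_epochs` flag is assumed to be forwarded by the bash script to the training entry point, which may not match how the solo-learn scripts are actually written. `build_command` and `run_short_training` are hypothetical names:

```python
import subprocess


def build_command(script_path, max_epochs):
    """Build the argument list for a short training run.

    Assumes the bash script forwards extra arguments to the trainer.
    """
    return ["bash", str(script_path), "--max_epochs", str(max_epochs)]


def run_short_training(script_path, max_epochs=2, timeout=600):
    """Run the script and return (exit code, captured stdout)."""
    result = subprocess.run(
        build_command(script_path, max_epochs),
        capture_output=True,
        text=True,
        timeout=timeout,  # fail instead of hanging the CI job
    )
    return result.returncode, result.stdout
```

A test could then assert that the exit code is 0 for each method/dataset script, keeping the epoch count small enough for the GPU machine to finish quickly.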

vturrisi / solo-learn

integrate with Lightning ecosystem CI #219