pyt-team / TopoModelX

Topological Deep Learning
https://pyt-team.github.io/topomodelx/
MIT License

Diagnose & Speed-up Hypergraph tutorials #215

Closed (ninamiolane closed 12 months ago)

ninamiolane commented 1 year ago

What?

Testing the tutorials on hypergraphs takes ~15 minutes, whereas testing the tutorials on other domains takes ~2-5 minutes (see screenshot).

There is probably one hypergraph tutorial that takes very long and slows down the whole GitHub Actions workflow.

Find out which one and whether it can be accelerated.

Why?

A slow testing workflow slows down all the contributors, who have to wait for all tests to pass before being able to move on.

Image

devendragovil commented 1 year ago

@ninamiolane

Analysis

I have analyzed the run time of all unit tests. The hypergraph tutorials do indeed take the longest. Here are the run times of the five slowest tests:

| Category | Name | Run Time (sec) |
|---|---|---|
| Hypergraph | DHGCN | 208 |
| Hypergraph | Hypersage | 176 |
| Hypergraph | UniGCNII | 81 |
| Hypergraph | UniGCN | 42 |
| Simplicial | Scone | 27 |

My observations:

  1. Individual test times are not that outrageous.
  2. The total is long because all the tests run sequentially.

Deep Dive (DHGCN Tutorial)

All steps take a reasonable amount of time (< 5 secs) except the last one, which is a 5-epoch training run for the DHGCN Hypergraph TNN.

image
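
The method used to collect the per-step timings above is not stated. As a minimal sketch of one way to time individual tutorial steps, assuming nothing beyond the standard library (the `timed` helper and the stand-in steps are hypothetical, not part of TopoModelX):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("load data", timings):
    data = list(range(1000))  # stand-in for a tutorial step
with timed("train", timings):
    total = sum(x * x for x in data)  # stand-in for the training loop

# Print the steps from slowest to fastest.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds:.4f} s")
```

Wrapping each notebook cell's body this way would show which step dominates, which is the kind of breakdown described here.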

Observations

  1. Individual train times do seem reasonable to me (please correct me if I am wrong). These might speed up with GPU access
  2. The environment is built repeatedly. For the tutorials, the libraries are imported repeatedly.
  3. All tests (until recently) were being run sequentially.

Recommendations/Solutions

  1. We can arrange a GPU for the test suite. I don't think GitHub Actions provides a runner with a GPU, so we would need to set up and configure our own runner. However, the configuration might be time-consuming, and hosting a GPU instance might be costly.
  2. We can do aggressive caching for our environment as well as libraries being imported.
  3. We can run tests in parallel. Libraries such as pytest-xdist and pytest-split enable this. Since GitHub-hosted runners have only a few cores, we can use the matrix strategy to parallelize across jobs. The tests can also be split by run time into 5-7 (or as many as required) roughly equal partitions. Since the longest test takes just over 3 minutes, that is the shortest time parallelization can achieve without changing the tutorials themselves.
  4. A naive solution: reduce the number of epochs in the tutorials. Reducing the number of epochs in DHGCN from 5 to 1 cuts the run time to roughly a fifth, and the DHGCN tutorial then finishes within a minute.
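
As a rough illustration of recommendation 3, the sketch below assigns tests to partitions greedily, longest first, always picking the least-loaded partition. It uses only the five timings listed in the analysis above. The names `partition_by_time` and `makespan` are illustrative assumptions, and this is not a description of how pytest-split works internally.

```python
import heapq

# Run times (seconds) of the five slowest tests, taken from the analysis above.
TEST_TIMES = {
    "DHGCN": 208,
    "Hypersage": 176,
    "UniGCNII": 81,
    "UniGCN": 42,
    "Scone": 27,
}


def partition_by_time(times, n_partitions):
    """Greedy longest-first assignment of tests to the least-loaded partition."""
    # Heap entries are (total_seconds, partition_index, test_names).
    heap = [(0, i, []) for i in range(n_partitions)]
    heapq.heapify(heap)
    for name, seconds in sorted(times.items(), key=lambda kv: kv[1], reverse=True):
        total, index, names = heapq.heappop(heap)
        names.append(name)
        heapq.heappush(heap, (total + seconds, index, names))
    return sorted(heap, key=lambda entry: entry[1])


def makespan(partitions):
    """Wall-clock time of the slowest partition."""
    return max(total for total, _, _ in partitions)


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        parts = partition_by_time(TEST_TIMES, k)
        print(k, makespan(parts), [names for _, _, names in parts])
```

With these five timings, one partition takes 534 s in total, and three or more partitions cannot go below 208 s, the DHGCN run time. That matches the point that the longest single test bounds what parallelization alone can achieve.
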
ninamiolane commented 1 year ago

Excellent, thanks for the very detailed diagnosis. I agree with all your points and the solutions.

iv. I like the naive solution of reducing the number of epochs from 5 to 1, together with a comment in the text explaining that in real applications that number should be increased. @devendragovil could you do this?
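
One possible variant of this idea, sketched under assumptions, is to default to a single epoch but let readers override it without editing the tutorial. The environment variable `TUTORIAL_NUM_EPOCHS` and the helper `resolve_num_epochs` are hypothetical and not part of TopoModelX, and the loop body is a placeholder for the tutorial's actual training step.

```python
import os


def resolve_num_epochs(default=1, env_var="TUTORIAL_NUM_EPOCHS"):
    """Return the epoch count, overridable through a (hypothetical) env variable."""
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError(f"{env_var} must be a positive integer, got {raw!r}")
    return value


# Kept at 1 so the tutorial runs quickly in tests;
# increase this for real applications.
num_epochs = resolve_num_epochs(default=1)

for epoch in range(1, num_epochs + 1):
    # Placeholder for the tutorial's actual training step.
    print(f"epoch {epoch}/{num_epochs}")
```

Simply hard-coding `num_epochs = 1` with the explanatory comment, as suggested here, is the simpler option; the override only helps if the same notebook should also serve longer training runs.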

i-iii. These are awesome solutions, but would take more time. Maybe we can deprioritize them for now? (there are a lot of other tasks remaining).

devendragovil commented 1 year ago

@ninamiolane
Yes, I can do this. I can also implement the 3rd solution; I have been working on it independently for some time and should hopefully be able to finish it by Sunday. Would that work?

devendragovil commented 1 year ago

Independently of this issue, I also wanted to ask whether Sunday is a reasonable target for resolving all of the issues assigned to me (or most of them, in case I get completely stuck on one).

ninamiolane commented 1 year ago

Even better if you can do iii as well, thanks for offering!

Sunday is a perfect target deadline 💯 Thanks for your great and fast work.

devendragovil commented 1 year ago

Thanks a lot!

devendragovil commented 1 year ago

@ninamiolane I fell ill after travelling back from India last week, so I couldn't meet the timeline I gave earlier. Sorry for that! I will try to complete all the issues asap. Thanks a lot for your consideration.

ninamiolane commented 1 year ago

Thanks for the heads-up, and sorry to hear that you feel ill. Stay safe!

ninamiolane commented 1 year ago

@devendragovil any update on this?

devendragovil commented 1 year ago

@ninamiolane Oh, I am really sorry for the late response. I have raised a PR for this issue; run times are now around 5.5-6 mins. There is one more thing I have been stuck on for a long time that would reduce the overall run time by another 1-1.5 mins, but this PR already delivers most of the reduction.