Closed yallup closed 1 year ago
Ahh thanks David! Both of these things look super helpful. I had played around with jit compile a bit, as I had seen it lead to improvements in some other work I had been doing, but just hadn't got around to implementing it here. Should help solve #20.
This shouldn't be a breaking change but we will have to bump the version number up to 0.6.0.
@yallup need to add tqdm to the requirements.txt for the tests I think.
Hmm, yes, the numpy-to-tf conversions probably aren't helping with runtime. I think the speed-up from GPUs is limited for small models as well, because there is an overhead to running on a GPU.
Hi @yallup, have you had a chance to look at this? Be great to get these features merged in. Thanks!
For some reason, the clustering is causing a

```
ValueError: tf.function only supports singleton tf.Variables created on the first call. Make sure the tf.Variable is only created once or created outside tf.function. See https://www.tensorflow.org/guide/function#creating_tfvariables for more information.
```

error with `@tf.function(jit_compile=True)`. It happens on the `test_maf_clustering()` test, which is the only one that has `clustering=True` in the `MAF()`. If I turn this off, the test passes.
I think it's because tensorflow is trying to create the function (and its variables) multiple times in the for loop:

```python
for i in range(len(self.theta)):
    self.maf[i] = self._training(self.theta[i],
                                 self.sample_weights[i],
                                 self.maf[i], self.theta_min[i],
                                 self.theta_max[i])
```

which I can validate by running

```python
self.maf[0] = self._training(self.theta[0],
                             self.sample_weights[0],
                             self.maf[0], self.theta_min[0],
                             self.theta_max[0])
self.maf[1] = self._training(self.theta[1],
                             self.sample_weights[1],
                             self.maf[1], self.theta_min[1],
                             self.theta_max[1])
```

which gives the same error when it gets to `self.maf[1]`.
@yallup I've been trying to resolve this issue but couldn't manage it. I'm keen to take advantage of the tqdm addition though, so I have removed the `@tf.function(jit_compile=True)` decorator on the training step for now.

I will come back to this when I look at #31 again. I think I had the same error with the clustering and the `@tf.function` decorator there as well. If the tests pass, could you please squash and merge? Thanks!
Patch coverage: 100.00% and project coverage change: +0.02% :tada:

Comparison is base (ae63d2a) 88.93% compared to head (b34b8f9) 88.95%.
Totally agree on putting a pin in the speed-up stuff; it was just a little thing I was trying. I got a bit caught up trying out some other bits, such as reducing translations between tf and numpy operations, as that might be a bigger speed-up in the long run. One to revisit later, I think!
No worries! Thanks for having a look at it. I started the process of rewriting the MAF class line by line this evening to try and eliminate some of the back and forth between numpy/tensorflow and to add in some tf.function decorators where possible. Seems to be going well and hopefully that will help with run time issues. I'll put a PR in at some point.
A small quality-of-life request: a progress bar for training.
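A minimal sketch of what this could look like, wrapped around an epoch loop. This is illustrative, not the package's actual training code: the loss is a dummy stand-in, and the loop falls back to a plain `range` if tqdm is missing.

```python
# Illustrative training-epoch loop with a tqdm progress bar.
# The "loss" here is a placeholder, not a real training loss.
try:
    from tqdm import trange
except ImportError:
    trange = range  # graceful fallback: no bar, same loop semantics

losses = []
epochs = 5
pbar = trange(epochs)
for epoch in pbar:
    loss = 1.0 / (epoch + 1)  # stand-in for the per-epoch training loss
    losses.append(loss)
    if hasattr(pbar, "set_postfix"):
        pbar.set_postfix(loss=f"{loss:.4f}")  # live loss readout on the bar
print(len(losses))
```

Using `set_postfix` means the bar doubles as a live loss monitor, which is handy for long training runs.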
A second, small (but potentially breaking?) change: wrapping the train step in a jit wrapper. Naively this gives a factor-of-2 improvement in run time on both CPU and GPU as far as I can see, but it may need more careful checking on other hardware.
More broadly, I tried putting the training on a GPU and saw little speed-up. I started digging through the code and I think there are quite a lot of conversions to numpy and other non-TF operations, so at the moment there isn't much gain from GPU usage here, but the JIT compilation still looks advantageous.
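The jit-wrapped train step could look something like the sketch below. This is a toy example, not the package's code: the model is a single scalar weight fitted by SGD, and the names are made up. The point is just the shape of the pattern: variables created outside the `tf.function`, and the whole gradient step compiled with `jit_compile=True`. The import is guarded so the sketch can be skipped without TensorFlow.

```python
# Sketch of a jit-compiled train step (toy least-squares fit, w -> 3.0).
final = None
try:
    import tensorflow as tf
except ImportError:
    tf = None

if tf is not None:
    w = tf.Variable(2.0)  # variable created OUTSIDE the tf.function
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)

    @tf.function(jit_compile=True)  # XLA-compile the whole step
    def train_step(x, y):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean((w * x - y) ** 2)
        grads = tape.gradient(loss, [w])
        opt.apply_gradients(zip(grads, [w]))
        return loss

    # Fit w so that w * x == y, i.e. w -> 3.0.
    for _ in range(50):
        train_step(tf.constant([1.0, 2.0]), tf.constant([3.0, 6.0]))
    final = float(w.numpy())
    print(final)
else:
    print("TensorFlow not installed; skipping sketch")
```

Keeping all ops inside the step as pure TF (no numpy round-trips) is what lets XLA fuse the computation, which is likely where the observed factor-of-2 comes from.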