materialsvirtuallab / megnet

Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals
BSD 3-Clause "New" or "Revised" License

Change ReduceLRUponNan callback to EarlyStopping #364

Closed · dgaines2 closed this 2 years ago

dgaines2 commented 2 years ago

Use the Keras backend to change the learning rate in ReduceLRUponNan (instead of re-compiling the model), and make slight changes to the logging.
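
For reference, updating the learning rate through the Keras backend looks roughly like the sketch below (a minimal illustration assuming a compiled tf.keras model; `reduce_lr` and `factor` are hypothetical names, not the exact megnet code):

```python
from tensorflow.keras import backend as K

def reduce_lr(model, factor=0.5):
    """Scale the optimizer's learning rate in place.

    Updating the optimizer variable directly avoids re-compiling the
    model, which would otherwise reset optimizer state.
    """
    old_lr = float(K.get_value(model.optimizer.lr))
    K.set_value(model.optimizer.lr, old_lr * factor)
    return old_lr * factor
```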

dgaines2 commented 2 years ago

I was having some issues with the current implementation of this callback, so I tried to make a change to it. Now that I'm looking at the unit test in test_callbacks.py, I have a quick question. Why is model.fit() called twice on gen? Should this reduce the learning rate twice or just once? Based on the test, it looks like it should just reduce once.

-- never mind, I realized the first test is on self.model and subsequent tests are on model

chc273 commented 2 years ago

@dgaines2

Previously, this was used because the softmax calculation can explode when computing the exponential. I have since fixed the softmax calculation algorithm, and this callback is no longer needed. If NaN happens, I think a better approach is to go back and check the data and optimizers and make sure there isn't anything wrong with those.
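
For reference, the standard numerically stable softmax subtracts the per-row maximum before exponentiating; the shift cancels in the numerator and denominator, so the result is unchanged, but every exponent stays non-positive. A minimal NumPy sketch, not the exact megnet fix:

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Subtracting the max along the softmax axis does not change the
    # result (the shift cancels out), but it keeps every exponent <= 0,
    # so np.exp cannot overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)
```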

Re: the strange behavior of the generator, the initial class was written in TensorFlow 1.x, and the previous version worked as expected. The TensorFlow 2.x version is not compatible with some of the restarting mechanisms. I did not follow up on that because I think this callback is no longer needed. Please let me know otherwise.

dgaines2 commented 2 years ago

Hey Chi, my main reason for using this callback was its early-stopping patience, since the validation metrics aren't available to typical Keras callbacks with the way the main training loop is set up. Perhaps I could refactor this callback into an EarlyStopping callback that's usable without the _reduce_lr_and_load function. Not sure how you feel about this, but if you're open to it, I should be able to refactor it pretty quickly. Otherwise I can just write a callback on my own fork.
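
The idea, roughly, is a callback that computes the validation metric itself instead of reading it from `logs`. A minimal sketch, assuming held-out arrays and an MAE metric (class and argument names are illustrative, not the final implementation in this PR):

```python
import numpy as np
from tensorflow.keras.callbacks import Callback

class ValidationEarlyStopping(Callback):
    """Compute a validation MAE manually at each epoch end (since the
    metric is not present in `logs` here) and stop training after
    `patience` epochs without improvement."""

    def __init__(self, val_inputs, val_targets, patience=10):
        super().__init__()
        self.val_inputs = val_inputs
        self.val_targets = val_targets
        self.patience = patience
        self.best = np.inf
        self.wait = 0

    def on_epoch_end(self, epoch, logs=None):
        pred = self.model.predict(self.val_inputs)
        mae = float(np.mean(np.abs(pred - self.val_targets)))
        if mae < self.best:
            self.best, self.wait = mae, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.model.stop_training = True
```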

chc273 commented 2 years ago

@dgaines2 I think it is a great idea. It is definitely better to have early stopping separate from the current callback.

dgaines2 commented 2 years ago

So, the only issue I can see arising now is a conflict between the patience in ReduceLRUponNan and a separate EarlyStopper callback -- both might fire on the same epoch end and probably log twice. I could get rid of patience in ReduceLRUponNan entirely (if that's okay with you), or I can leave it as is for backwards compatibility. Do you have an opinion on how to handle this?

chc273 commented 2 years ago

@dgaines2 Thanks. I think it makes more sense to get rid of patience in ReduceLRUponNan. IMO, this callback is not useful anymore; it is better to just have and use a new EarlyStopper.

dgaines2 commented 2 years ago

I ended up removing ReduceLRUponNan entirely -- but if you'd rather I keep it and just remove the patience logic, I could do that instead. Let me know if you have any requested changes.

chc273 commented 2 years ago

@dgaines2 Merged. Thank you!