[WIP] Fix restart & multi-target dipole

tautomer commented 1 year ago

Both should be working fine, but I think I should do more tests at this moment, hence the "WIP" tag.

Restarting

It turns out that there are some bugs in the new implementation of restarting.

When neither map_location nor model_device is set, the code will try to move tensors around, which should not happen
When map_location is used, None will be assigned to evaluator's model_device.

Reloading model is suffered from 1) as well.

After fixing restart (again), the logic will be like this

Scenarios	behavior
map_location=None, model_device=None	Don't move tensors or set evaluator.model_device
map_location set, model_device=None	Don't move tensors, but set evaluator.model_device
map_location=None. model_device="cpu"	Don't move tensors, but set evaluator.model_device
map_location=None. model_device set	Move tensors and set evaluator.model_device
map_location set, model_device set	Error

The old code will treat the first case as the 4th one, which will throw an error. For scenario 2, model_device variable is unset, so the device is now determined from one tensor in the checkpoint. 2 and 3 are in the same if, so the same treatment. We can probably add one more if so model_device is only checked again if it's scenario 2.

Time for creating the unit tests? When testing manually, I forgot to include scenario 1.

Dipole

The multi-target version implementation looks fine. Training only dipole, I get different histograms if comparing state by state between 5 single-target nodes and one 5-target node, but the histograms do look similar. I believe it's working, but let me collect more evidence to be sure.

lubbersnick commented 1 year ago

Hiya.

1) let me know when tests are done and we can merge in. 2) I do not object to putting test scripts in the repo. At some hopefully-not-too-far date we can use an automated testing tool. For now I would suggest we put them in a new directory /tests/.py. An important constraint should be that they only refer to local directory structure - ideally just a completely self contained script- so they can be run without having to include any additional assets (or as few as possible) in the repository itself. Putting your tests in doesn't have to be part of this PR - and if we're feeling sketchy on how to do it we could open a new branch for that process.

Thanks!

tautomer commented 1 year ago

let me know when tests are done and we can merge in.

I will do more tests during the holiday.

2. I do not object to putting test scripts in the repo. At some hopefully-not-too-far date we can use an automated testing too

I have branch in my fork called unittest. Still very basic. For tesing realoding, etc, we have to include some files.

tautomer commented 1 year ago

Done testing.

I think both changes are good.

lanl / hippynn

[WIP] Fix restart & multi-target dipole #21

Restarting

Dipole