mfkasim1 / xcnn

xc deep neural network (XCNN) with a differentiable DFT

minimal training #2

Open arshamsam opened 2 years ago

arshamsam commented 2 years ago

I reduced the training/validation set:

    if tvset == 1:
        # train_atoms = ["H", "He", "Li", "Be", "B", "C"]
        train_atoms = ["C"]
        # val_atoms = ["N", "O", "F", "Ne"]
        val_atoms = ["N"]
    elif tvset == 2:  # randomly selected
        # train_atoms = ["H", "Li", "B", "C", "O", "Ne"]
        train_atoms = ["C"]
        # val_atoms = ["He", "Be", "N", "F", "P", "S"]
        val_atoms = ["N"]
The rootfinder doesn't converge:

    /home/samazadi/.local/lib/python3.9/site-packages/xcdnn2/evaluator.py:100: ConvergenceWarning: The rootfinder does not converge after 50 iterations. Best |dx|=3.282e-06, |f|=1.273e-06 at iter 6
      warnings.warn(w.message, category=w.category)

Can you give more details on the convergence process?

mfkasim1 commented 2 years ago

The default convergence criteria are a bit too tight, and I usually get a similar message. So if dx and df are still that small (up to about 1e-4), it is usually fine.
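If you have checked that the magnitudes are acceptable and just want to silence the message, here is a minimal sketch using only Python's standard warnings module (it matches on the message text shown above rather than importing the library's ConvergenceWarning class, whose import path may differ between versions):

    import warnings

    # Suppress the specific rootfinder message by matching its prefix,
    # without importing the library's ConvergenceWarning class.
    warnings.filterwarnings(
        "ignore",
        message="The rootfinder does not converge",
    )

    # Or promote it to an error so the run stops and you can inspect |dx| and |f|:
    # warnings.filterwarnings("error", message="The rootfinder does not converge")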

arshamsam commented 2 years ago

Are the default values of dx and df the ones defined in PySCF? Which convergence is this?

mfkasim1 commented 2 years ago

Those are just measures of convergence and have no physical meaning in terms of DFT. This is the SCF-iteration convergence.

arshamsam commented 2 years ago

My question is: what is the consequence of an SCF cycle that does not converge? Does the training ignore it?

mfkasim1 commented 2 years ago

There is an option in train.py that you can set: --always_attach. If it is True, the training includes the gradient from the non-converging element; if it is False, it ignores it.
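For example (just a sketch; I am not sure whether the flag is a plain switch or takes an explicit True/False, so check python train.py --help first):

    # include the gradient from non-converging elements in the training
    python train.py --record --logdir logs/raw_calcs --version lda_x --always_attach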

arshamsam commented 2 years ago

Running

    python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf

gives this error:

File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 297, in _save_model raise ValueError(".save_function() not set") ValueError: .save_function() not set Epoch 0: 100%|██████████| 42/42 [00:53<00:00, 1.27s/it, loss=nan, v_num=da_x]

arshamsam commented 2 years ago

Is this the case here: "This error happens if a ModelCheckpoint instance is passed to the callback argument and not the checkpoint_callback argument of Trainer"?
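For reference, a minimal sketch of the two wirings that quote distinguishes (based only on the quoted explanation and the older pytorch_lightning 1.x API in use here; the exact behaviour changed between versions):

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    ckpt = ModelCheckpoint(save_last=True)

    # Wiring the quote blames for the error: instance passed only via `callbacks`
    # trainer = pl.Trainer(callbacks=[ckpt])

    # Wiring the quote suggests instead (accepted by the older 1.x API)
    trainer = pl.Trainer(checkpoint_callback=ckpt)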

mfkasim1 commented 2 years ago

Can you try pytorch_lightning version 1.2 and show the full error message?

arshamsam commented 2 years ago

    Epoch 0:  98%|█████████▉| 41/42 [00:52<00:01, 1.27s/it, loss=nan, v_num=da_x]
    converged SCF energy = -107.741186236086
    Traceback (most recent call last):
      File "/home/samazadi/work/test_xcnn/../../xcnn/xcdnn2/train2.py", line 258, in <module>
        bestval = run_training(hparams)
      File "/home/samazadi/work/test_xcnn/../../xcnn/xcdnn2/train2.py", line 163, in run_training
        trainer.fit(plsystem,
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
        results = self.accelerator_backend.train()
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/accelerators/cpu_accelerator.py", line 48, in train
        results = self.train_or_test()
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
        results = self.trainer.train()
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
        self.train_loop.run_training_epoch()
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
        self.trainer.run_evaluation(test_mode=False)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 610, in run_evaluation
        self.evaluation_loop.on_evaluation_end()
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 109, in on_evaluation_end
        self.trainer.call_hook('on_validation_end', *args, **kwargs)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 823, in call_hook
        trainer_hook(*args, **kwargs)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
        callback.on_validation_end(self, self.get_model())
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 167, in on_validation_end
        self.save_checkpoint(trainer, pl_module)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 213, in save_checkpoint
        self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 494, in _save_top_k_checkpoints
        self._update_best_and_save(filepath, current, epoch, trainer, pl_module)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 543, in _update_best_and_save
        self._save_model(filepath, trainer, pl_module)
      File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 297, in _save_model
        raise ValueError(".save_function() not set")
    ValueError: .save_function() not set
    Epoch 0: 100%|██████████| 42/42 [00:53<00:00, 1.27s/it, loss=nan, v_num=da_x]

arshamsam commented 2 years ago

Here is the problem when a new calculation is started after one has been stopped:

    samazadi@samazadi-Precision-Tower-7910:~/xcnn/xcdnn2$ python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf
    Version: lda_x
    Resuming the training from /home/samazadi/.local/lib/python3.9/site-packages/xcdnn2/logs/raw_calcs/default/lda_x/checkpoints/last.ckpt
    GPU available: True, used: False
    TPU available: None, using: 0 TPU cores
    /home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
      warnings.warn(*args, **kwargs)
    /home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: Disable automatic optimization with the trainer flag is deprecated and will be removed in v1.3.0! Please use the property on the LightningModule for disabling automatic optimization
      warnings.warn(*args, **kwargs)

      | Name | Type           | Params
    ------------------------------------
    0 | evl  | PySCFEvaluator | 0
    ------------------------------------
    1         Trainable params
    0         Non-trainable params
    1         Total params
    0.000     Total estimated model params size (MB)

    Restored states from the checkpoint file at /home/samazadi/.local/lib/python3.9/site-packages/xcdnn2/logs/raw_calcs/default/lda_x/checkpoints/last.ckpt
    /home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
      warnings.warn(*args, **kwargs)
    /home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
      warnings.warn(*args, **kwargs)
    Training: 0it [00:00, ?it/s]
    Output: 0.771933376789093

mfkasim1 commented 2 years ago

What's the problem? Those are all warnings that can be safely ignored, unless I'm missing something. Also, could you please tidy up the error message next time? It's quite hard to read.

arshamsam commented 2 years ago

The issue is that the run doesn't start. Sure, I just thought you might need more details.

mfkasim1 commented 2 years ago

If that's the case, I usually break it with Ctrl+C and see where it breaks, or add some printed messages to find out where it spends a lot of time.

arshamsam commented 2 years ago

Well, there is no running task to be stopped!

arshamsam commented 2 years ago

BTW, how do I know that calculations are converged successfully?

mfkasim1 commented 2 years ago

> BTW, how do I know that calculations are converged successfully?

If you mean the SCF iteration, it only produces warnings if it does not converge. If you mean the ML training, you can see its convergence plot in TensorBoard (the logging is set up somewhere in the code).
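For example, assuming the Lightning logs end up under the logs/raw_calcs directory used in the commands above:

    # then open http://localhost:6006 in a browser to view the training/validation curves
    tensorboard --logdir logs/raw_calcs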

mfkasim1 commented 2 years ago

> The issue is that the run doesn't start. Sure, I just thought you might need more details.

There is an option in train.py that specifies max_epochs, which is 1000. If your previous run has already reached this, restarting it with the same max_epochs will produce the behaviour you described. The solution is just to set max_epochs to a higher number.
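For example (just a sketch; I'm assuming the option is exposed as --max_epochs like the other flags above, so check python train.py --help for the exact name):

    # resume from the saved checkpoint and allow the run to train further
    python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf --max_epochs 2000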