arshamsam opened this issue 2 years ago
The default convergence criterion is a bit too tight, and I usually get a similar message. So if dx and df are still that small (up to about 1e-4), it's usually fine.
Are the default values of dx and df defined in PySCF? Which convergence is this?
Those are just measures of convergence; they have no physical meaning in terms of DFT. This is the SCF-iteration convergence.
My question is: what is the consequence of an SCF cycle that does not converge? Does the training ignore it?
There is an option in `train.py` that you can set: `--always_attach`. If it is `True`, then the training includes the gradients from non-converging elements. If `False`, then it ignores them.
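For example (a sketch only, assuming `--always_attach` is parsed as a boolean switch; it may instead expect an explicit `True`/`False` value):

```
python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf --always_attach
```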
The command `python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf` gives this error:

```
File "/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 297, in _save_model
    raise ValueError(".save_function() not set")
ValueError: .save_function() not set
Epoch 0: 100%|██████████| 42/42 [00:53<00:00, 1.27s/it, loss=nan, v_num=da_x]
```
This error happens if a `ModelCheckpoint` instance is passed to the `callbacks` argument and not the `checkpoint_callback` argument of `Trainer`. Is this the case here?
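A minimal sketch of the distinction being asked about, assuming the pytorch_lightning 1.x API (the `monitor` value is illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

ckpt = ModelCheckpoint(monitor="val_loss", save_last=True)

# Passing the instance via checkpoint_callback lets the trainer wire up
# its save function:
trainer = Trainer(checkpoint_callback=ckpt)

# whereas, per the statement above, passing it only via callbacks could
# leave the save function unset and raise the ValueError shown:
# trainer = Trainer(callbacks=[ckpt])
```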
Can you try pytorch_lightning version 1.2 and show the full error message?
```
Epoch 0:  98%|█████████▉| 41/42 [00:52<00:01, 1.27s/it, loss=nan, v_num=da_x]
converged SCF energy = -107.741186236086
100%|██████████| 7/7 [00:10<00:00, 1.27s/it]
Traceback (most recent call last):
  File "/home/samazadi/work/test_xcnn/../../xcnn/xcdnn2/train2.py", line 258, in
```
Here is the problem when a new calculation starts after one is stopped:
```
samazadi@samazadi-Precision-Tower-7910:~/xcnn/xcdnn2$ python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf
Version: lda_x
Resuming the training from /home/samazadi/.local/lib/python3.9/site-packages/xcdnn2/logs/raw_calcs/default/lda_x/checkpoints/last.ckpt
GPU available: True, used: False
TPU available: None, using: 0 TPU cores
/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: GPU available but not used. Set the --gpus flag when calling the script.
  warnings.warn(*args, **kwargs)
/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: Disable automatic optimization with the trainer flag is deprecated and will be removed in v1.3.0! Please use the property on the LightningModule for disabling automatic optimization
  warnings.warn(*args, **kwargs)
1     Trainable params
0     Non-trainable params
1     Total params
0.000 Total estimated model params size (MB)
Restored states from the checkpoint file at /home/samazadi/.local/lib/python3.9/site-packages/xcdnn2/logs/raw_calcs/default/lda_x/checkpoints/last.ckpt
/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 32 which is the number of cpus on this machine) in the DataLoader init to improve performance.
  warnings.warn(*args, **kwargs)
/home/samazadi/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 32 which is the number of cpus on this machine) in the DataLoader init to improve performance.
  warnings.warn(*args, **kwargs)
Training: 0it [00:00, ?it/s]
Output: 0.771933376789093
```
What's the problem? It's all warnings that can be safely ignored, unless I'm missing something. Also, could you please tidy up the error message next time? It's quite hard to read.
The issue is that the run doesn't start. Sure, I just thought you might need more details.
If that's the case, I usually break it with `ctrl+c` and see where it breaks, or add some print messages to find out where it spends a lot of time.
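As a generic illustration of the print-based approach (not code from this repo):

```python
import time

t0 = time.time()
# ... suspected slow section, e.g. dataset loading or the first SCF run ...
print(f"reached checkpoint A after {time.time() - t0:.1f} s", flush=True)
```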
Well, there is no running task to be stopped!
BTW, how do I know that the calculations have converged successfully?
If you mean the SCF iteration, then it only produces warnings if it does not converge. If you mean the ML training, you can see its convergence plot in TensorBoard (the log directory is set somewhere in the code).
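For reference, the convergence plot can typically be viewed by pointing TensorBoard at the log directory from the command above (assuming the default layout):

```
tensorboard --logdir logs/raw_calcs
```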
> The issue is that the run doesn't start
There is an option in `train.py` that specifies `max_epochs`, which is 1000. If your previous run already reached this, restarting it with the same `max_epochs` will produce the behaviour you described. The solution is just to set `max_epochs` to a higher number.
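A sketch of that fix, assuming `max_epochs` is exposed as a command-line flag of `train.py` (the flag name and the value 2000 are illustrative):

```
python train.py --record --logdir logs/raw_calcs --version lda_x --libxc "lda_x" --pyscf --max_epochs 2000
```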
I reduced the training/validation set:

```python
if tvset == 1:
    train_atoms = ["H", "He", "Li", "Be", "B", "C"]
```
The rootfinder doesn't converge:

```
/home/samazadi/.local/lib/python3.9/site-packages/xcdnn2/evaluator.py:100: ConvergenceWarning: The rootfinder does not converge after 50 iterations. Best |dx|=3.282e-06, |f|=1.273e-06 at iter 6
  warnings.warn(w.message, category=w.category)
```
Can you give more details on the convergence process?