How to train large number of environments

libAtoms / QUIP

libAtoms/QUIP molecular dynamics framework: https://libatoms.github.io

How to train large number of environments #236

Open bpfrd opened 4 years ago

bpfrd commented 4 years ago

Hello,

For the purpose of comparison, I have to train on a large number of configurations (for example 20,000 configurations, i.e. 40*20000 atomic environments) with GAP and the SOAP fingerprint. I get the error below even when I request 900G of memory in my submission script. Is it possible to resolve that? For 500 configurations it works properly.

ulimit -s unlimited
set OMP_STACKSIZE=10000000000000000000

result:

SYSTEM ABORT: Traceback (most recent call last - error kind IO):
File "quip.f95", line 337 kind unspecified
File "CInOutput.f95", line 922 kind unspecified
File "CInOutput.f95", line 579 kind unspecified
File "/kernph/dedeb/bin/QUIP/src/libAtoms/xyz.c", line 765 kind IO
Missing value for parameter "dft_energy"

forrtl: error (76): Abort trap signal
Image PC Routine Line Source
quip 0000000000E5A185 Unknown Unknown Unknown
libpthread-2.17.s 00002B917B2945F0 Unknown Unknown Unknown
libc-2.17.so 00002B917B6DB337 gsignal Unknown Unknown
libc-2.17.so 00002B917B6DCA28 abort Unknown Unknown
quip 0000000000405C06 Unknown Unknown Unknown
quip 00000000009A2706 error_module_mp_e 319 error.f95
quip 00000000009A25B0 error_module_mp_e 340 error.f95
quip 00000000004083E4 MAIN 337 quip.f95
quip 0000000000EAFE96 Unknown Unknown Unknown
libc-2.17.so 00002B917B6C7505 libc_start_main Unknown Unknown
quip 0000000000405C2F Unknown Unknown Unknown
srun: error: shi72: task 0: Aborted (core dumped)

Best regards
Behnam

gabor1 commented 4 years ago

um… the clue is >>>>>Missing value for parameter “dft_energy”<<<<<

Your XYZ file is faulty.

For comparison, we use a machine with 1500 GB of memory to train on about 400,000 scalar data points with 10,000 sparse points (basis functions). You have 20,000 + 3*40*20,000 = 2,420,000 scalar data points (I assume you have forces). We’ve never trained on that big a database (because it was never needed!). You will have to reduce the number of sparse points. I really would recommend against it. If you are comparing against a neural network, you should stick to a good number of sparse points that gives you high accuracy, and slowly increase the number of input configurations until you reach your desired accuracy (maybe the same as or better than the NN), then see how many configurations you needed to achieve that. Why go further?
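The data-point arithmetic above can be reproduced in a few lines. This is only a sketch: the function names are made up for illustration, and the size estimate assumes a single dense double-precision matrix of (data points x sparse points). That is a lower bound for a sparse fit of this kind, not a description of what gap_fit actually allocates.

```python
def scalar_data_points(n_configs, atoms_per_config, with_forces=True):
    """One energy per configuration plus three force components per atom."""
    n = n_configs
    if with_forces:
        n += 3 * atoms_per_config * n_configs
    return n


def dense_matrix_gb(n_data, n_sparse, bytes_per_value=8):
    """Size in GB (1e9 bytes) of one dense n_data x n_sparse float64 matrix."""
    return n_data * n_sparse * bytes_per_value / 1e9


n_data = scalar_data_points(20_000, 40)
print(n_data)                            # 2420000
print(dense_matrix_gb(n_data, 10_000))   # 193.6
print(dense_matrix_gb(400_000, 10_000))  # 32.0
```

The real memory footprint will be larger than this single matrix, so the numbers are best read as a way to compare dataset sizes rather than as a memory request for a job script.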

Also, you can’t make your stack size that big. (And if the system lets you, it may reduce available memory on the heap!)

-- Gábor

Gábor Csányi Professor of Molecular Modelling Engineering Laboratory, University of Cambridge Pembroke College Cambridge

Pembroke College supports CARA. A Lifeline to Academics at Risk. http://www.cara.ngo/

bpfrd commented 4 years ago

Thanks for your reply. I double-checked the XYZ file and it seems correct to me. It has the following format:

81
dft_energy=-0.83450731467729398E+01 pbc="T T T" Lattice="40.00000000 0.00000000 0.00000000 0.00000000 40.00000000 0.00000000 0.00000000 0.00000000 40.00000000" Properties=species:S:1:pos:R:3:Z:I:1:dft_force:R:3
Au 0.19521263724407511E+02 0.11970511101910786E+02 0.32523314883618269E+02 79 0.37124238247191382E-03 -0.15116878266811149E-02 0.15951131030196401E-02
Au 0.21771924683745674E+02 0.24413529246805833E+02 0.11657368253981389E+02 79 -0.38378602492135914E-03 0.73225724884043203E-03 -0.31714109368341596E-03

I think the problem is that it cannot read large datasets. I successfully trained on a small training set and then tried to test the model on a large validation set, and I got the same error, specifically at: "quip.f95", line 337

Best regards
Behnam
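Checking one frame by eye does not rule out a broken frame further down a 20,000-frame file. A minimal sketch of a streaming check is given below. It assumes the layout shown above (atom count, comment line, then one line per atom), and the function name `check_xyz` and its messages are hypothetical, not part of QUIP. It catches only simple problems: a missing or non-numeric `dft_energy` value, a bad atom count, or a short atom line.

```python
import re
import tempfile


def check_xyz(path, key="dft_energy"):
    """Scan an extended XYZ file; return (frame, line_number, message) tuples."""
    pattern = re.compile(r"(?:^|\s)" + re.escape(key) + r"=(\S*)")
    problems = []
    frame = 0
    line_no = 0
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break  # end of file
            line_no += 1
            if not header.strip():
                continue  # tolerate blank lines between frames
            try:
                n_atoms = int(header.split()[0])
            except ValueError:
                problems.append((frame, line_no, "expected an atom count"))
                break  # cannot resynchronise after a bad count
            comment = fh.readline()
            line_no += 1
            match = pattern.search(comment)
            if match is None:
                problems.append((frame, line_no, "no %s key" % key))
            else:
                try:
                    float(match.group(1))
                except ValueError:
                    problems.append((frame, line_no, "bad %s value" % key))
            for _ in range(n_atoms):
                line_no += 1
                if len(fh.readline().split()) < 4:
                    problems.append((frame, line_no, "short or missing atom line"))
                    break
            frame += 1
    return problems


# Tiny made-up file: the second frame has an empty dft_energy value.
good = '2\ndft_energy=-1.0 pbc="T T T"\nAu 0.0 0.0 0.0\nAu 0.0 0.0 2.5\n'
bad = '1\ndft_energy= pbc="T T T"\nAu 0.0 0.0 0.0\n'
with tempfile.NamedTemporaryFile("w", suffix=".xyz", delete=False) as tmp:
    tmp.write(good + bad)
print(check_xyz(tmp.name))  # [(1, 6, 'bad dft_energy value')]
```

Because the file is read one line at a time, memory use stays flat regardless of file size, so the check can be run on the full dataset before training.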

gabor1 commented 4 years ago

ok. So if your data file is so huge that we can’t read it in (and I agree with you that the test you did by just evaluating on it is a good test), then there is no quick fix. I bet you don’t need this much data to beat other methods that use the data sequentially (batch-training). We are unlikely to implement batch training because it’s not needed to get good models…
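The incremental approach suggested earlier in the thread (start small and add configurations until the accuracy target is met) needs a series of training files of increasing size. The sketch below is one way to produce them without loading the whole file into memory. It assumes the plain frame layout shown above; the function names and the output naming scheme `<path>.<n>.xyz` are hypothetical.

```python
import random


def iter_frames(path):
    """Yield each extended XYZ frame as a list of lines, one frame at a time."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header.strip():
                return  # end of file (or a blank line) ends the scan
            n_atoms = int(header.split()[0])
            yield [header, fh.readline()] + [fh.readline() for _ in range(n_atoms)]


def write_nested_subsets(path, sizes, seed=0):
    """Write <path>.<n>.xyz for each n in sizes; smaller sets are subsets of larger ones."""
    n_frames = sum(1 for _ in iter_frames(path))
    order = list(range(n_frames))
    random.Random(seed).shuffle(order)
    # rank[i] is the position of frame i in the shuffled order
    rank = {frame: pos for pos, frame in enumerate(order)}
    outputs = {n: open("%s.%d.xyz" % (path, n), "w") for n in sizes}
    try:
        for i, lines in enumerate(iter_frames(path)):
            for n, out in outputs.items():
                if rank[i] < n:
                    out.writelines(lines)
    finally:
        for out in outputs.values():
            out.close()
    return sorted(outputs)
```

A call such as `write_nested_subsets("train.xyz", [500, 1000, 2000])` (the file name is a placeholder) would give three files whose test errors can be compared as a learning curve.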

bpfrd commented 4 years ago

Thanks for your reply. We actually have a very diverse data set, and when we train on a smaller data set we don't get the desired accuracy for the predicted forces (please see below for the parameters we used). I was wondering if you also have a standalone version of the SOAP fingerprint and its derivatives, so that we can connect it to an NN?

default_sigma={0.008 0.04 0 0} \
gap={ soap cutoff=6.0 \
covariance_type=dot_product \
zeta=2 \
delta=0.016 \
atom_sigma=0.7 \
l_max=4 \
n_max=8 \
n_sparse=100 \
sparse_method=cur_points} 2>&1 | grep -v FoX

Best regards Behnam

gabor1 commented 4 years ago

Your hyperparameters don’t make sense. Why only 100 sparse points? That will limit your accuracy much more than the data size. Also, 0.008 as your target energy accuracy is pretty high (that is, loose); you won’t get accurate models this way (a typical heuristic is to set the energy target accuracy to the square of the force target accuracy). I would set zeta=4 for high accuracy. atom_sigma=0.7 is pretty large unless your only atoms are very large (4th row and beyond); we use 0.5 for the second row and 0.3 if you have H. I see that you have Au, but do you have anything else? Why are you not fitting stresses?
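As a worked example of the heuristic just mentioned, applied to the values posted above: the sketch assumes the second entry of `default_sigma` (0.04) is the force target, and the function name is made up for illustration.

```python
def energy_sigma_from_force_sigma(force_sigma):
    """Heuristic from the thread: energy target = (force target) squared."""
    return force_sigma ** 2


force_sigma = 0.04     # assumed to be the force entry of default_sigma
posted_energy = 0.008  # energy entry of default_sigma as posted
energy_sigma = energy_sigma_from_force_sigma(force_sigma)

print(round(energy_sigma, 6))                  # 0.0016
print(round(posted_energy / energy_sigma, 2))  # 5.0
```

By this rule of thumb the posted energy target is about five times looser than the force target would suggest.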

Your delta is very small. Do you have any other descriptors or baselines? The delta should be around the standard deviation of your target function, per descriptor. SOAP is a per-atom descriptor, so if you use it without any other descriptor to fit DFT data, your delta should be the energy variance per atom, or the binding energy per atom, which is a couple of eV! How are you setting e0? I usually advise setting it to the isolated atom energy computed in the same way as the rest of your dataset (so spin 0 and with smearing), in which case you are fitting the binding energy, and delta (in the case of a single SOAP, as here) should be the average binding energy per atom.
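The delta guidance above amounts to a short calculation once e0 is fixed. The sketch below assumes a single-species dataset; the function names and all numbers are hypothetical and serve only to illustrate the arithmetic.

```python
from statistics import mean, pstdev


def binding_energy_per_atom(total_energy, n_atoms, e0):
    """(E - N * e0) / N for a single-species configuration."""
    return (total_energy - n_atoms * e0) / n_atoms


def suggest_delta(configs, e0):
    """Return (mean |binding energy per atom|, its spread) over (energy, n_atoms) pairs."""
    per_atom = [binding_energy_per_atom(e, n, e0) for e, n in configs]
    return abs(mean(per_atom)), pstdev(per_atom)


# Made-up total energies in eV with their atom counts, and a made-up e0.
configs = [(-243.0, 81), (-240.6, 81), (-120.4, 40)]
avg, spread = suggest_delta(configs, e0=-0.2)
print("average binding energy per atom: %.2f eV" % avg)
print("spread across configurations:    %.3f eV" % spread)
```

With these illustrative inputs the average comes out at a few eV per atom, which is the scale the reply above says delta should have for a single SOAP descriptor.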

Are you using finite electronic smearing? Are you using sufficient k-point sampling (a k-spacing of 0.2 in VASP units, 0.03 in CASTEP units (inverse angstrom))? Are you using the electronic free energy rather than the potential energy? (Only the free energy is consistent with the forces.)

Send me energy-energy and force-force target-predicted scatter plots too, I can advise based on that. Also, can you tell me more details about the diversity of your dataset?

When we have diverse data sets (like for the PRX silicon paper, or the 2020 C model) we always choose different target accuracies for different parts of the data set, e.g. we fit the liquid with a lower target accuracy, because liquid properties don’t need it, whereas solid properties do, so fitting the liquid more loosely lets the fit gain accuracy for the solid.

If you have multiple atom types, before thinking that you need so much data, you should go multi-scale with two SOAPs.

The python hooks in quippy indeed give descriptors and their derivatives. But I think you have a lot more to explore before you go down the road of using NNs. And if I felt that large amounts of data are needed, I would probably use an iterative solver for the kernel matrix, which is mathematically closer to what stochastic gradient descent does for NNs, but we haven’t set it up yet because it was never ever needed. I would be astonished if you needed it. We have fit three-component random alloy systems with liquid, amorphous and crystalline forms and still it fit within 1.5TB of RAM and 400k scalar data points.
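To illustrate what an iterative solver for a kernel system looks like, here is a minimal conjugate-gradient sketch in plain Python. It is a generic textbook method applied to a toy 3x3 system, not anything implemented in QUIP (the reply above says no such solver has been set up), and every name in it is made up for illustration. The point is that the solver only needs matrix-vector products, so the full matrix never has to be factorised.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting guess x = 0
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Toy regularised kernel system (K + sigma^2 I) alpha = y with made-up numbers.
K = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]]
sigma2 = 0.01
y = [1.0, 2.0, 3.0]


def regularised_kernel_matvec(v):
    """Apply (K + sigma^2 I) to a vector without forming the sum explicitly."""
    return [sum(K[i][j] * v[j] for j in range(3)) + sigma2 * v[i] for i in range(3)]


alpha = conjugate_gradient(regularised_kernel_matvec, y)
residual = max(abs(a - b) for a, b in zip(regularised_kernel_matvec(alpha), y))
print(residual < 1e-8)  # True
```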

gabor1 commented 4 years ago

If you don't want to share these details here in the open forums, email me at gc121@cam.ac.uk

bpfrd commented 4 years ago

Thanks for your reply. Our system is Au molecules generated in MD coupled to minima hopping. We have energies and forces but not stress values. I used the new parameters that you suggested, but now I can only train on a smaller subset than before (fewer than 100 data points). I think something must be wrong with my submission script, so I have attached the submission script and the training set (1000 data points), as well as plots of the energies and forces of the training set. I was wondering if you get the same error as I do. Best regards Behnam

files.zip

gabor1 commented 4 years ago

Can I just check your units? Your forces seem tiny. Are these in eV/A and energies in eV ?

bpfrd commented 4 years ago

Yes. Energies and forces are in eV and eV/A.

gabor1 commented 4 years ago

So why are all the forces tiny?
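To put a number on "tiny", the force magnitudes can be computed from the two atom lines quoted earlier in the thread. This is a small sketch; it assumes the last three columns are the `dft_force` components, as the `Properties` string in that post indicates, and the function name is hypothetical.

```python
import math

# The two atom lines posted earlier; columns follow
# Properties=species:S:1:pos:R:3:Z:I:1:dft_force:R:3
atom_lines = [
    "Au 0.19521263724407511E+02 0.11970511101910786E+02 0.32523314883618269E+02 79 "
    "0.37124238247191382E-03 -0.15116878266811149E-02 0.15951131030196401E-02",
    "Au 0.21771924683745674E+02 0.24413529246805833E+02 0.11657368253981389E+02 79 "
    "-0.38378602492135914E-03 0.73225724884043203E-03 -0.31714109368341596E-03",
]


def force_magnitude(line):
    """Norm of the last three columns (the force vector) of an atom line."""
    fx, fy, fz = (float(v) for v in line.split()[-3:])
    return math.sqrt(fx * fx + fy * fy + fz * fz)


for line in atom_lines:
    print("%.2e" % force_magnitude(line))
```

Both values come out at roughly 1e-3 in the file's units. Running the same function over every atom line of the training set would show whether these two atoms are representative.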

gabor1 commented 4 years ago

Your energies are also tiny. You claim your e0 (isolated atom energy) is 0.01, so for a cluster I would expect significant binding energy, several eV per atom, even tens of eV. Something is seriously wrong here.

bpfrd commented 4 years ago

Sorry I made a mistake in the units. Energy is in Ha and forces in Ha/Bohr.

gabor1 commented 4 years ago

Take frame 564 for example. It has two Au atoms, 15 A apart. That should have its energy equal to 2*E0, so 2*(-0.01604486) = -0.03… but your file has dft_energy=-0.0023
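The check described here can be sketched in a few lines of Python. This is only an illustration and not part of QUIP: the function name and the tolerance are assumptions, and the numbers are the ones quoted in this comment.

```python
def separated_atoms_check(e_total, e0, n_atoms=2, tol=1e-3):
    """Compare a frame's total energy with n_atoms * e0.

    For atoms separated far beyond the interaction range, the total
    energy should equal the sum of the isolated-atom energies.
    The tolerance `tol` is an arbitrary illustrative choice.
    """
    expected = n_atoms * e0
    diff = e_total - expected
    return expected, diff, abs(diff) <= tol


# Values quoted in the thread (same units as in the training file).
expected, diff, ok = separated_atoms_check(e_total=-0.0023, e0=-0.01604486)
print(f"expected={expected:.5f} difference={diff:.5f} consistent={ok}")
```

Running the same comparison over every frame that contains only well-separated atoms is a cheap way to catch an inconsistent e0 or mixed units before fitting.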

gabor1 commented 4 years ago

ok. So try to fix the units and see what you get.

Also, I think your energies are not in Ha, but in Ry = 0.5 Ha. Is that possible?
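For reference, the conversion to eV and eV/Å can be sketched as below. The conversion factors are the CODATA 2018 values; the function names and the unit labels passed as strings are assumptions made for this example, not anything QUIP provides.

```python
HARTREE_TO_EV = 27.211386245988      # CODATA 2018
BOHR_TO_ANGSTROM = 0.529177210903    # CODATA 2018
RYDBERG_TO_EV = HARTREE_TO_EV / 2.0  # 1 Ry = 0.5 Ha


def energy_to_ev(value, unit="Ha"):
    """Convert an energy given in Ha, Ry or eV to eV."""
    factors = {"Ha": HARTREE_TO_EV, "Ry": RYDBERG_TO_EV, "eV": 1.0}
    return value * factors[unit]


def force_to_ev_per_angstrom(value, unit="Ha/Bohr"):
    """Convert a force component given in Ha/Bohr or Ry/Bohr to eV/Å."""
    factors = {
        "Ha/Bohr": HARTREE_TO_EV / BOHR_TO_ANGSTROM,
        "Ry/Bohr": RYDBERG_TO_EV / BOHR_TO_ANGSTROM,
        "eV/A": 1.0,
    }
    return value * factors[unit]


# The same number interpreted as Ha or as Ry differs by a factor of two.
e0 = -0.01604486
print(energy_to_ev(e0, "Ha"), energy_to_ev(e0, "Ry"))
```

Converting one frame under both assumptions and comparing against a known reference energy is one way to decide between Ha and Ry.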

gabor1 commented 4 years ago

In your job submission script, why are you not using a gap_fit that is compiled with OpenMP? Your program will run in serial unless you use a QUIP_ARCH that ends in openmp, such as linux_x86_64_gfortran_openmp.

(I recommend sticking to GNU compilers; we rarely test with ifort these days, and it’s more finicky for no real advantage. You can still use MKL of course.)

gabor1 commented 4 years ago

For testing purposes, I would drop to even fewer training structures. You should be getting decent results with just a few hundred training structures, because you have quite a lot of atoms per structure. I recommend you start with 200 structures and get the maximum accuracy there, i.e. find the lowest default_sigma values that you can still REACH on the test set as well, and explore increasing n_sparse to 1000, 2000 etc. until it doesn’t help significantly any more (i.e. error reduction is < 20%). Then you can say you are limited by data, and add more training structures. This also keeps the training time and memory requirements much smaller while giving you a good feel for what’s going on.
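The n_sparse part of this procedure can be sketched as a simple loop. This is only an outline under assumptions: `fit_and_test` is a hypothetical user-supplied callable that trains with the given n_sparse and returns the test-set error, and the error values in the usage example are made-up placeholders, not results from this dataset.

```python
def converge_n_sparse(fit_and_test, n_sparse_values, min_gain=0.20):
    """Increase n_sparse until the relative error reduction drops below min_gain.

    fit_and_test: hypothetical callable, n_sparse -> test-set error.
    Returns the list of (n_sparse, error) pairs that were evaluated.
    """
    history = []
    for n_sparse in n_sparse_values:
        error = fit_and_test(n_sparse)
        history.append((n_sparse, error))
        if len(history) > 1:
            previous = history[-2][1]
            # Stop once the improvement over the previous fit is too small.
            if previous <= 0 or (previous - error) / previous < min_gain:
                break
    return history


# Stand-in for a real fit: made-up errors, for illustration only.
toy_errors = {1000: 0.010, 2000: 0.006, 4000: 0.0055, 8000: 0.0054}
history = converge_n_sparse(toy_errors.__getitem__, [1000, 2000, 4000, 8000])
for n_sparse, error in history:
    print(n_sparse, error)
```

Once the loop stops, the model is limited by data rather than by the number of sparse points, which is the signal to add more training structures.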

bpfrd commented 4 years ago

I fixed the units and got very good training error with 200 structures in train set and n_sparse=1000. I found some unphysical structures in my data that were causing the error. The problem was in some of the structures there was a space between energy label and the numbers like "dft_energy= number". Now it can handle large files. I didn't generate the data myself. So I should check with my colleagues to find the source of the unphysical energies and forces. Thanks for noticing that, Best regards Behnam
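A scan for this kind of formatting problem could look roughly like the sketch below. It assumes the usual extended XYZ layout (atom count, then one comment line of key=value pairs, then the atom lines); the function names are made up for this example and the sample frames are invented for illustration.

```python
import re

_QUOTED = re.compile(r'"[^"]*"')
# A key with whitespace on either side of '=' (e.g. "dft_energy= -1.0").
_SPACED_EQUALS = re.compile(r'\b(\w+)(?:\s+=\s*|=\s+)')


def find_spaced_keys(comment_line):
    """Return the keys whose '=' is separated from the key or value by spaces."""
    # Blank out quoted values so their contents cannot trigger false positives.
    stripped = _QUOTED.sub('""', comment_line)
    return [m.group(1) for m in _SPACED_EQUALS.finditer(stripped)]


def scan_xyz(lines):
    """Return (frame_index, line_number, keys) for each suspicious comment line."""
    problems = []
    i = 0
    frame = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1
            continue
        n_atoms = int(lines[i].split()[0])
        comment = lines[i + 1] if i + 1 < len(lines) else ""
        keys = find_spaced_keys(comment)
        if keys:
            problems.append((frame, i + 2, keys))  # 1-based line number
        i += n_atoms + 2
        frame += 1
    return problems


# Invented two-frame example; the first comment line has the stray space.
sample = [
    "2",
    "dft_energy= -0.0023 Properties=species:S:1:pos:R:3",
    "Au 0.0 0.0 0.0",
    "Au 15.0 0.0 0.0",
    "2",
    'Lattice="20 0 0 0 20 0 0 0 20" dft_energy=-0.0023',
    "Au 0.0 0.0 0.0",
    "Au 0.0 0.0 15.0",
]
problems = scan_xyz(sample)
print(problems)
```

For a real file, the lines can be read with `open(path).read().splitlines()` and passed to `scan_xyz`.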

gabor1 commented 4 years ago

Oh great, so the “cannot find dft_energy” in fact was a problem with the file. Good!

I’m still keen to help you get the best possible model, so do let me know how good your errors are.

There is a general rule of thumb that cutoff should be n_max*atom_sigma, so if you want to stick with a cutoff of ~5 and an atom_sigma of 0.5, then increase n_max to 10 or 12.
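As a quick illustration of that rule of thumb, the smallest n_max satisfying n_max*atom_sigma >= cutoff can be computed as follows. The helper name is an assumption for this sketch; it is not a QUIP function.

```python
import math


def suggest_n_max(cutoff, atom_sigma):
    """Smallest integer n_max with n_max * atom_sigma >= cutoff."""
    # Round first so floating-point noise (e.g. 6 / 0.3) does not bump the result.
    return math.ceil(round(cutoff / atom_sigma, 9))


print(suggest_n_max(5.0, 0.5))
```

This gives a lower bound only; going somewhat above it, as suggested in the comment, is a judgment call.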

To squeeze the last bit of accuracy out of your data, you might want to do “double soap”, which is a multi scale model:

gap={soap cutoff=4 cutoff_transition_width=1 atom_sigma=0.5 .. .. .. : soap cutoff=8 cutoff_transition_width=2 atom_sigma=1 … … }

Keep the rest of the parameters (n_max, l_max and the others) the same for both. It will be twice as expensive to train and run, but might be more accurate than a single soap (even one with a longer cutoff like 5 or 6), and definitely more accurate than a single soap with cutoff=8.
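If the command line is assembled from a script, keeping the shared parameters identical for both descriptors can be done along these lines. This is a sketch under assumptions: the helper name is invented, only the parameters named in this comment are included, and the parameters elided in the example above still have to be supplied by you.

```python
def build_gap_string(descriptors):
    """Join (name, params) descriptor specs into a gap={...} argument."""
    parts = []
    for name, params in descriptors:
        fields = " ".join(f"{key}={value}" for key, value in params.items())
        parts.append(f"{name} {fields}")
    return "gap={" + " : ".join(parts) + "}"


# Shared by both descriptors; the elided parameters must be added here too.
shared = {"n_max": 10}

short_range = {"cutoff": 4, "cutoff_transition_width": 1, "atom_sigma": 0.5, **shared}
long_range = {"cutoff": 8, "cutoff_transition_width": 2, "atom_sigma": 1, **shared}

gap_argument = build_gap_string([("soap", short_range), ("soap", long_range)])
print(gap_argument)
```

Putting the common settings in one dictionary makes it harder for the two descriptors to drift apart when a value is changed later.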

bpfrd commented 4 years ago

Thanks for the great help. We will remove the nonphysical points from the data and fit the data with your suggested parameters. We will keep you informed with the results and errors. Best regards Behnam