Open wrmartin opened 2 years ago
Hi, thank you for reaching out and for already giving us so much background information! The output size should not cause the problem; to me it looks like a training issue. You said this happens after some epochs, so I assume it works for a few. Can you perhaps show us a plot of the training/validation score so far in the training process? My assumption right now is that the network collapses to a trivial solution, which then causes the eigensolver to fail, because you will have one singular value equal to 1 and the rest 0. But to know for sure, the mentioned plot would be very helpful. In principle, to ease the training you can reduce the noise variable in the beginning and/or start with a smaller lam_decomp value in the fit routine, which regulates how strongly you enforce the independence constraint.
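To illustrate the failure mode described above with a small sketch (hypothetical numbers, NumPy instead of the torch code in ivampnets.py): if the network output collapses so that every frame is assigned to the same state, the covariance-style estimate of the Koopman matrix becomes rank-1, its remaining singular values are all (repeated) zeros, and the SVD is ill-conditioned.

```python
import numpy as np

# Hypothetical collapsed network output: 1000 frames, 5 output states,
# but every frame gets the same one-hot assignment.
chi = np.zeros((1000, 5))
chi[:, 0] = 1.0

# Crude covariance-style estimate (not the exact iVAMPnets formula):
# the resulting matrix is rank-1.
K_collapsed = chi.T @ chi / len(chi)

s = np.linalg.svd(K_collapsed, compute_uv=False)
print(s)  # one singular value of 1, the remaining four exactly 0
```

The four repeated zero singular values are exactly the degenerate situation that can make a GPU SVD routine fail to converge.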
Looking forward to hearing back from you!
Best, Andreas
Hi Andreas,
I seem to have fixed it, though I'm not sure the solution is correct. I tried changing the layer width, thinking the issue was the comparatively large input dimension. Increasing the width resulted in even faster failures, so I tried a layer width of 50 and haven't had any failures since, though only through two "steps" so far (initial mask noise and noise=5). I'm running these on an HPC, so I don't have the plot for the run above, but I did a few runs using try/except before. The parameters are the same, just a different set of data for this one. That one failed at epoch 65/100.
So I've run into the same error again, but I can't sort why. Here is what I've done:
Parameters:
```python
output_sizes = [5, 5]
batch_size = 10000
valid_ratio = 0.30
test_ratio = 0.0001
network_depth = 4
layer_width = 50
nodes = [layer_width] * network_depth
skip_res = 6
patchsize = 8
skip_over = 4
factor_fake = 2.
noise = 2.
cutoff = 0.9
learning_rate = 0.0005
epsilon = 1e-6
score_mode = 'regularize'
```
Workflow:
```python
# step 1
model = ivampnet.fit(loader_train, n_epochs=50, validation_loader=loader_val, mask=True,
                     lam_decomp=20., lam_trace=1., start_mask=0, end_trace=20,
                     tb_writer=writer, clip=False).fetch_model()
plot_mask(mask, skip=10, step=1)
plot_protein_mask(mask, skip_start=4, step=1)
ivampnet.save_params('params1')

# step 2
mask.noise = 5.
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=True,
                     lam_decomp=50., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
plot_mask(mask, skip=10, step=2)
plot_protein_mask(mask, skip_start=4, step=2)
ivampnet.save_params('params2')

# step 3
mask.noise = 10.
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=True,
                     lam_decomp=100., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
plot_mask(mask, skip=10, step=3)
plot_protein_mask(mask, skip_start=4, step=3)
ivampnet.save_params('params3')

# step 4
mask.noise = 0.
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=False,
                     lam_decomp=100., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
ivampnet.save_params('params4')

# step 5
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=False,
                     lam_decomp=50., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
ivampnet.save_params('params5')

# step 6
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=False,
                     lam_decomp=0., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False, save_criteria=0.012).fetch_model()
ivampnet.save_params('params6')
```
I'm running multiple variants; some are still running, and a few have finished successfully. I can't work out why this one failed. I received the same error as before, in the 6th "step" of the workflow. I'm attaching the VAMPE trace from TensorBoard as well as the Pen_scores plot (MR from the manuscript, I think?). C00, C11, and C01 don't show the same uptick near the end that Pen_scores does, so I'm not sure whether it's relevant.
Thank you for any assistance!
Hi, you are correct: MR in the manuscript is pen_scores in the notebook. Based on your observation that reducing the layer width helps to avoid the error, I would suggest regularizing the network more. There are 3 ways to achieve that:
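As a generic illustration (my own example, not necessarily one of the three options referred to above): one standard way to regularize a network is weight decay, i.e. an L2 penalty that shrinks the weights a little on every update. A minimal NumPy sketch of a single SGD step with weight decay:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.0005, weight_decay=1e-4):
    """One SGD update with L2 weight decay: w <- w - lr * (grad + wd * w)."""
    return w - lr * (grad + weight_decay * w)

# Even with a zero gradient, the weights shrink slightly toward zero.
w = np.ones(4)
w = sgd_step_with_weight_decay(w, grad=np.zeros(4))
print(w)
```

In PyTorch the same effect is available through the `weight_decay` argument of optimizers such as `torch.optim.Adam`.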
Would love to hear if these suggestions help you!
Best
Thank you so much! Would using a GNN substantially change the parameter set? I hope that using ~25-30 nearest neighbors would be sufficient for constructing the dataset, but I don't know how that would impact some of the new parameters compared to "traditional" VAMPnets.
Hey, since GNNs share parameters across the different nodes (you convolve over them), the number of parameters can be reduced substantially while the network can still represent very nonlinear, complex functions. There are now several great works on equivariant/invariant GNNs for molecules out there worth exploring.
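A back-of-the-envelope illustration of the parameter-sharing point (hypothetical sizes, not the actual iVAMPnets architecture): a dense layer over the flattened per-residue input grows with the number of residues, while a shared message-passing weight matrix is applied at every node and is independent of the system size.

```python
# Hypothetical sizes: 293 residues, 16 features per residue, hidden width 50.
n_res, f_in, hidden = 293, 16, 50

# Dense layer on the flattened input: (n_res * f_in) x hidden weights.
dense_params = n_res * f_in * hidden

# Shared GNN update weights: the same f_in x hidden matrix is reused
# at every node, so the count does not depend on n_res.
gnn_params = f_in * hidden

print(dense_params, gnn_params)  # 234400 vs 800
```

The roughly 300x reduction is why a GNN can stay expressive without the huge first layer that a flattened 41328-feature input would otherwise require.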
Hello,
I am a novice at using ML techniques like this, so forgive me for the simplistic question. I have been trying to apply the parameters given in your Jupyter notebooks to my system and am running into some issues. For background: I am running on a dataset of 690 1 µs simulations of a 299 aa protein (3 aa trimmed per terminus), so my input feature count is 41328. I am testing for a proper output space, but generally receive the following after a varying number of epochs:
```
Traceback (most recent call last):
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/testing.py", line 178, in <module>
    model = ivampnet.fit(loader_train, n_epochs=epochs, validation_loader=loader_val, mask=True, lam_decomp=20., lam_trace=1., start_mask=0, end_trace=20, tb_writer=writer, clip=False).fetch_model()
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 1140, in fit
    self.partial_fit((batch_0, batch_t), lam_decomp=lam_decomp, mask=train_mask,
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 979, in partial_fit
    scores_single, S_single, u_single, v_single, trace_single = score_all_systems(chi_t_list, chi_tau_list,
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 407, in score_all_systems
    score_i, S_i, u_i, v_i, trace_i = VAMPE_score(chi_i_t, chi_i_tau, epsilon=epsilon, mode=mode)
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 247, in VAMPE_score
    a, sing_values, b = torch.svd(K, compute_uv=True)
RuntimeError: svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 5).
```
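One generic mitigation for this kind of SVD convergence failure (my own sketch in NumPy, not part of ivampnets.py, which calls `torch.svd` on the GPU) is to try the plain decomposition first and, on failure, retry with a growing diagonal jitter that improves the conditioning:

```python
import numpy as np

def robust_svd(K, epsilon=1e-6, max_tries=3):
    """Try a plain SVD first; on failure, retry with growing diagonal jitter."""
    jitter = 0.0
    for _ in range(max_tries):
        try:
            return np.linalg.svd(K + jitter * np.eye(len(K)))
        except np.linalg.LinAlgError:
            # Grow the jitter: epsilon on the first retry, then 10x each time.
            jitter = epsilon if jitter == 0.0 else jitter * 10
    raise RuntimeError("SVD did not converge even with jitter")

# Well-conditioned example: the plain SVD succeeds on the first try.
u, s, vt = robust_svd(np.eye(4))
print(s)  # all singular values equal to 1
```

The jitter slightly biases the result, so it is only a workaround; if the matrix is ill-conditioned because the network has collapsed, fixing the training (noise, lam_decomp, regularization) is the real remedy.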
This particular test uses an output size of [4, 4]. My assumption is that this stems from my output size, but I'm not certain. My parameter list:
```python
batch_size = 1000
valid_ratio = 0.15
test_ratio = 0.0001
network_depth = 4
layer_width = 100
nodes = [layer_width] * network_depth
skip_res = 6
patchsize = 8
skip_over = 4
factor_fake = 2.
noise = 2.
cutoff = 0.9
learning_rate = 0.0005
epsilon = 1e-6
score_mode = 'regularize'  # one of ('trunc', 'regularize', 'clamp', 'old')
```