markovmodel / ivampnets


Convergence failure in VAMPE_score #1

Open · wrmartin opened this issue 1 year ago

wrmartin commented 1 year ago

Hello,

I am a novice at using ML techniques like this, so forgive me for the simplistic question. I have been trying to apply the parameters given in your Jupyter notebooks to my system and am running into some issues. For background, my dataset is 690 simulations of 1 us each on a 299 aa protein (3 aa trimmed per terminus), so my input feature count is 41328. I am testing for a proper output space, but generally receive the following after a varying number of epochs:

```
Traceback (most recent call last):
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/testing.py", line 178, in <module>
    model = ivampnet.fit(loader_train, n_epochs=epochs, validation_loader=loader_val, mask=True, lam_decomp=20., lam_trace=1., start_mask=0, end_trace=20, tb_writer=writer, clip=False).fetch_model()
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 1140, in fit
    self.partial_fit((batch_0, batch_t), lam_decomp=lam_decomp, mask=train_mask,
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 979, in partial_fit
    scores_single, S_single, u_single, v_single, trace_single = score_all_systems(chi_t_list, chi_tau_list,
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 407, in score_all_systems
    score_i, S_i, u_i, v_i, trace_i = VAMPE_score(chi_i_t, chi_i_tau, epsilon=epsilon, mode=mode)
  File "/gpfs/u/scratch/CLVL/CLVLwrtn/working/08.07/61/4_4_0_test1/ivampnets.py", line 247, in VAMPE_score
    a, sing_values, b = torch.svd(K, compute_uv=True)
RuntimeError: svd_cuda: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 5).
```

This particular test uses an output size of [4, 4]. My assumption is that the output size is the cause, but I'm not certain.

My parameter list:

```python
batch_size = 1000
valid_ratio = 0.15
test_ratio = 0.0001
network_depth = 4
layer_width = 100
nodes = [layer_width]*network_depth
skip_res = 6
patchsize = 8
skip_over = 4
factor_fake = 2.
noise = 2.
cutoff = 0.9
learning_rate = 0.0005
epsilon = 1e-6
score_mode = 'regularize'  # one of ('trunc', 'regularize', 'clamp', 'old')
```

amardt commented 1 year ago

Hi, thank you for reaching out and for already giving us so much background information! The output size should not cause the problem; to me it looks like a training issue. You said this happens after some epochs, so I assume it works for a few. Can you perhaps show us a plot of the training/validation score over the training so far? My current assumption is that the network collapses to a trivial solution, which then causes the eigensolver to fail, because one singular value will be 1 and the rest 0. But to know for sure, the mentioned plot would be very helpful. In principle, to ease the training you can reduce the noise variable in the beginning and/or start with a smaller lam_decomp value in the fit routine, which regulates how strongly the independence constraint is enforced.
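
For concreteness, a minimal sketch of what such a gentler start could look like, reusing the `ivampnet`, `mask`, `loader_train`, `loader_val`, and `writer` objects from your script; the specific `noise` and `lam_decomp` values below are illustrative placeholders, not recommended settings:

```python
# Illustrative warm-up schedule (values are placeholders, not recommendations):
# start with little mask noise and a weak independence penalty, then ramp both up.
mask.noise = 0.5                                 # reduced noise at the beginning
model = ivampnet.fit(loader_train, n_epochs=20, validation_loader=loader_val,
                     mask=True, lam_decomp=5.,   # weaker independence penalty to start
                     lam_trace=1., start_mask=0, end_trace=10,
                     tb_writer=writer, clip=False).fetch_model()

mask.noise = 2.                                  # increase noise once training is stable
model = ivampnet.fit(loader_train, n_epochs=50, validation_loader=loader_val,
                     mask=True, lam_decomp=20.,  # then enforce independence more strongly
                     lam_trace=1., start_mask=0, end_trace=20,
                     tb_writer=writer, clip=False).fetch_model()
```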

Looking forward to hearing back from you!

Best, Andreas

wrmartin commented 1 year ago

Hi Andreas,

I seem to have fixed it, though I'm not sure the solution is correct. I tried changing the layer width, thinking the issue was the comparatively large input dimension. Increasing it resulted in even faster failures, so I tried a layer width of 50 and haven't had any failures since, though only through two "steps" so far (initial mask noise and noise=5). I'm running these on an HPC, so I don't have the plot for the run above, but I did a few runs using try/except before. The parameters are the same, just a different set of data for this one; it failed at epoch 65/100.

[Attached plot: scores1 (training/validation scores)]

wrmartin commented 1 year ago

So I've run into the same error again, but I can't sort out why. Here is what I've done:

Parameters:

```python
output_sizes = [5, 5]
batch_size = 10000
valid_ratio = 0.30
test_ratio = 0.0001
network_depth = 4
layer_width = 50
nodes = [layer_width]*network_depth
skip_res = 6
patchsize = 8
skip_over = 4
factor_fake = 2.
noise = 2.
cutoff = 0.9
learning_rate = 0.0005
epsilon = 1e-6
score_mode = 'regularize'
```

Workflow:

```python
# Step 1
model = ivampnet.fit(loader_train, n_epochs=50, validation_loader=loader_val, mask=True,
                     lam_decomp=20., lam_trace=1., start_mask=0, end_trace=20,
                     tb_writer=writer, clip=False).fetch_model()
plot_mask(mask, skip=10, step=1)
plot_protein_mask(mask, skip_start=4, step=1)
ivampnet.save_params('params1')

# Step 2
mask.noise = 5.
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=True,
                     lam_decomp=50., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
plot_mask(mask, skip=10, step=2)
plot_protein_mask(mask, skip_start=4, step=2)
ivampnet.save_params('params2')

# Step 3
mask.noise = 10.
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=True,
                     lam_decomp=100., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
plot_mask(mask, skip=10, step=3)
plot_protein_mask(mask, skip_start=4, step=3)
ivampnet.save_params('params3')

# Step 4
mask.noise = 0.
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=False,
                     lam_decomp=100., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
ivampnet.save_params('params4')

# Step 5
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=False,
                     lam_decomp=50., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False).fetch_model()
ivampnet.save_params('params5')

# Step 6
model = ivampnet.fit(loader_train, n_epochs=100, validation_loader=loader_val, mask=False,
                     lam_decomp=0., lam_trace=0., start_mask=0, end_trace=0,
                     tb_writer=writer, clip=False, save_criteria=0.012).fetch_model()
ivampnet.save_params('params6')
```

I'm running multiple variants; some are still running and a few have finished successfully, but I can't sort out why this one failed. I received the same error as before, in the 6th "step" of the workflow. I'm attaching the VAMPE trace from TensorBoard as well as the Pen_scores plot (which I think is MR from the manuscript?). C00, C11, and C01 don't show the same uptick near the end that Pen_scores does, so I'm not sure whether it's relevant.

Thank you for any assistance!

[Attached plots: VAMPE and Pen_scores traces from TensorBoard]

amardt commented 1 year ago

Hi, you are correct: MR in the manuscript is pen_scores in the notebook. Based on your observation that reducing the layer width helps avoid the error, I would suggest regularizing the network more. There are three ways to achieve that:

  1. Keep the same feed-forward network, but add regularization schemes such as dropout layers, spectral norm regularization, or simply the AdamW optimizer with weight decay. If you choose this option, I would start with AdamW, where you can then play with the weight-decay strength. Personally, I prefer spectral norm over dropout. (A minimal sketch of both is given after this list.)
  2. Reduce the input features, either by removing distances that are most likely useless (which would require adapting the mask layer accordingly) or by combining two neighboring residues into one instance. The advantage of the latter is that you can use the mask layer as it is (perhaps halving the patchsize and skip_over parameters), but you need to adapt the feature extraction to find the closest distance between pairs of residue groups. In principle you can estimate these from your current input.
  3. Advance the whole method by replacing the feed-forward networks with a more advanced architecture such as graph neural networks; see for example SchNet (https://proceedings.neurips.cc/paper/2017/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf). There the number of weights is kept low by sharing parameters.
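
To illustrate option 1, a minimal sketch of a spectral-norm-regularized feed-forward lobe combined with AdamW, written in plain PyTorch; the layer widths, output size, and weight-decay strength are placeholders, and how the optimizer is passed to the iVAMPnets training loop is not shown here:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# Placeholder feed-forward lobe with spectral-norm regularization on every linear layer.
# Input/output sizes and widths are examples only, not the notebook's exact values.
lobe = nn.Sequential(
    spectral_norm(nn.Linear(41328, 50)), nn.ELU(),
    spectral_norm(nn.Linear(50, 50)), nn.ELU(),
    spectral_norm(nn.Linear(50, 5)), nn.Softmax(dim=1),
)

# AdamW adds decoupled weight decay on top of Adam; the decay strength is the knob to tune.
optimizer = torch.optim.AdamW(lobe.parameters(), lr=5e-4, weight_decay=1e-4)
```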

Would love to hear whether these suggestions help you!

Best

wrmartin commented 1 year ago

Thank you so much! Would using a GNN substantially change the parameter set? I hope that using ~25-30 nearest neighbors would be sufficient for constructing the dataset, but I don't know how that would affect some of the new parameters compared to "traditional" VAMPnets.

amardt commented 1 year ago

Hey, since GNNs share parameters across the different nodes (you convolve over the graph), the number of parameters can be reduced substantially while still representing very nonlinear, complex functions. There are now several great works on equivariant/invariant GNNs for molecules out there that are worth exploring.
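
For illustration only, a minimal sketch of the parameter-sharing idea behind message passing, not SchNet's actual architecture; the feature sizes and the neighbor list below are made-up placeholders:

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One message-passing layer: the same weight matrices are applied to every
    node and edge, so the parameter count does not grow with protein size."""

    def __init__(self, node_dim: int, hidden_dim: int):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * node_dim, hidden_dim), nn.ELU())
        self.update = nn.Sequential(nn.Linear(node_dim + hidden_dim, node_dim), nn.ELU())

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # x: (n_nodes, node_dim) residue features; edges: (n_edges, 2) neighbor index pairs
        src, dst = edges[:, 0], edges[:, 1]
        msg = self.message(torch.cat([x[src], x[dst]], dim=-1))    # per-edge messages
        agg = torch.zeros(x.size(0), msg.size(-1), device=x.device)
        agg.index_add_(0, dst, msg)                                 # sum messages per node
        return self.update(torch.cat([x, agg], dim=-1))             # shared node update

# Usage sketch: 299 residues, 8 made-up features each, ~25 neighbors per residue.
x = torch.randn(299, 8)
edges = torch.randint(0, 299, (299 * 25, 2))   # placeholder neighbor list
out = SimpleMessagePassing(node_dim=8, hidden_dim=32)(x, edges)   # -> (299, 8)
```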