fail to train ivampnets due to the size of tensor do not match

Toolxx commented 1 year ago

Hi, I got an error while training iVAMPNets with my personal dataset. I followed your tutorial almost exactly (https://github.com/markovmodel/ivampnets/blob/main/SynaptotagminC2A.ipynb), expected the data stride of my dataset equal to 1. However, I got this error. As I check the dimensions of the data I have, it is 21420. Because I basically followed the tutorial, and it can run smoothly in the original example of the paper, I don't quite understand what went wrong. Could you please tell me where this problem might be? Or which part of the object/variable do I need to check? I would be very grateful if you could provide some insight.

RuntimeError Traceback (most recent call last) ~\AppData\Local\Temp\ipykernel_146024\349367218.py in 2 # Here, the mask should be well trained, so we disable the mask training from here on. 3 mask.noise=0. ----> 4 model = ivampnet.fit(loader_train, n_epochs=500, validation_loader=loader_val, mask=False, lam_decomp=100., 5 lam_trace=0., start_mask=0, end_trace=0, tb_writer=writer, clip=False).fetch_model()

d:\PythonTest\ivampnets.py in fit(self, data_loader, n_epochs, validation_loader, mask, lam_decomp, lam_trace, start_mask, end_trace, train_score_callback, validation_score_callback, tb_writer, reset_step, clip, save_criteria, **kwargs) 1137 lam_trace = 0. 1138 for batch_0, batch_t in data_loader: -> 1139 self.partial_fit((batch_0, batch_t), lam_decomp=lam_decomp, mask=train_mask, 1140 lam_trace=lam_trace, 1141 train_score_callback=train_score_callback, tb_writer=tb_writer,

d:\PythonTest\ivampnets.py in partial_fit(self, data, lam_decomp, mask, lam_trace, train_score_callback, tb_writer, clip) 973 if mask: 974 self.optimizer_mask.zero_grad() --> 975 chi_t_list = self.forward(batch_0) # returns list of feature vectors 976 chi_tau_list = self.forward(batch_t) 977 # Estimate all individual scores and singular functions

d:\PythonTest\ivampnets.py in forward(self, data) 896 if data.get_device(): 897 data = data.to(device=self.device) ... --> 210 masked_x = x[:,:,None] * alpha 211 if self.noise>0.: 212 max_attention_value = torch.max(alpha, dim=1, keepdim=True)[0].detach()

RuntimeError: The size of tensor a (21420) must match the size of tensor b (21321) at non-singleton dimension 1

amardt commented 1 year ago

Hi and happy new year! I am excited that you are trying out our method! In order to help you, it would be really good to know, how you have prepared the data? The mask, where the shape error occurs, assumes very specific inputs. A first thing to check: When you create the mask, it will print out the number of residues it assumes your system have. Is this the correct number in your case? Perhaps you can write here

how many residues your system has
Are you skipping residues, when estimting the distances. If so, how many?
Are you leaving out residues at the end of the chain? This way we can verify, how you end up with 21420 input dimensions, and perhaps also deduce, why the mask assumes less. I hope this way we can find the cause of the error. Thank you for your patience. I am confident that we can fix it easily. As a side note. To me it looks like you have a very large number of input features. Perhaps it would be wise, to actually only estimate distances between pairs of residues, e.g. This way you will keep the number of parameters small. Another way would be to change the architecture of the network after the mask to something more sophisticated, e.g. a graph neural network. Secondly, have you first tried to build a vanilla VAMPnet? Did this work? Or are you running into problems? Also there it could be beneficial to change the architecture to a graph NN to decrease the network size...

Feel free to ask any questions if something is still unclear. I am happy to be of any help

Best,

Andreas

Toolxx commented 1 year ago

Hi, Thanks for your help. I have been away from work for a short time for some reasons and I hope you can continue to help me. I checked the points you mentioned. For the data preparation process, I use the position feature instead of minimum distance between residues as the system is relatively large. The code is here:

# file paths
syt_files = 'protein_xtc/protein.xtc'
topfile = 'protein_xtc/protein.pdb'
outfile = 'protein_xtc/protein_stride1.hdf5'
# define pyemma featurizer
feat = pyemma.coordinates.featurizer(topfile)
feat.add_selection(feat.select_Backbone())
print(feat.dimension())
# create iterator
data_source = pyemma.coordinates.source(syt_files, feat)
### process data with featurizer and write to disk

# note that stride parameter here must be multiplied by the stride on the
# trajectories that we're loading (which is 10),
# i.e., loading with stride 10 here is a total stride of 100. compare `outfile`

it = data_source.iterator(stride=1, chunk=1000)

with h5py.File(outfile, "w") as f:
    last_trajid = -1
    for trajid, chunk in tqdm(it, total=it.n_chunks):

        if last_trajid < trajid:
            if last_trajid != -1:
                dset.flush()
            dset = f.create_dataset(syt_files[trajid].split('/')[-1], 
                                    shape=(it.trajectory_length(), feat.dimension()), 
                                    dtype=np.float32)
            start = 0
            last_trajid = trajid
        dset[it.pos:it.pos + it.chunksize if it.pos + it.chunksize < it.trajectory_length() else None] = chunk
        start += 1

The problems are likely to come from the number of residues and the corresponding masks. I haven't found the code needed to query the number of residues in pyemma/mdtraj, could you please provide some guidance? For the mask part, the hyperparameters are

# data preparation
# how many residues are skipped for distance calculation
skip_res = 1
# Size of the windows for attention mechanism
patchsize = 8
# How many residues are skipped before defining a new window
skip_over = 1

and it returns

cuda:0
Number of residues is: 207
Mask_proteins(
  (normalizer): Mean_std_layer()
  (list_weights): ParameterList(
      (0): Parameter containing: [torch.float32 of size 1x214x1 (GPU 0)]
      (1): Parameter containing: [torch.float32 of size 1x214x1 (GPU 0)]
  )
)
[Sequential(
  (Layer_input): Linear(in_features=21420, out_features=500, bias=True)
  (Elu_input): ELU(alpha=1.0)
  (Layer0): Linear(in_features=500, out_features=500, bias=True)
  (Elu0): ELU(alpha=1.0)
  (Layer1): Linear(in_features=500, out_features=500, bias=True)
  (Elu1): ELU(alpha=1.0)
  (Layer2): Linear(in_features=500, out_features=500, bias=True)
  (Elu2): ELU(alpha=1.0)
  (Layer3): Linear(in_features=500, out_features=500, bias=True)
  (Elu3): ELU(alpha=1.0)
  (Layer_output): Linear(in_features=500, out_features=8, bias=True)
  (Softmax): Softmax(dim=1)
), Sequential(
  (Layer_input): Linear(in_features=21420, out_features=500, bias=True)
  (Elu_input): ELU(alpha=1.0)
...
  (Elu3): ELU(alpha=1.0)
  (Layer_output): Linear(in_features=500, out_features=8, bias=True)
  (Softmax): Softmax(dim=1)
)]

Sorry I am still a little unclear about the data accessing and manipulation. Can the above information reveal some possible problems? If it is convenient, can you give some suggestions? I will keep working to find the problem. Thanks

amardt commented 1 year ago

Hi, sorry for the long silence, but this message slipped somehow my attention. So the problem is that the Mask object we provide in this scripts here, assumes that you are using distances between residues. This means, you cannot simply switch to real coordinates features, I am afraid. So you would have 2 options:

Go with the coordinates features and we have to create a new Mask layer for that purpose. I am happy to do that with you.
Stick to residue distances, but you would combine several residues into one node (or super-residue, if you like). Thereby, you reduce the number of features significantly.

I guess you would tend to number of 1, but I want to raise a real advantage of option 2. By taking internal coordinates (distances over coordinates), the network is invariant to shifts and rotations of your system, which is really helpful for NN training in general. In the case of coordinates, you need to align your structures to some reference, but even then the NN needs to learn from the data to distinguish real internal movement from overall shifts/rotations. This introduces a lot of noise for the NN and makes it a lot less data efficient.

In summary, my recommendation is to work again with residue distances, but combine 2/3/4 residues into one and estimate the smallest distances to the other groups. Thereby, you will be more data efficient (and for such a large system you need to be data efficient).

I know this is a very late response, but perhaps still helpful. Best, Andreas

Toolxx commented 1 year ago

Hi, Thanks a lot for your help. It is very helpful for me! I will spend some time working to try out the ideas you mentioned, and I will come here to ask you for advice or some help if necessary.

Toolxx commented 1 year ago

Hi,

I would like to ask another question about model validation. In the pipeline of the previous Markov state model (as well as VAMPnets), the general model validation part will have a Chapman-Kolmogorov test. But there is no similar verification step in your iVAMPnets study. I would like to ask is there any reason? For example, is it not necessary to do this validation when the implied timescale converges and the VAMPE score is high? Or is it better to do a Chapman-Kolmogorov test for each subsystem?

Thanks

Toolxx commented 1 year ago

Hi,

After the last message, I listen to your advice and stick to residue distances to build an iVAMPnet for an attempt. Meanwhile, I check the VAMP score of the feature space. As I choose a relatively large number of "excluded_neighbors", the number of dimensions is feasible for the computation (around 20k-35k). And the VAMP score of feature space is well perform than coordinate/angle feature. It provides me a chance to build a model directly.

When I work like this, the results are not ideal. I remembered your suggestion from the latest message -- use residue distances feature and combine several residues into one node. I think it might be more data efficient (with better information) and provide a better input to build a better iVAMPnet model. However, when I want to make this work, my progress gets stuck. I am not familiar with data manipulation, could you please provide some specific suggestions? For example, is there any specific packages (or code) available to complete this task of merging a few residues distances to one? Can it be solved with functions in MDTraj or pyemma?

Thank you for any assistance

amardt commented 1 year ago

Hi,

so let me first try to be more specific. But if you get stuck again, I am happy help out with actual code. So the easiest way to make pairs of residues is by first estimating the distances as in the example script. Now in an after process we will combine this information of several residues. We want to pair now residues which are neighbors in the sequence together. Right now each residue has a distance to all the other residues. Lets take an example residue 25 and 26 have all the distances to all the other residues. Now we want to pair these. We therefore want to take the distances to another pair of residues, e.g. Residue 29 and 30 which are 4, and now only keep the distance which is the smallest between the two pairs. Thereby we reduced the number of distances by a factor of 4 ;-) The important functions you need are: https://mdtraj.org/1.9.4/api/generated/mdtraj.compute_contacts.html#mdtraj.compute_contacts https://mdtraj.org/1.9.4/api/generated/mdtraj.geometry.squareform.html The second transforms the distances so you have for all the residues all distances. (Thereby you actually have double entries there) Given this matrix you need to downsample this matrix with a min pool with step/skip size as large as how many residues you want to combine in a super residue ;-) I hope the explanation helps Otherwise, be free to shoot more questions.

Toolxx commented 1 year ago

Hi,

Thanks for your help. I am trying to implement this function now.

Another problem is about yesterday's message (you may have inadvertently ignored it because I sent two messages in a row), which is about model validation. Could you briefly illustrate why does not the CK test appear in your research articles? What I want to know is whether the CK test of iVAMPnet/IMD is as necessary as the traditional pipeline. Thanks.

And for the previous question, I have produced the contact map and do down sampling. Now I have a contact map with the nodes I need.

t = mdtraj.load_pdb(topfile)

pairs = feat.pairs(np.arange(0, feat.topology.n_residues - 0), excluded_neighbors=2000)

contacts = mdtraj.compute_contacts(t, pairs)

print(np.shape(contacts[0]),np.shape(contacts[1]))

(1, 72010) (72010, 2)

contact_map = mdtraj.geometry.squareform(contacts[0],contacts[1])
print(np.shape(contact_map))

(1, 2380, 2380)

contact_map_reduced = contact_map[:,::8,::8]

print(np.shape(contact_map_reduced))

(1, 298, 298)

However, I am temporarily no ideas on what to do next. How do I encode the corresponding pair into the feature in pyemma's feature. Any suggestion? Thanks

markovmodel / ivampnets

fail to train ivampnets due to the size of tensor do not match #2