idptools / parrot

Python package for protein sequence-based bidirectional recurrent neural network. Generalizable to a variety of protein bioinformatic applications.
MIT License

3D tertiary protein data #13

Closed H1889 closed 7 months ago

H1889 commented 7 months ago

First of all, thank you for making this fantastic tool. I wonder whether PARROT can handle 3D structure data for proteins. For instance, could you enter the three alpha-carbon coordinates for each residue? The manual says that each amino acid of each sequence has only one value associated with it. Would it be possible to enter more than one value (in this case, three coordinates) for each amino acid? Something like this:

Seq1 MTRWET... 3,2,1 4,3,2 3,4,1 ....

or like this:

Seq1_x MTRWET... 3 4 3 ....
Seq1_y MTRWET... 2 3 4 ....
Seq1_z MTRWET... 1 2 1 ....

Thanks and greetings

degriffith commented 7 months ago

Hello and thank you,

The short answer is that no, that is currently not possible in PARROT. However, there are possible work-arounds by either making a few modifications to the PARROT source code or by reformulating your problem in a way that PARROT can handle.

As PARROT is currently implemented, it can only handle a single value per residue, though this was a design decision we made and not an inherent constraint of PyTorch LSTM models. One could modify PARROT's source code to alter the expected shape of the input and output data and how the model calculates the loss function. This would entail substantial changes to the files process_input_data.py, brnn_architecture.py and train_network.py, but it is technically feasible.
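To give a rough idea of what "altering the expected shape of the output" means, here is a minimal standalone PyTorch sketch. It is not PARROT's code: the class name `MultiValueBRNN`, the layer sizes, and the random tensors are all made up for illustration. It only shows that a bidirectional LSTM can emit three values per residue and that a mean-squared-error loss can be computed over all of them.

```python
import torch
import torch.nn as nn


class MultiValueBRNN(nn.Module):
    """Hypothetical bidirectional LSTM that predicts several values per residue."""

    def __init__(self, input_size=20, hidden_size=16, num_layers=1,
                 values_per_residue=3):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence 2x.
        self.fc = nn.Linear(2 * hidden_size, values_per_residue)

    def forward(self, x):
        out, _ = self.lstm(x)   # (batch, seq_len, 2 * hidden_size)
        return self.fc(out)     # (batch, seq_len, values_per_residue)


model = MultiValueBRNN()

# Random stand-ins: 4 sequences of 50 residues, 20 features per residue
# (e.g. a one-hot amino acid encoding), and 3 target values per residue.
x = torch.randn(4, 50, 20)
target = torch.randn(4, 50, 3)

pred = model(x)
loss = nn.MSELoss()(pred, target)  # averaged over every residue and value
loss.backward()
```

The corresponding changes inside PARROT would also have to cover data parsing and batching, which this sketch leaves out entirely.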

An alternative solution would be to train separate models for each of your per-residue variables of interest (in your example, one model for x-coords, one for y-coords, one for z-coords). I don't think this would work very well, because X, Y and Z coordinates are not independent of one another and you would lose a lot of information this way. Alternatively, you could try to reformulate your problem away from Cartesian space and train a model to predict Ramachandran angles instead.
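For the Ramachandran reformulation, the per-residue targets would be backbone dihedral angles rather than coordinates. Below is a small pure-Python sketch of how phi and psi could be computed, assuming you have N, CA and C coordinates for every residue (alpha carbons alone are not enough). The function names and the input layout are hypothetical and not part of PARROT.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _scale(a, s):
    return tuple(x * s for x in a)


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))  # unit vector along the axis
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def phi_psi(backbone):
    """backbone: list of (N, CA, C) coordinate tuples, one entry per residue.

    Returns one (phi, psi) pair per residue; phi of the first residue and
    psi of the last residue are undefined and returned as None.
    """
    angles = []
    last = len(backbone) - 1
    for i, (n, ca, c) in enumerate(backbone):
        phi = dihedral_degrees(backbone[i - 1][2], n, ca, c) if i > 0 else None
        psi = dihedral_degrees(n, ca, c, backbone[i + 1][0]) if i < last else None
        angles.append((phi, psi))
    return angles
```

Since PARROT takes one value per residue, phi and psi would still each need their own dataset, and because angles wrap around at ±180 degrees, a plain regression loss may treat nearby angles as far apart.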

All that being said, if you are ultimately interested in making a model that predicts structure from primary sequence, I don't think PARROT is your best choice.

H1889 commented 7 months ago

thanks for the answer, that's what I suspected. Regards