marrink-lab / polyply_1.0

Generate input parameters and coordinates for atomistic and coarse-grained simulations of polymers, ssDNA, and carbohydrates
Apache License 2.0
119 stars 21 forks source link

Non-homogenous branched polymer building #367

Closed Jonifrc312 closed 2 months ago

Jonifrc312 commented 2 months ago

Hello @fgrunewald,

Firstly, thank you for the tool you've created, it has helped me and other's a lot and you have clearly done a wonderful job.

I've created an .ff file defining all the different residues that compose this polymer and the links between them. I just can't seem to create the graph sequence that outlines how they are connected using the gen_seq function.

In this case, we have 2 residues, A and B, alternating with a branch occurring only at residue B. This polymer can be n dimers long but the branch is always residue's CDE long.

      A --- B --- A --- B --- A --- B --- A --- B --- ...
            |           |           |           |
            C           C           C           C
            |           |           |           |
            D           D           D           D
            |           |           |           |
            E           E           E           E

Can you help me create a sequence.json that outlines this polymer conformation using gen_seq?

fgrunewald commented 2 months ago

Hi @Jonifrc312,

Glad to hear that you like polyply. The sequence you intend to use is indeed a bit awkward to generate with gen_seq, which is why we are working on replacing it with a new notation (CGSmiles) that allows describing these polymers using a string. You can already use it as follows:

  1. Download and install cgsmiles:
    pip install git+https://github.com/gruenewald-lab/CGsmiles.git
  2. Use the following custom script to generate the residue graph. In the line cgsmiles_str adjust the number after the '|' to change the number of residues. Please note this script also plots the residue graph using matplotlib.
import json
import networkx as nx
from networkx.readwrite import json_graph
import matplotlib.pyplot as plt
from cgsmiles.resolve import MoleculeResolver

# in you case the final cgsmiles string is
cgsmiles_str = "{[#repeat]|10}.{#repeat=[<][#A][#B][>]([#C][#D][#E])}"

# next read the cgsmiles string and generate a networkx graph
_, graph = MoleculeResolver(cgsmiles_str, all_atom=False).resolve()

# now we need to convert the fragname to resname and set the
# residue ids
fragnames = nx.get_node_attributes(graph, 'atomname')
nx.set_node_attributes(graph, fragnames, 'resname')
resids = {node:node+1 for node in graph.nodes}
nx.set_node_attributes(graph, resids, 'resid')

# for checking let's draw the graph
name_to_color = {'A': 'tab:blue', 'B': 'tab:blue', 'C': 'tab:red', 'D': 'tab:red', 'E': 'tab:red',}
colors = [name_to_color[graph.nodes[node]['atomname']] for node in graph.nodes]
nx.draw_networkx(graph, with_labels=True, labels=fragnames, node_color=colors, pos=nx.kamada_kawai_layout(graph))
plt.show()

# finally we save it to the .json file
g_json = json_graph.node_link_data(graph)
with open("custom_graph_seq.json", "w") as file_handle:
    json.dump(g_json, file_handle, indent=2)

Please let me know if this works for you and/or if you have any further questions.

fgrunewald commented 2 months ago

@Jonifrc312 please note that I adjusted the code. I wasn't paying enough attention this morning to the sequence. To get different length you need to change the 10 in this line for the number of repeat units:

cgsmiles_str = "{[#repeat]|10}.{#repeat=[<][#A][#B][>]([#C][#D][#E])}"
Jonifrc312 commented 2 months ago

Hello @fgrunewald

Thank you for you quick and extensive reply! The code works wonders and I was easily able to apply it to my situation. Now I seem to have a different problem where:

If i use "polyply gen_itp -f file.ff -seqf graph.json" no links are being applied to the molecule

What is weird about this is if I simply create a dimer with AB-CDE all links are applied correctly (both bonds and angles) "polyply gen_itp -f file.ff -seq A:1 B:1 C:1 D:1 E:1"

Do you have any ideia of what might be occuring?

fgrunewald commented 2 months ago

@Jonifrc312 glad to hear!

You probably used + and ++ as prefix in the links; Since your resids are non-continous anymore you need to change to > >> so just replace each plus by a larger sign. If that does not work can you provide a sample of your ff file

Jonifrc312 commented 2 months ago

Hello again! Sorry for the delay, we've been trying to fix the .ff file but to no avail. There also seems to be a problem with the way CGsmiles handles repetitions:

Let's say we wanted to create the graph for

A1 --- B1 --- A2 --- B2
       |             |
       C1            C2
       |             |
       D1            D2
       |             | 
       E1            E2 

From my understanding of the previous comment the following CGsmile string would create it: 
cgsmiles_str = "{[#repeat]|2}.{#repeat=[<][#A][#B][>]([#C][#D][#E])}"

However, what actually happens is that the bond between group 1 and group 2, according to the graph created, occurs between A1 and B2, and not B1 and A2. This isn't clear when looking at the drawn network but becomes clear when listing all "graph.edges".

We tried to rearrange the CGsmiles but we couldn't make it work using the "{[#repeat]|2}" nomenclature. We did fix it by using the very funny looking CGstring:

cgsmiles_str = "{[#repeat]|1}.{#repeat=[<][#A][#B][>]([#C][#D][#E])[<][#A][#B][>]([#C][#D][#E])}"

Which essentially just does the repeating manually, and creates the desired graph with a bond between B1 and A2.

About the use of > and >> I have the following questions:

  1. If I'm describing the link between A1 and B1, or B1 and C1, I would use the + as they are continuous.
  2. If I'm describing the link between B1 and A2, I would use > as they are the only non-continuous residues, right?
  3. When using >, does the number of > scale with the distance of residues? So for example, between B1 and A2, would I use 4 (B >>>>A) or just 2 (B >>A)?
fgrunewald commented 2 months ago

Hi @Jonifrc312,

I see! This problem is more a polyply than cgsmiles problem but the solution is quite easy: Flip the A and B in the cgsmiles string and use:

"{[#repeat]|2}.{#repeat=[<][#B][#A][>]([#C][#D][#E])}"

Then your B1 will connect to A2. In the end polyply should be smart enough to give the same answer for both graphs as they are fully isomorphic.

By the way the other option if you wanted to do this manually would have been:

from cgsmiles import read_cgsmiles

cgsmiles_str = "{[#A][#B]([#C][#D][#E])[#A][#B]([#C][#D][#E])[#A][#B]([#C][#D][#E])}"
graph = read_cgsmiles(cgsmiles_str)
... rest of the script

Once CGSmiles is ready we'll put this all into a nice documentation but for now I've to apologize for this kind of "documentation".

To answer your connector questions:

1) Yes you can, but you can (optionally) also use the '>'. Let's say good practice is using the '+' 2) Same answer as above but good practice is using '>' 3) No for all practical purposes > means larger resid and >> means larger resid but different and larger than the one for >

I hope that helps!

P.S We're keeping some track of the people using polyply for funding purposes, would you mind telling me what University/Company you're with?

Jonifrc312 commented 2 months ago

Hello @fgrunewald

After testing a lot of combinations of .ff settings, graph sequences and looking into polyply code, we've been able to fix the problem and now everything is running correctly!

Essentially, the graph that was being given to polyply had an attribute called atomname, instead of resname (A,B,C,etc). This would lead to polyply overwriting all atomnames in the MetaMolecule Class graph with the provided atomnames, instead of those outlined in the .ff file. Consequently, when polyply would try and apply links to the molecule, it wasn't able to make matches due to the difference in the atomnames in the Link Class, which come from .ff file (I believe), and those in the MetaMolecule Class, which where overwritten by the graph.

Apologies if I misinterpreted some code, but I think that was what was happening.

To fix this, we simply had to make sure that atomname wasn't an attribute of the nodes. In the end, the graph just needed two attributes, resname and resid

To give an example of the code that worked:

from cgsmiles import read_cgsmiles

cgsmiles_str = "{[#A][#B]([#C][#D][#E])[#A][#B]([#C][#D][#E])[#A][#B]([#C][#D][#E])}"
graph = read_cgsmiles(cgsmiles_str)

#Delete node attribute "fragname" and swap it with "resnames"
resnames = nx.get_node_attributes(graph, 'fragname')
nx.set_node_attributes(graph, resnames, 'resname')

for node in graph.nodes:
    del graph.nodes[node]["fragname"]

#Create node attribute "resid"
resids = {node:node+1 for node in graph.nodes}
nx.set_node_attributes(graph, resids, 'resid')

#check nodes
print(graph.nodes(data=True))  ↴

[(0, {'resname': 'A', 'resid': 1}), (1, {'resname': 'B', 'resid': 2}), (2, {'resname': 'C', 'resid': 3}), ...]

#check edges
print([(graph.nodes[edge[0]]["resname"],  graph.nodes[edge[1]]["resname"] ) for edge in graph.edges])  ↴

[('A', 'B'), ('B', 'C'), ('B', 'A'), ('C', 'D'), ('D', 'E'), ('A', 'B'), ('B', 'C'), ('B', 'A'), ('C', 'D'), ('D', 'E'), ('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E')]

Thanks again for the all help and the tool you've developed! Finally, to answer your question: I'm currently doing my Master Thesis at ITQB NOVA in Lisbon in @MeloLab under the supervision of @mnmelo .

fgrunewald commented 2 months ago

@Jonifrc312 glad to hear and good catch! Of course the atomnames have to match and might get overwritten in this case. I wasn't considering this issue. I'll keep it in mind for the full implementation.

Just FYI when constructing coordinates you might want to have a look into providing a [ bending ] directive to keep parts of your polymer stiffer if needed. Say hi to Manel! Let me know if you have any more questions.