pierrepo / grodecoder

GroDecoder extracts and identifies the molecular components of a structure file (PDB or GRO) issued from a molecular dynamics simulation.
https://grodecoder.streamlit.app/
BSD 3-Clause "New" or "Revised" License

Update `get_graph_fingerprint2()` #40

Closed pierrepo closed 2 months ago

pierrepo commented 2 months ago

In get_graph_fingerprint2():

It would be better to output atom counts ordered by frequency (from the most frequent atom to the least frequent).

Instead of:

https://github.com/pierrepo/grodecoder/blob/515dfaab8bccf6d32933f064c11e84cd4b1ca64c/grodecoder.py#L403-L404

use:

atom_names = Counter(nx.get_node_attributes(graph, "atom_name").values())
atom_names = dict(atom_names.most_common())
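As a quick illustration of the ordering this gives, here is a minimal sketch that uses only the standard library. The `atom_name_by_node` dict is a hypothetical stand-in for what `nx.get_node_attributes(graph, "atom_name")` returns (a mapping from node id to atom name). The values are made up for the example.

```python
from collections import Counter

# Hypothetical stand-in for nx.get_node_attributes(graph, "atom_name"):
# a mapping from node id to atom name (values invented for illustration).
atom_name_by_node = {1: "H", 2: "O", 3: "H", 4: "C", 5: "H", 6: "O"}

atom_names = Counter(atom_name_by_node.values())
# most_common() returns (name, count) pairs sorted by decreasing count.
# dict() keeps that insertion order (guaranteed since Python 3.7).
atom_names = dict(atom_names.most_common())

print(atom_names)  # {'H': 3, 'O': 2, 'C': 1}
```

Atoms with equal counts keep the order in which they were first encountered, so the result still depends on node order when there are ties.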

Also, the extraction of the residue sequence could be simplified by replacing:

res_names = []
current_residue_id = None
for _, node_attr in sorted(graph.nodes.items(), key=lambda x: x[0]):
    residue_id = node_attr["residue_id"]
    res_name = node_attr["residue_name"]
    if residue_id != current_residue_id:
        res_names.append(mol_def.AMINO_ACID_DICT.get(res_name, res_name))
        current_residue_id = residue_id

with

# Get all (residue id, residue name) pairs.
residue_pairs = zip(
    nx.get_node_attributes(graph, "residue_id").values(),
    nx.get_node_attributes(graph, "residue_name").values()
)
# Convert to a dictionary to keep a single entry per residue id (dict keys are unique).
residue_pairs_dict = dict(residue_pairs)
# Then extract residue names ordered by residue ids:
residue_names = [residue_pairs_dict[key] for key in sorted(residue_pairs_dict)]

Note that residue_names contains the sequence as it would appear in a protein: if two "ALA" residues follow each other, "ALA" appears twice in a row.
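To check that behaviour, here is a small sketch with plain dicts standing in for the two `nx.get_node_attributes()` calls. It assumes every node carries both attributes, so that the two value sequences line up when zipped. The proposed version does not apply the `mol_def.AMINO_ACID_DICT.get(res_name, res_name)` lookup that the current loop performs. If that mapping is still wanted, it could be applied afterwards, as in the last line. The `AMINO_ACID_DICT` content below is a hypothetical subset, not the actual one from `mol_def`.

```python
# Hypothetical stand-ins for nx.get_node_attributes(graph, "residue_id")
# and nx.get_node_attributes(graph, "residue_name"): node id -> attribute.
node_residue_ids = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
node_residue_names = {1: "ALA", 2: "ALA", 3: "ALA", 4: "ALA", 5: "GLY"}

# Hypothetical subset of a residue-name mapping (assumption, for illustration).
AMINO_ACID_DICT = {"ALA": "A", "GLY": "G"}

# One (residue id, residue name) pair per atom.
residue_pairs = zip(node_residue_ids.values(), node_residue_names.values())
# Keys are unique, so each residue id is kept only once.
residue_pairs_dict = dict(residue_pairs)
# Residue names ordered by residue id.
residue_names = [residue_pairs_dict[key] for key in sorted(residue_pairs_dict)]
print(residue_names)  # ['ALA', 'ALA', 'GLY']

# Optional: apply the same fallback lookup as the original loop.
mapped_names = [AMINO_ACID_DICT.get(name, name) for name in residue_names]
print(mapped_names)  # ['A', 'A', 'G']
```

Residues 1 and 2 are both "ALA" and both are kept, because deduplication happens on the residue id and not on the name.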

Finally, the count of degrees:

# Example:
# graph.degree = [(1, 1), (2, 3), (3, 1), (4, 2), (5, 1)]
# dict(graph.degree) = {1: 1, 2: 3, 3: 1, 4: 2, 5: 1}
# dict(graph.degree).values() = [1, 3, 1, 2, 1]
# ==> graph_degrees_dict = {1: 3, 2: 1, 3: 1}
graph_degrees_dict = dict(Counter(sorted(dict(graph.degree).values())))

could be replaced by:

# Example:
# graph.degree = [(1, 1), (2, 3), (3, 1), (4, 2), (5, 1)]
graph_degrees_dict = dict(
    Counter([degree for _, degree in graph.degree])
    .most_common()
)
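For comparison, here is a sketch that runs both versions on the example above. A plain list of `(node, degree)` tuples stands in for `graph.degree` (an assumption made so the snippet runs without networkx). The names `by_degree` and `by_frequency` are only used for this illustration.

```python
from collections import Counter

# Stand-in for graph.degree: an iterable of (node, degree) pairs.
degree_view = [(1, 1), (2, 3), (3, 1), (4, 2), (5, 1)]

# Current version: keys end up ordered by degree value.
by_degree = dict(Counter(sorted(dict(degree_view).values())))

# Proposed version: keys end up ordered by decreasing frequency.
by_frequency = dict(
    Counter([degree for _, degree in degree_view])
    .most_common()
)

print(by_degree)     # {1: 3, 2: 1, 3: 1}
print(by_frequency)  # {1: 3, 3: 1, 2: 1}
print(by_degree == by_frequency)  # True (dict equality ignores key order)
```

The two dicts hold the same counts, but degrees with equal frequency (here 2 and 3) appear in first-encountered order with `most_common()`. This may matter if the fingerprint is later compared as an ordered sequence or as a string rather than as a dict.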