Also, the extraction of the residue sequence could be simplified by replacing:
res_names = []
current_residue_id = None
for _, node_attr in sorted(graph.nodes.items(), key=lambda x: x[0]):
    residue_id = node_attr["residue_id"]
    res_name = node_attr["residue_name"]
    if residue_id != current_residue_id:
        res_names.append(mol_def.AMINO_ACID_DICT.get(res_name, res_name))
        current_residue_id = residue_id
with
# Get all (residue id, residue name) pairs.
residue_pairs = zip(
    nx.get_node_attributes(graph, "residue_id").values(),
    nx.get_node_attributes(graph, "residue_name").values(),
)
# Convert to a dictionary to keep a single entry per residue id (dict keys are unique).
residue_pairs_dict = dict(residue_pairs)
# Then extract the residue names ordered by residue id:
residue_names = [residue_pairs_dict[key] for key in sorted(residue_pairs_dict)]
Note that residue_names contains the sequence as it would appear in the protein: if two "ALA" residues follow each other, "ALA" appears twice in a row.
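One point to keep in mind: the simplified version returns the raw residue names, whereas the original loop also translated them through mol_def.AMINO_ACID_DICT. If that translation is still wanted, it could be applied in the final step. Below is a minimal sketch of the idea on toy data. The function name residue_sequence, the toy_map dict (a hypothetical stand-in for mol_def.AMINO_ACID_DICT, whose real contents are not shown here) and the toy node attributes are all assumptions made for illustration.

```python
def residue_sequence(node_attributes, name_map=None):
    """Return residue names ordered by residue id, one entry per residue.

    node_attributes: mapping node -> attribute dict,
        for instance dict(graph.nodes(data=True)) with networkx.
    name_map: optional translation table (hypothetical stand-in for
        mol_def.AMINO_ACID_DICT).
    """
    name_map = name_map or {}
    # Keep a single entry per residue id: atoms of the same residue
    # share the same id and the same name, so overwriting is harmless.
    names_by_id = {
        attr["residue_id"]: attr["residue_name"]
        for attr in node_attributes.values()
    }
    # Names missing from name_map are kept unchanged, as in the original loop.
    return [
        name_map.get(names_by_id[res_id], names_by_id[res_id])
        for res_id in sorted(names_by_id)
    ]


# Toy data: two consecutive ALA residues followed by one GLY residue.
toy_nodes = {
    0: {"residue_id": 1, "residue_name": "ALA"},
    1: {"residue_id": 1, "residue_name": "ALA"},
    2: {"residue_id": 2, "residue_name": "ALA"},
    3: {"residue_id": 3, "residue_name": "GLY"},
}
toy_map = {"ALA": "A", "GLY": "G"}  # hypothetical subset of a translation table
sequence = residue_sequence(toy_nodes, toy_map)
```

Here residue 1 has two atoms but contributes a single entry, while residues 1 and 2 are both "ALA" and each contribute one entry, so "ALA" shows up twice in a row as intended.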
In get_graph_fingerprint2(): it is better to output atom counts ordered by frequency (from the most frequent atom to the least frequent).
Instead of:
https://github.com/pierrepo/grodecoder/blob/515dfaab8bccf6d32933f064c11e84cd4b1ca64c/grodecoder.py#L403-L404
use:
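One possible way to get counts ordered by frequency is collections.Counter and its most_common() method. This is only a sketch, not necessarily the exact replacement intended for the linked lines. The function name atom_counts_by_frequency and the toy atom list are hypothetical. In the real code, the atom names would presumably come from a node attribute of the graph (its exact name is an assumption not confirmed here).

```python
from collections import Counter


def atom_counts_by_frequency(atom_names):
    """Return (atom, count) pairs, from the most to the least frequent atom."""
    # most_common() sorts by decreasing count; atoms with equal counts
    # keep the order in which they were first seen.
    return Counter(atom_names).most_common()


# Toy data standing in for the atom names of a small molecule graph.
toy_atoms = ["C", "H", "H", "O", "H", "C"]
counts = atom_counts_by_frequency(toy_atoms)
```

Returning a list of pairs rather than a plain dict makes the ordering explicit, which helps if the fingerprint is later compared or printed.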
Finally, the count of degrees:
could be replaced by:
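As an illustration of what a compact degree count can look like, here is a sketch based on collections.Counter. It is an assumption about the general shape of such a replacement, not the reviewer's actual suggestion. The function name degree_counts and the toy data are hypothetical. The input is an iterable of (node, degree) pairs, which is what graph.degree() yields in networkx.

```python
from collections import Counter


def degree_counts(degree_pairs):
    """Count how many nodes have each degree.

    degree_pairs: iterable of (node, degree) pairs,
        for instance graph.degree() with networkx.
    """
    counts = Counter(degree for _, degree in degree_pairs)
    # Sort by degree so that the result does not depend on node order.
    return dict(sorted(counts.items()))


# Toy data: nodes "a" and "c" have degree 1, "b" has degree 2, "d" has degree 3.
toy_degrees = [("a", 1), ("b", 2), ("c", 1), ("d", 3)]
result = degree_counts(toy_degrees)
```

Sorting the keys keeps the output stable across graphs whose nodes were inserted in a different order, which matters if the degree counts are part of a fingerprint.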