tristanic / pae_to_domains

Graph-based community clustering approach to extract protein domains from a predicted aligned error matrix
MIT License

Interpretation of output #4

Open ecremelie opened 3 months ago

ecremelie commented 3 months ago

Hi there,

Thank you for sharing this code! I have tried it on several samples, and I am a bit confused by the output in the .csv file:

256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,245,246,247,248,249,250,251,252,253,254,255,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Thank you in advance!

tristanic commented 3 months ago

Wow... it's been a while since I've looked at this code! The upshot is that these will be cases where the final folded domain isn't made up of contiguous stretches of protein (e.g. where there's a long unstructured loop, or where the chain leaves one domain and folds into another before returning to the first). The code is ultimately aiming to find groups of residues that AlphaFold believes move as near-rigid bodies. As far as the clustering code is concerned, the residue numbers are just unique labels on graph nodes, so they're returned as sets, and any residual ordering is more or less coincidental. If you want them to be ordered in the output .csv, you could do it by changing https://github.com/tristanic/pae_to_domains/blob/f407c6035c825f151a56f28bf803fcb44321b941/pae_to_domains.py#L135 to:

    clusters = [sorted(c) + ['']*(max_len-len(c)) for c in clusters]

Hope that makes sense!
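
For anyone reading the .csv by hand, here is a minimal sketch of the same idea outside the repository. It sorts each cluster, pads the rows to equal length as in the one-line change above, and collapses the residue numbers into contiguous runs so that split domains are easy to spot. The helper names `sort_and_pad` and `contiguous_segments` are hypothetical and not part of pae_to_domains. The first cluster reproduces the residue order of the first row quoted in this issue, and the second cluster is made up purely for illustration.

```python
import csv
import io


def sort_and_pad(clusters):
    """Sort each cluster and pad with '' so every CSV row has the same length."""
    max_len = max((len(c) for c in clusters), default=0)
    return [sorted(c) + [''] * (max_len - len(c)) for c in clusters]


def contiguous_segments(residues):
    """Collapse residue numbers into inclusive (start, end) runs."""
    segments = []
    for r in sorted(residues):
        if segments and r == segments[-1][1] + 1:
            # r extends the current run by one residue.
            segments[-1] = (segments[-1][0], r)
        else:
            # r starts a new run.
            segments.append((r, r))
    return segments


# First row from the issue: 256..310 followed by 245..255 (unordered).
issue_row = list(range(256, 311)) + list(range(245, 256))
# Assumption: an invented cluster whose chain leaves the domain and returns.
split_domain = {10, 11, 12, 40, 41, 42, 43}

clusters = [issue_row, split_domain]
rows = sort_and_pad(clusters)

# Write the padded rows to an in-memory CSV to mimic the output file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

segments = [contiguous_segments(c) for c in clusters]
for seg in segments:
    print(seg)
```

Under these assumptions, the first row from the issue collapses to a single run, 245 to 310, once sorted, while the invented second cluster yields two runs. More than one run per cluster is the signature of the non-contiguous case described above.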