Interpretation of output

Hi there,

Thank you for sharing this code! I have tried it on several samples and am a bit confused by the output in the CSV:

I understand that the domains are ordered by size, meaning that large domains are given in the first rows and small domains, often protein linkers, are given in the later rows.
I would expect the residue indices within each row (or domain) to be in numerical order. However, in several runs I have performed, this was not the case, and it looks as if two domains were pasted into one row. Here are two examples (they look similar, but they are from different samples):

256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,245,246,247,248,249,250,251,252,253,254,255,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Is there a specific reason or interpretation for this, or is it a bug? I am asking because I would like to extract the boundaries of each domain and linker from the first and last entry of each row.

Thank you in advance!

Wow... it's been a while since I've looked at this code! The upshot here is that these will be cases where the final folded domain isn't made up of contiguous stretches of protein (e.g. where there's a long unstructured loop, or where the chain leaves one domain, folds into another before returning to the first, etc.). The code is ultimately aiming to find groups of residues that AlphaFold believes move as near-rigid bodies. As far as the code doing the clustering is concerned the residue numbers are just unique labels on graph nodes, so they're returned as sets where any residual ordering is more-or-less coincidental. If you want them to be ordered in the output .csv, you could do it by changing https://github.com/tristanic/pae_to_domains/blob/f407c6035c825f151a56f28bf803fcb44321b941/pae_to_domains.py#L135 to:

    clusters = [sorted(c) + ['']*(max_len-len(c)) for c in clusters]
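
Because a domain can be non-contiguous, the first and last entry of a row may span residues that belong to another domain, even after sorting. Here is a minimal sketch of splitting each row into contiguous stretches instead. It assumes the CSV layout shown above (one domain per row, padded with empty cells). The helper names `contiguous_segments` and `read_domains` are hypothetical and not part of pae_to_domains.

```python
import csv
import io


def contiguous_segments(residues):
    """Split residue indices into sorted (start, end) runs of consecutive numbers."""
    segments = []
    for r in sorted(set(residues)):
        if segments and r == segments[-1][1] + 1:
            # Extend the current run by one residue.
            segments[-1] = (segments[-1][0], r)
        else:
            # Gap found (or first residue): start a new run.
            segments.append((r, r))
    return segments


def read_domains(csv_text):
    """Parse CSV text with one domain per row, ignoring empty padding cells."""
    domains = []
    for row in csv.reader(io.StringIO(csv_text)):
        residues = [int(cell) for cell in row if cell.strip()]
        if residues:
            domains.append(residues)
    return domains


# Made-up rows for illustration only (not taken from a real run).
example = "256,257,258,245,246,,,\n1,2,3,10,11,12,,\n"
for domain in read_domains(example):
    print(contiguous_segments(domain))
```

A row that yields a single segment is a contiguous domain, so its first and last sorted entries are safe to use as boundaries. The two rows quoted in the question each sort into one run (245 to 310 and 208 to 275), so those particular domains are contiguous once ordered. A row that yields several segments is a domain made of separate stretches of chain.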

Hope that makes sense!
