tristanic / pae_to_domains

Graph-based community clustering approach to extract protein domains from a predicted aligned error matrix
MIT License

Setting seed - reproducible number of cluster #2

Open Ni-Ar opened 2 years ago

Ni-Ar commented 2 years ago

Hi,

I was playing around with the different threshold parameters and realised that the same run can yield a different number of clusters:

python pae_to_domains.py *_predicted_aligned_error_v1.json --pae_cutoff 10 --pae_power 1 --resolution 1
Wrote 13 clusters to clusters.csv. Biggest cluster contains 54 residues. Run time was 0.86 seconds.

python pae_to_domains.py *_predicted_aligned_error_v1.json --pae_cutoff 10 --pae_power 1 --resolution 1
Wrote 14 clusters to clusters.csv. Biggest cluster contains 54 residues. Run time was 0.73 seconds.

Is this something you also observed on the same input pae json file? I suspect this comes from the igraph community_leiden step, but I might be wrong.

tristanic commented 2 years ago

Yes, the iGraph algorithm does involve some randomness - see the Python docstring at https://github.com/igraph/python-igraph/blob/950d61a1c4dec3d0793c3b5327f154d64009f536/src/igraph/__init__.py#L1587 and the underlying C function at https://github.com/igraph/igraph/blob/6798f825df7712f1351aa7ec1a6e56ecdf1bde26/src/community/leiden.c#L906. Unfortunately it doesn't provide an option to specify the random seed, so some slight variation from run to run will be expected. The NetworkX algorithm is (to my knowledge) deterministic, albeit quite a bit slower due to being pure Python.

Ni-Ar commented 2 years ago

Hi @tristanic,

Thanks a lot for explaining. Probably the easiest way to control for this is to run both the iGraph and NetworkX algorithms and use the overlap. Have you tried that? Does that work in your opinion?
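One way the "overlap" idea could be sketched is to intersect every cluster from one run with every cluster from the other and keep only the residues both methods group together. This is an illustration only: `overlap_clusters`, its `min_size` parameter, and the example data are hypothetical and not part of pae_to_domains.py.

```python
from itertools import product


def overlap_clusters(clusters_a, clusters_b, min_size=1):
    """Intersect every cluster from one run with every cluster from another.

    Both arguments are iterables of residue-index collections. Residues that
    both methods place together end up in the same output set.
    """
    overlaps = []
    for a, b in product(clusters_a, clusters_b):
        shared = set(a) & set(b)
        # Skip empty intersections and those below the size threshold
        if shared and len(shared) >= min_size:
            overlaps.append(frozenset(shared))
    # Largest first; ties broken by the smallest residue index
    return sorted(overlaps, key=lambda c: (-len(c), min(c)))


# Made-up example data, NOT real output from pae_to_domains.py
igraph_clusters = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8}]
networkx_clusters = [{0, 1, 2}, {3, 4, 5, 6}, {7, 8}]

consensus = overlap_clusters(igraph_clusters, networkx_clusters, min_size=2)
```

In this toy case residue 3 is dropped, because the two clusterings disagree about which group it belongs to. Whether discarding such residues is acceptable depends on what the domains are used for.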

Thanks, Nicco

tristanic commented 2 years ago

Could do. To be honest I think there's quite a lot more that could be done here, but my original intent for this was as a quick proof-of-principle that others could pick up and develop further (it's interesting, but a bit tangential to my day-to-day work). So feel free to explore!

Ni-Ar commented 2 years ago

This is also tangential to my day-to-day work :D I really like the idea, but yeah, a bit more exploration is needed on my side for what I'd like to use it for. Thanks for sharing this. Let's see how much exploration time I can dedicate to this :)

fdaqin commented 1 year ago

Hi guys, it's simple to set a seed so that the same run gives the same output. Just add the following lines before the clustering step:

    import random
    random.seed(1234)

Ni-Ar commented 1 year ago

Have you tried that or are you just guessing it would make the results reproducible?

fdaqin commented 1 year ago

Yeah, I got the same results. You can give it a try:

    if lib == 'igraph':
        f = domains_from_pae_matrix_igraph
    else:
        f = domains_from_pae_matrix_networkx

    # Seed Python's global RNG before clustering so repeated runs match
    import random
    random.seed(1234)

    clusters = f(pae, pae_power=args.pae_power, pae_cutoff=args.pae_cutoff, graph_resolution=args.resolution)

Ni-Ar commented 1 year ago

Thanks! I'll give it a try
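A quick way to check the seed suggestion is to reseed before each of several runs and compare the outputs. The sketch below uses a stand-in function, `toy_random_clustering`, which is hypothetical and is not the Leiden algorithm; `is_reproducible` is likewise an illustrative helper. It only demonstrates the pattern, and it assumes the clustering step draws its randomness from Python's `random` module, which the report above suggests is the case for the igraph path.

```python
import random


def toy_random_clustering(n_items, n_clusters):
    """Stand-in for a randomized clustering step (NOT the Leiden algorithm).

    Assigns each item to a randomly chosen cluster using Python's global RNG.
    """
    return [random.randrange(n_clusters) for _ in range(n_items)]


def is_reproducible(cluster_fn, seed, n_runs=5):
    """Reseed the global RNG before each run and compare all outputs."""
    results = []
    for _ in range(n_runs):
        random.seed(seed)  # same seed before every run
        results.append(cluster_fn())
    return all(r == results[0] for r in results)


seeded_ok = is_reproducible(lambda: toy_random_clustering(50, 4), seed=1234)
```

To test the real script, the lambda could be replaced with a call to `domains_from_pae_matrix_igraph` on a loaded PAE matrix. If the check fails there, the randomness is coming from a source that `random.seed` does not control.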