sararselitsky / FastPG

Fast phenograph, CyTOF
Other
25 stars 6 forks source link

it needs the seed to replicate the results #16

Closed abbatidanilo closed 2 years ago

abbatidanilo commented 2 years ago

Dear Sara, My name is Danilo Abbati and I am working as a data analyst in Chiara Bonini's lab at the San Raffaele hospital in Milan. Recently I published a paper on the European Journal of Immunology about the usage of cytoChain (https://doi.org/10.1002/eji.202049103). The web-app collects most of the best-in-breed packages which can be useful to manipulate and to analyze the cytometry dataset. It is basically an interface for all the researchers which are involved in high dimensional cytometry experiments, and it is still growing and updating together with the improvements and the advancements in the informatic community. For this is reason I suddenly try to implement your valuable fastPG package and I am impressed about its speed and performances. Of course I will put all the references of your package inside cytoChain, as soon as I will be able to put the web-app in production. The only big limitation is that inside the fastPG::fastCluster I cannot enter any seed value, in order to make the analysis reproducible: it is a fundamental feature for a complete analysis. Is there any way to implement it, in a future release? Thank you and my compliments (don't hesitate to contact me if you want to use cytoChain) all my best Danilo

sararselitsky commented 2 years ago

Thanks for your interest, Danilo! I'm glad that you like FastPG! I hope we can find a way to make this reproducible.

Stuart, is there a way to set a seed? Mahantesh, can Grappolo be reproduced with a seed?

On Fri, Oct 1, 2021 at 10:15 AM abbati @.***> wrote:

Dear Sara, My name is Danilo Abbati and I am working as a data analyst in Chiara Bonini's lab at the San Raffaele hospital in Milan. Recently I published a paper on the European Journal of Immunology about the usage of cytoChain (https://doi.org/10.1002/eji.202049103). The web-app collects most of the best-in-breed packages which can be useful to manipulate and to analyze the cytometry dataset. It is basically an interface for all the researchers which are involved in high dimensional cytometry experiments, and it is still growing and updating together with the improvements and the advancements in the informatic community. For this is reason I suddenly try to implement your valuable fastPG package and I am impressed about its speed and performances. Of course I will put all the references of your package inside cytoChain, as soon as I will be able to put the web-app in production. The only big limitation is that inside the fastPG::fastCluster I cannot enter any seed value, in order to make the analysis reproducible: it is a fundamental feature for a complete analysis. Is there any way to implement it, in a future release? Thank you and my compliments (don't hesitate to contact me if you want to use cytoChain) all my best Danilo

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/sararselitsky/FastPG/issues/16, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACQD6T3MPW4T6MYMOABNYNLUEW7AXANCNFSM5FEZPTQQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

sararselitsky commented 2 years ago

Deterministic behavior for parallel algorithms is hard. The RcppHNSW package that FastPG relies on is not deterministic, so FastPG will not be, even if Grappolo was. It might be possible for RcppHNSW to be made deterministic, but that is not up to us. I will post an issue with them asking about this.

jefferys commented 2 years ago

Using the data as described in the readme, the hnsw_knn() function from the RcppHNSW package that FastPG uses internally does not generate deterministic results:

set.seed( 1234 )
all_knn <- RcppHNSW::hnsw_knn(
     data, k= k, distance= 'l2', M= 16, ef_construction= 200,
     ef= k, verbose= FALSE, progress= 'bar', n_threads = num_threads,
     grain_size = 1)

set.seed( 1234 )
all_knn2 <- RcppHNSW::hnsw_knn(
     data, k= k, distance= 'l2', M= 16, ef_construction= 200,
     ef= k, verbose= FALSE, progress= 'bar', n_threads = num_threads,
     grain_size = 1)

sum(all_knn$idx - all_knn2$idx)
#> [1] 8208276    # Should be 0!
abbatidanilo commented 2 years ago

Thank you Sara, probably all_knn performs several random choices for each iteration: the set.seed() must be initialized before every random choice. Definitely this is not bug. But yes, it is an issue... cheers

jefferys commented 2 years ago

The underlying python hnsw library that RcppHNSW uses is not deterministic in parallel. It is deterministic in single threaded mode, but that defeats the purpose of FastPG :)

sararselitsky commented 2 years ago

Hi Sara,

[I see that you did not include Danilo in the email chain].

Grappolo --- the algorithm is non-deterministic in nature. The nondeterminism emerges from the dynamic aspect of how different threads process vertices (parallelism), and not because of the use of pseudorandom number generator used in the implementation. Additional nondeterminism arises from the use of graph coloring, which is inherently random in nature (using pseudorandom numbers specifically).

Hope this helps ---- making Grappolo nondeterministic will make it serial and therefore nonscalable. With that said, the quality is independent of this nondeterminism. In a way, as long as vertex x and y are in the same cluster (if in the ground truth), we don’t care what cluster id they get.

Regards, Mahantesh (he/him)

On 10/1/21, 10:12 AM, "Jefferys, Stuart R" @.***> wrote:

Check twice before you click! This email originated from outside PNNL.

Deterministic behavior for parallel algorithms is hard. The RcppHNSW package that FastPG relies on is not deterministic, so FastPG will not be, even if Grappolo was. It might be possible for RcppHNSW to be made deterministic, but that is not up to us. I will post an issue with them asking about this.
sararselitsky commented 2 years ago

I think our emails crossed :)

Stuart.

On Oct 1, 2021, at 2:39 PM, Halappanavar, Mahantesh M @.***> wrote:

Hi Sara,

[I see that you did not include Danilo in the email chain].

Grappolo --- the algorithm is non-deterministic in nature. The nondeterminism emerges from the dynamic aspect of how different threads process vertices (parallelism), and not because of the use of pseudorandom number generator used in the implementation. Additional nondeterminism arises from the use of graph coloring, which is inherently random in nature (using pseudorandom numbers specifically).

Hope this helps ---- making Grappolo nondeterministic will make it serial and therefore nonscalable. With that said, the quality is independent of this nondeterminism. In a way, as long as vertex x and y are in the same cluster (if in the ground truth), we don’t care what cluster id they get.

Regards, Mahantesh (he/him)

On 10/1/21, 10:12 AM, "Jefferys, Stuart R" @.***> wrote:

Check twice before you click! This email originated from outside PNNL.

Deterministic behavior for parallel algorithms is hard. The RcppHNSW package that FastPG relies on is not deterministic, so FastPG will not be, even if Grappolo was. It might be possible for RcppHNSW to be made deterministic, but that is not up to us. I will post an issue with them asking about this.

sararselitsky commented 2 years ago

Indeed __

I think you can simply set OMP_NUM_THREADS to 1 (export OMP_NUM_THREADS=1 or setenv OMP_NUM_THREADS 1), and turn coloring off (remove -c 1 from the execution call). The minimum label heuristic we use should make the algorithm deterministic when run by a single thread. I do not have an analytical proof of this or thought about it well to say for sure (I will check our paper from 2015 to see if said something).

Regards, Mahantesh (he/him)

On 10/1/21, 11:47 AM, "Jefferys, Stuart R" @.***> wrote:

I think our emails crossed :)

Stuart.

> On Oct 1, 2021, at 2:39 PM, Halappanavar, Mahantesh M ***@***.***> wrote:
> 
> Hi Sara,
> 
> [I see that you did not include Danilo in the email chain].
> 
> Grappolo --- the algorithm is non-deterministic in nature. The nondeterminism emerges from the dynamic aspect of how different threads process vertices (parallelism), and not because of the use of pseudorandom number generator used in the implementation. Additional nondeterminism arises from the use of graph coloring, which is inherently random in nature (using pseudorandom numbers specifically). 
> 
> Hope this helps ---- making Grappolo nondeterministic will make it serial and therefore nonscalable. With that said, the quality is independent of this nondeterminism. In a way, as long as vertex x and y are in the same cluster (if in the ground truth), we don’t care what cluster id they get. 
> 
> Regards,
> Mahantesh (he/him) 
> 
> On 10/1/21, 10:12 AM, "Jefferys, Stuart R" ***@***.***> wrote:
> 
>    Check twice before you click! This email originated from outside PNNL.
> 
> 
>    Deterministic behavior for parallel algorithms is hard. The RcppHNSW package that FastPG relies on is not deterministic, so FastPG will not be, even if Grappolo was. It might be possible for RcppHNSW to be made deterministic, but that is not up to us. I will post an issue with them asking about this.
> 
>