sissa-data-science / DADApy

Distance-based Analysis of DAta-manifolds in python
https://dadapy.readthedocs.io/
Apache License 2.0
103 stars 18 forks source link

compute_id_binomial_k defined twice with inconsistent signature #94

Closed diegodoimo closed 8 months ago

diegodoimo commented 1 year ago

Subject of the issue

The compute_id_binomial_k method is defined twice: once in the id_discrete.py module and the other in the id_estimation.py module. The two functions take different numbers/types of inputs and the implementation in the id_discrete.py module is overridden by that in id_estimation.py.

Steps to reproduce

from dadapy.data import Data import numpy as np

x = np.random.normal(size = (100, 2)) d = Data(coordinates = x) idstmp, , _ = d.compute_id_binomial_k(k = 3)

Expected behavior

Correctly compute the ID with the discrete routine.

Actual behavior

TypeError: IdEstimation.compute_id_binomial_k() missing 1 required positional argument: 'r'

AldoGl commented 1 year ago

Thanks @diegodoimo! These are in fact two different estimators, both based on the binomial distribution but one working in continuous spaces and the other in discrete spaces.

The "Data" class only has access to the continuous estimator, since it inherits from Clustering but not from IdDiscrete. To access the discrete estimator you need to instantiate an IdDiscrete class and call its methods.

Currently I think this makes sense, since for discrete spaces we do not have many algorithms implemented. In the future we might think of ways of integrating all methods under a single framework.

@imacocco any thoughts on this?

imacocco commented 1 year ago

That's precisely how you pointed it out, I named the functions the same way because they are conceptually identical but operate in two different frameworks. At the moment the routines for the discrete estimator do not interact with those in the continuum and have to be called explicitly by the module id_discrete.py. We still might want to name them differently if this can be confusing

diegodoimo commented 1 year ago

Ok. Personally, I would change the name of one of the two anyway. I believe that two different functions, however similar they may be to each other, should have different names. Moreover, if in the future we will integrate it with the rest, maybe we will have to do it in any case.

diegodoimo commented 1 year ago

I have another issue with the discrete implementation, which I just tested: if I do the following:

****

from dadapy.id_discrete import IdDiscrete as ie d = ie(distances = (dist, dist_index)) idstmp, , _ = d.compute_id_binomial_k(k = 3)

**

I got the following error: "AssertionError: select a proper ratio 0<r<1"

I expect to be able to run the function correctly as I have initialized the class correctly and called the function with all the required mandatory arguments (k, only in this case)

imacocco commented 1 year ago

Just fixed the problems you mentioned, now it should work properly. Waiting for feedback before closing the issue

diegodoimo commented 1 year ago

Good for me. Let's wait for @AldoGl

AldoGl commented 1 year ago

Thanks for the change @imacocco. I see you simply set the default to r=0.5. This is ok, but have you considered setting r to the optimal value provided my the minimisation of the Cramer-Rao bound?

imacocco commented 1 year ago

The problem is that r_opt depends on the intrinsic dimension, in such case we should write an iterative function. And it is well defined in the continuum, on the discrete we might just extrapolate such behaviour

AldoGl commented 1 year ago

Yes you are totally right! That would require some kind of iteration and it's not to be included there. Sorry :)