vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License

Can dcor with method 'AVL' or 'mergesort' be applied to two arrays of type float and integer, respectively, or do both always have to be float? #51

Closed Palash123-4 closed 5 months ago

vnmabus commented 1 year ago

Currently, they both need to be floats, because the signatures of the Numba functions are typed explicitly. Is there any reason why casting to float before the call would be undesirable?

Palash123-4 commented 1 year ago

I am interested in computing the distance correlation between two random vectors, where one is continuous (or a mixture of continuous and discrete) and the other is purely discrete. The 'naive' method requires a very large amount of memory. That is why I am interested in implementing the fast computation of distance covariance using random projections, as published in "https://www.sciencedirect.com/science/article/pii/S2352711023000225".
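To put a rough number on the memory concern, here is a back-of-the-envelope sketch. It assumes the naive method keeps one dense n-by-n float64 distance matrix per variable; the helper name `naive_distance_matrix_bytes` is made up for illustration, and the actual memory use of the package may differ.

```python
def naive_distance_matrix_bytes(n, itemsize=8, n_matrices=2):
    """Bytes needed for n_matrices dense n-by-n matrices (assumed layout)."""
    return n_matrices * n * n * itemsize


# 100,000 observations, two float64 distance matrices
print(naive_distance_matrix_bytes(100_000) / 1e9)  # gigabytes -> 160.0
```

The quadratic growth is what makes the O(n log n) univariate methods attractive for large samples.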

vnmabus commented 1 year ago

I think you mean the random projection method described in https://arxiv.org/abs/1701.06054 (the paper you mentioned is the paper for this package, which does not mention random projections at all). If you manage to implement it and want to make a PR, that would be a useful addition to this package.

If I understand correctly, you need to call the univariate fast method in order to implement the random projection method. I still do not see why you can't just cast the array to float beforehand, e.g. using astype or asfarray, since the required computations cannot be done using only integer arithmetic anyway. Is there a problem with that approach?
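A minimal sketch of the cast suggested above, using made-up sample data. The array names are illustrative, and the commented-out dcor call is shown only as the intended usage, assuming the package is installed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)          # continuous variable, already float64
y = rng.integers(0, 4, size=1000)  # purely discrete variable, integer dtype

# Cast the integer array to float before passing it to the fast methods.
# The values are unchanged; only the dtype differs.
y_float = y.astype(np.float64)

# With dcor installed, the call would then look like this:
# import dcor
# dcor.distance_correlation(x, y_float, method="AVL")
```

Small integers are represented exactly in float64, so the cast does not alter the pairwise distances.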

Palash123-4 commented 1 year ago

First of all, I am very sorry for the incorrect citation of the paper.

You have identified exactly what I am trying to implement. Yes, my intention is to implement the random projection (to transform the multivariate data to univariate) and then use the fast implementation with either 'AVL' or 'mergesort'. I will give you an update as soon as I have completed it.
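The plan above can be sketched roughly as follows. This is an illustrative outline based on the estimator in https://arxiv.org/abs/1701.06054, not the code shared in this thread. The names `sphere_constant`, `u_dcov_sqr_naive` and `rp_dcov_sqr` are hypothetical, and the univariate estimator used here is a plain O(n^2) NumPy stand-in for the fast 'AVL' or 'mergesort' methods.

```python
from math import gamma, pi, sqrt

import numpy as np


def sphere_constant(p):
    """C_p = sqrt(pi) * Gamma((p + 1) / 2) / Gamma(p / 2)."""
    return sqrt(pi) * gamma((p + 1) / 2) / gamma(p / 2)


def u_dcov_sqr_naive(x, y):
    """Unbiased squared distance covariance of two 1-D arrays, O(n^2).

    Stand-in for a fast univariate method; requires n > 3.
    """
    n = len(x)

    def u_center(a):
        d = np.abs(a[:, None] - a[None, :])
        row = d.sum(axis=1, keepdims=True) / (n - 2)
        col = d.sum(axis=0, keepdims=True) / (n - 2)
        total = d.sum() / ((n - 1) * (n - 2))
        c = d - row - col + total
        np.fill_diagonal(c, 0.0)  # U-centered matrices have a zero diagonal
        return c

    return (u_center(x) * u_center(y)).sum() / (n * (n - 3))


def rp_dcov_sqr(x, y, n_projections=50, univariate=u_dcov_sqr_naive, seed=None):
    """Average univariate estimates over random projections, then rescale."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64)  # cast integer data to float first
    y = np.asarray(y, dtype=np.float64)
    p, q = x.shape[1], y.shape[1]
    total = 0.0
    for _ in range(n_projections):
        u = rng.normal(size=p)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        v = rng.normal(size=q)
        v /= np.linalg.norm(v)
        total += univariate(x @ u, y @ v)
    return sphere_constant(p) * sphere_constant(q) * total / n_projections


rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                          # continuous block
y = (x[:, :2] > 0).astype(np.int64)                    # discrete, depends on x
print(rp_dcov_sqr(x, y, n_projections=20, seed=1))
```

To use the fast methods instead, the `univariate` argument could be swapped for a wrapper around dcor's univariate estimator, for example something like `lambda a, b: dcor.u_distance_covariance_sqr(a, b, method="AVL")`, assuming that function and keyword match the installed version.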

I understand that I have to change the type of the array, which I did, and now I no longer have any execution issues. Thanks again for the clarification.

Palash123-4 commented 1 year ago

Hi, Carlos. It's a pleasure to tell you that I have successfully implemented the fast computation of the distance covariance estimate and the corresponding independence testing method as proposed in "https://www.frontiersin.org/articles/10.3389/fams.2021.779841/full". I am sharing the Python implementation here: https://github.com/sca-research/Leakage_detection_testing/blob/main/Code/RP_dcor.py I verified the code on multivariate normal data. Please have a look at it and let me know if you find any issues.

Palash123-4 commented 1 year ago

Here is the code for the test of independence for multidimensional distance correlation, as described in "https://www.frontiersin.org/articles/10.3389/fams.2021.779841/full":

RP_dcor.zip

Palash123-4 commented 1 year ago

Did you find the Python code that I uploaded in my previous comment? Please let me know if you find any bugs; it would also be helpful for my current research.

vnmabus commented 1 year ago

Sorry, I had my thesis defense this week and I was not able to check the code. I will do it when I have more time (hopefully next week).

vnmabus commented 1 year ago

I had a peek at the paper and your code. However, if you want to add it to this package, it would be better to submit it as a PR, both for authorship and to facilitate the review process.

Even if the code is small, the review could take a bit of time in order to integrate it well with existing code and conventions, add tests, and fix bugs/improve performance.

Palash123-4 commented 1 year ago

Can you please tell me what the process is for submitting it as a PR?

vnmabus commented 1 year ago

Sure! In order to do that, you have to fork this project (using the fork button in the code tab). This creates a copy of the project under your GitHub account, which you can then modify.

If you plan to do more PRs in the future, it is better to create a branch from develop and make the modifications in that branch, but it is not strictly required. You need to modify your fork to include the new code. In this case the random projection estimator should probably be in the _dcor.py file and the hypothesis test in independence.py. Unit tests should be in the appropriate module of the "tests" subfolder.

After you commit the modifications, a button will pop up in the "Code" tab of this project (the original copy) that allows you to submit them as a PR. Alternatively, you can go to the "Pull requests" tab, click "New pull request" and choose the branch of your copy as the source.

Palash123-4 commented 1 year ago

I am busy this month, but I will try to finish this within the first two weeks of next month.

Palash123-4 commented 9 months ago

An update: I have added the code. When you find some time, let me know how I can contribute further.