Closed Palash123-4 closed 5 months ago
I am interested in computing the distance correlation between two random vectors, where one is continuous or a mixture of continuous and discrete components and the other is purely discrete. The 'naive' method requires a very large amount of memory. For that reason, I would like to implement the fast computation of distance covariance using random projections, as published in "https://www.sciencedirect.com/science/article/pii/S2352711023000225".
I think you mean the random projection method described in https://arxiv.org/abs/1701.06054 (the paper you have mentioned is the paper for this package, which does not mention random projections at all). If you manage to implement it and want to make a PR, that would be a useful addition to this package.
If I understand correctly, you need to call the univariate fast method in order to implement the random projection method. I still do not understand why you can't just cast the array to float beforehand, e.g. using astype or asfarray, as the required computations cannot be done using only integer arithmetic anyway. Is there a problem with that approach?
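For reference, a minimal sketch of the cast suggested above. The helper name `as_float_sample` is hypothetical and not part of this package; note also that `asfarray` was removed in NumPy 2.0, so `astype` or `np.asarray(..., dtype=float)` is the safer choice on recent versions.

```python
import numpy as np


def as_float_sample(a):
    """Return `a` as a float64 array (hypothetical helper, not part of dcor).

    np.asarray avoids a copy when the input is already float64.
    """
    return np.asarray(a, dtype=np.float64)


# A purely discrete sample stored with an integer dtype.
x_int = np.array([0, 1, 1, 2, 3], dtype=np.int64)
# A continuous sample, already float64.
y = np.array([0.1, 0.9, 1.2, 2.1, 2.8])

x = as_float_sample(x_int)           # same as x_int.astype(np.float64)
assert x.dtype == y.dtype == np.float64
```

After the cast, both arrays share the float dtype that the explicitly typed fast routines expect, and the values themselves are unchanged.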
First of all, I am very sorry for the incorrect citation of the paper.
You have identified exactly what I am trying to implement. Yes, my intention is to implement the random projection (to transform the multivariate data to univariate) and then use the fast implementation, either with 'AVL' or 'mergesort'. I will give you an update as soon as I have completed that.
I understood that I had to change the type of the array, which I did, and now I don't have any execution issues. Thanks again for the clarification.
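A rough sketch of the idea, under stated assumptions: the function names are hypothetical, the univariate step here is a naive O(n^2) estimator used only to keep the example self-contained (the fast 'AVL' or 'mergesort' method would take its place), and the scaling constant C_d = sqrt(pi) * Gamma((d+1)/2) / Gamma(d/2) is the one I understand the random projection paper to use. This is not the code of this package.

```python
import math

import numpy as np


def dcov_sqr_1d(x, y):
    """Naive O(n^2) squared sample distance covariance of two 1-D samples."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-center both distance matrices.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    return (A * B).mean()


def sphere_constant(d):
    """C_d such that C_d * E|u.x| = |x| for u uniform on the unit sphere in R^d."""
    return math.sqrt(math.pi) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))


def rp_dcov_sqr(x, y, n_projections=50, seed=None):
    """Average univariate distance covariances over random 1-D projections."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    p, q = x.shape[1], y.shape[1]
    total = 0.0
    for _ in range(n_projections):
        # Normalized Gaussian vectors are uniform on the unit sphere.
        u = rng.standard_normal(p)
        u /= np.linalg.norm(u)
        v = rng.standard_normal(q)
        v /= np.linalg.norm(v)
        total += dcov_sqr_1d(x @ u, y @ v)
    return sphere_constant(p) * sphere_constant(q) * total / n_projections
```

When both inputs have a single column, each projection is just a sign flip, so the result matches the direct univariate estimate; that makes a convenient sanity check.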
Hi, Carlos. I am pleased to tell you that I have successfully implemented the fast computation of the distance covariance estimate and the corresponding independence testing method, as proposed in "https://www.frontiersin.org/articles/10.3389/fams.2021.779841/full". I am sharing the Python implementation here: https://github.com/sca-research/Leakage_detection_testing/blob/main/Code/RP_dcor.py I have verified the code on multivariate normal data. Please have a look at it and let me know if you find any issues.
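For readers following along, here is a generic permutation test of independence based on distance covariance. It is only an illustrative sketch: it uses the naive O(n^2) statistic rather than random projections, the function names are hypothetical, and it is not necessarily the test procedure of the linked paper or of the linked code.

```python
import numpy as np


def _centered_distances(z):
    """Double-centered Euclidean distance matrix of a sample (rows = observations)."""
    z = np.asarray(z, dtype=np.float64)
    if z.ndim == 1:
        z = z[:, None]
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcov_permutation_test(x, y, n_permutations=199, seed=None):
    """Return (squared distance covariance, permutation p-value)."""
    rng = np.random.default_rng(seed)
    A = _centered_distances(x)
    B = _centered_distances(y)
    observed = (A * B).mean()
    n = A.shape[0]
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        # Permuting the observations of y permutes rows and columns of B together.
        if (A * B[np.ix_(perm, perm)]).mean() >= observed:
            count += 1
    # The +1 terms count the observed statistic itself, so the p-value is never 0.
    return observed, (count + 1) / (n_permutations + 1)
```

A small p-value suggests dependence between the two samples. Centering the distance matrices once and permuting indices avoids recomputing distances in every iteration.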
The test of independence code for multidimensional distance correlation as mentioned in "https://www.frontiersin.org/articles/10.3389/fams.2021.779841/full"
Did you see the Python code that I uploaded in my previous comment? Please let me know if you find any bugs; it would also be helpful for my current research.
Sorry, I had my thesis defense this week and I was not able to check the code. I will do it when I have more time (hopefully next week).
I had a peek at the paper and at your code. However, if you want to add it to this package, it would be better to submit it as a PR, both for authorship and to facilitate the review process.
Even if the code is small, the review could take a bit of time in order to integrate it well with existing code and conventions, add tests, and fix bugs/improve performance.
Can you please tell me what's the process to submit it as a PR?
Sure! In order to do that, you have to fork this project (using the fork button in the code tab). This creates a copy of the project under your GitHub account, which you can then modify.
If you plan to do more PRs in the future, it is better to create a branch from develop and make the modifications in that branch, but it is not strictly required. You need to modify your fork to include the new code. In this case the random projection estimator should probably be in the _dcor.py file and the hypothesis test in independence.py. Unit tests should be in the appropriate module of the "tests" subfolder.
After you commit the modifications, a button will pop up in the "Code" tab of this project (the original copy) that allows you to submit them as a PR. Alternatively, you can go to the "Pull requests" tab, click "New pull request" and choose the branch of your copy as the source.
I am busy this month, but I will try to finish this within the first two weeks of next month.
An update: I have added the code. When you find some time, let me know how I can contribute further.
Currently, they both need to be floats, as the signature of the Numba functions is typed explicitly. Is there any reason that makes casting to float before the call undesirable?