sissa-data-science / DADApy

Distance-based Analysis of DAta-manifolds in python
https://dadapy.readthedocs.io/
Apache License 2.0
99 stars 16 forks source link

Dadac integration alt #107

Open lykos98 opened 6 months ago

lykos98 commented 6 months ago

Proposed changes

Optimization of ADP procedure and explicit parallelization of many tasks through openmp.

Types of changes

In the last year (or so) I have been working on my implementation of ADP due to the fact the original one, provided in dadapy was not able to scale well for my use case. Together with the help of @alexdepremia and @lucatornatore I developed a very small library in C (dadaC) which implemented the ADP procedure with the following optimizations:

The implementation has been tested against the one in dadapy and preserves a 1 to 1 match of the results in terms of cluster labels, border indices and densities. A detailed discussion, tests, and benchmarks can be found in the repository of my implementation https://github.com/lykos98/dadaC .

Integration in the library requires minimal modifications of the code to allow the implementation in C to interface with yours, actually it requires to add dadaC as dependency, since dadaC requires to build a small .so file that is then used by an interface in python built with ctypes. I kindly ask you if this can be in line with your development, and I am open to modify my implementation to comply with your conventions if needed.

Thank you for your attention, let me know your opinion. F

AldoGl commented 5 months ago

Wow!! Thank you @lykos98, I had a look at your dadaC repo, at the benchmarks, and at you your PR. It all seems excellent work to me. I was considering asking you to contribute the entire dadaC codebase in dadapy, but probably is better to keep them separate as you originally designed. I would ask you to make the tests pass, then I will have a closer look as soon as I have time before merging but it should not take much

lykos98 commented 5 months ago

Goodafternoon, I fixed the issue on the tests and run them in my fork, everything seems fine. The issue was linked to a leftover, I overlooked in the diff.

I think that ideally the codebase of dadac should be included but when I tried to perform that kind of integration it was not easy to deal with the inclusion of the .so file needed for dadac to run. In the next future moreover, I think I will also ship dadac as a proper package with precompiled binaries.

Thank you! F

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 90.74074% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 81.17%. Comparing base (f0b3fdd) to head (d065ae1).

Files Patch % Lines
dadapy/clustering.py 90.74% 5 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #107 +/- ## ========================================== + Coverage 81.07% 81.17% +0.10% ========================================== Files 17 17 Lines 3006 3034 +28 ========================================== + Hits 2437 2463 +26 - Misses 569 571 +2 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.