lisitsyn / tapkee

A flexible and efficient C++ template library for dimension reduction
http://tapkee.lisitsyn.me
BSD 3-Clause "New" or "Revised" License

UMap support #110

Open MohammadFakhreddin opened 2 weeks ago

MohammadFakhreddin commented 2 weeks ago

Hello,

First of all, I want to thank you for this library. I've been looking for a library that I can use to integrate dimensionality reduction techniques into the tool for our paper, and this is perfect for that. (I will make sure to cite it :))

I would like to ask about the current status of UMAP support. Is it ready to use?

Also, as a side question, are you guys aware of any good library for k-NN classification?

iglesias commented 2 weeks ago

Hello @MohammadFakhreddin,

Thank you for your interest; it's nice to read that you are finding the library useful.

I might be able to help with the k-NN question. tapkee already includes tree data structures for nearest-neighbor search, since several dimensionality reduction techniques are based on nearest neighbors, but you could also take a look at Shogun. This notebook should help to get a quick idea of how you can do k-NN in Shogun using the Python interface. Even though the notebook is about LMNN (you can think of it as an extension to k-NN), see e.g. code cell [14] for an example applying k-NN to a metagenomics dataset.

So :) if you are already using tapkee and feel like diving into its code a bit, you will find the NN code and should eventually be able to modify it to get a k-NN classifier, without needing any other library. If you want a more readily available solution where you can simply call a k-NN classifier, Shogun could be a better fit, but it may take some effort to get it working, and that effort can vary widely depending on your system (OS, package manager, compiler, ...) and on the version of Shogun you would like to use.
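As a rough illustration of the "no other library" route, here is a minimal brute-force sketch that uses only the C++ standard library. It is not tapkee's or Shogun's API: `squared_distance` and `knn_classify` are hypothetical names, and the code scans all training points linearly rather than using a tree structure like the ones tapkee includes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical helper (not part of tapkee): squared Euclidean distance.
inline double squared_distance(const std::vector<double>& a,
                               const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Brute-force k-NN classifier (hypothetical name): returns the majority
// label among the k training points closest to `query`.
// Vote ties go to the smallest label; an empty training set yields -1.
inline int knn_classify(const std::vector<std::vector<double>>& points,
                        const std::vector<int>& labels,
                        const std::vector<double>& query,
                        std::size_t k)
{
    assert(points.size() == labels.size());

    // Pair every training point's distance to the query with its label.
    std::vector<std::pair<double, int>> dist_label;
    dist_label.reserve(points.size());
    for (std::size_t i = 0; i < points.size(); ++i)
        dist_label.emplace_back(squared_distance(points[i], query), labels[i]);

    // Bring the k nearest points to the front of the vector.
    k = std::min(k, dist_label.size());
    std::partial_sort(dist_label.begin(),
                      dist_label.begin() + static_cast<std::ptrdiff_t>(k),
                      dist_label.end());

    // Count votes; std::map iterates labels in ascending order.
    std::map<int, std::size_t> votes;
    for (std::size_t i = 0; i < k; ++i)
        ++votes[dist_label[i].second];

    int best_label = -1;
    std::size_t best_count = 0;
    for (const auto& entry : votes)
    {
        if (entry.second > best_count)
        {
            best_label = entry.first;
            best_count = entry.second;
        }
    }
    return best_label;
}
```

The linear scan costs O(n) distance computations per query, which is fine for small datasets; for large ones, a tree-based neighbor search would be the natural replacement for the distance loop.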

MohammadFakhreddin commented 2 weeks ago

@iglesias Thanks a lot! I'll look into it and try to implement something based on that.

At the moment, I'm trying to keep the build as simple as possible, so I do my best to avoid a complex library. One of our goals is the project's accessibility. We have some prototypes in Python using Scikit Learn, but currently, my aim is the project's longevity and ease of build.

As a side note, I think the cmake minimum version is too high :)

Let me know if anyone knows about the current state of the UMAP library.

lisitsyn commented 2 weeks ago

Hi @MohammadFakhreddin

Thanks for reaching out! UMAP would be a good addition to the library, but neither of us has had enough time recently to implement it. As of now, there is no implementation, not even in a branch.

iglesias commented 2 weeks ago

Indeed. On the open source side, I am currently busy with CodeQL and with contributions to GitHub’s coding-standards repo. If I were to look into something in tapkee atm, I’d be more interested in a topic related to that (even loosely, such as safety with Circle, or just trying the new clang real-time sanitizer on it).

I recalled that for UMAP there was already this: https://github.com/lisitsyn/tapkee/pull/95

The UMAP Python repo on GitHub looks quite popular, and there’s also a C++ repo. What would be the goal of adding a new DR method to tapkee now? I wondered, and I couldn’t think of any besides completeness in tapkee.

MohammadFakhreddin commented 2 weeks ago

So, I integrated Tapkee into my project and fed it the dataset I used for testing PCA using OpenCV. Strangely, it took 3-4 seconds for OpenCV PCA, while for Tapkee, it took 4-5 minutes. I noticed that OpenCL was present in the cmake. Are you guys using OpenCL for optimization? Can it be that by not including OpenCL in my project, I made Tapkee much slower than OpenCV? (I used the Passiflora dataset, which is a very large dataset. It worked well with smaller datasets :))

cmake_minimum_required (VERSION 3.16)

project (Tapkee  LANGUAGES CXX)

set (CMAKE_CXX_STANDARD 23)
set (TAPKEE_INCLUDE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/include")

include_directories("${TAPKEE_INCLUDE_DIR}")

add_library(tapkee_library INTERFACE)
target_include_directories(tapkee_library INTERFACE "${TAPKEE_INCLUDE_DIR}")

iglesias commented 1 week ago

Hello @MohammadFakhreddin,

Assuming I understood your message and questions correctly after reading them a few times: a comparison between OpenCV with GPU acceleration and Tapkee without it would, provided that PCA is amenable to data-parallelism, obviously show a large difference on a large dataset.
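To make the data-parallelism point concrete, here is a minimal sketch of the covariance step of PCA, written so that each chunk of rows is accumulated independently and the partial results are merged at the end. It is not how tapkee or OpenCV implement PCA: `PartialSums`, `accumulate_chunk` and `covariance_by_chunks` are hypothetical names, the chunks run sequentially here, and the one-pass formula is chosen for brevity rather than numerical robustness.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Partial results for one chunk of rows (hypothetical type).
struct PartialSums
{
    std::size_t count = 0;
    std::vector<double> sum;   // per-feature sums, length d
    std::vector<double> outer; // sum of x * x^T, d * d values, row-major
};

// Accumulate rows [begin, end). Touches no shared state, so separate
// chunks could be handed to separate threads or GPU work-groups.
inline PartialSums accumulate_chunk(const std::vector<std::vector<double>>& rows,
                                    std::size_t begin, std::size_t end,
                                    std::size_t d)
{
    PartialSums p;
    p.sum.assign(d, 0.0);
    p.outer.assign(d * d, 0.0);
    for (std::size_t r = begin; r < end; ++r)
    {
        assert(rows[r].size() == d);
        ++p.count;
        for (std::size_t i = 0; i < d; ++i)
        {
            p.sum[i] += rows[r][i];
            for (std::size_t j = 0; j < d; ++j)
                p.outer[i * d + j] += rows[r][i] * rows[r][j];
        }
    }
    return p;
}

// Population covariance (divides by n), computed as E[x x^T] - mean * mean^T
// from merged per-chunk partial sums. Returns d * d values, row-major.
inline std::vector<double> covariance_by_chunks(
    const std::vector<std::vector<double>>& rows,
    std::size_t d, std::size_t chunk_size)
{
    assert(chunk_size > 0);
    PartialSums total;
    total.sum.assign(d, 0.0);
    total.outer.assign(d * d, 0.0);

    for (std::size_t begin = 0; begin < rows.size(); begin += chunk_size)
    {
        const std::size_t end = std::min(begin + chunk_size, rows.size());
        const PartialSums p = accumulate_chunk(rows, begin, end, d);

        // Merge step: plain element-wise addition of the partial sums.
        total.count += p.count;
        for (std::size_t i = 0; i < d; ++i)
            total.sum[i] += p.sum[i];
        for (std::size_t i = 0; i < d * d; ++i)
            total.outer[i] += p.outer[i];
    }

    std::vector<double> cov(d * d, 0.0);
    if (total.count == 0)
        return cov;
    const double n = static_cast<double>(total.count);
    for (std::size_t i = 0; i < d; ++i)
        for (std::size_t j = 0; j < d; ++j)
            cov[i * d + j] = total.outer[i * d + j] / n
                             - (total.sum[i] / n) * (total.sum[j] / n);
    return cov;
}
```

Because the merge is just addition, the result does not depend on how the rows are split, which is what makes this step a good fit for GPU or multi-threaded execution. The remaining eigendecomposition works on a d x d matrix and does not grow with the number of samples.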