scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

Is it feasible to apply DESlib methods on Big Data (voxel-wise classification)? #166

Closed sara-eb closed 5 years ago

sara-eb commented 5 years ago

Hi,

I am working on a segmentation task in medical images. I was trying to apply DS methods to my research problem, voxel-wise classification. I have extracted specific features from the voxels inside a mask area (region of interest).

The number of instances for each patient comes to around 500K. As a starting point, I chose two patients for training and one patient for validation/test (DSEL data). Next, I created a pool of classifiers including SVC, KNeighborsClassifier, and GaussianNB. Then I used DESClustering and OLA as DS methods. Fitting the DS models on the DSEL data took a very long time (almost 12 hours), which cannot be applied in practical situations. A minimal sketch of this setup is shown below.
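For reference, here is a minimal, runnable sketch of the setup described above; synthetic data stands in for the extracted voxel features, and the variable names are illustrative:

```python
# Sketch of the described setup: a heterogeneous pool fitted on two
# "training patients", with DS methods fitted on a "DSEL patient".
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from deslib.dcs.ola import OLA
from deslib.des.des_clustering import DESClustering

# Stand-in for the voxel features (the real data is ~500K instances
# per patient).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, y_train = X[:2000], y[:2000]
X_dsel, y_dsel = X[2000:], y[2000:]

# Pool of heterogeneous classifiers, fitted on the training patients.
pool = [SVC(), KNeighborsClassifier(), GaussianNB()]
for clf in pool:
    clf.fit(X_train, y_train)

# Fit the DS methods on the DSEL data; at ~500K voxels per patient,
# this is the step that becomes very slow.
ola = OLA(pool_classifiers=pool).fit(X_dsel, y_dsel)
des_clustering = DESClustering(pool_classifiers=pool).fit(X_dsel, y_dsel)
```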

I was hoping to apply these methods to decrease the number of false positives; however, the computational and time complexity is high.

  1. What do you suggest?

  2. Is it really feasible to apply DS methods on big data, which includes millions of instances?

Thanks

Menelau commented 5 years ago

Hello,

Usually, when handling large datasets, the bottleneck in the DS methods is the calculation of the region of competence: sklearn's similarity search becomes slow when you have a lot of data with high dimensionality (so the KD-tree algorithm does not work well). It is very likely that this is the cause of your simulation being extremely slow.

Because of that, we also allow a faster similarity search for the region of competence estimation using the Faiss library (https://github.com/facebookresearch/faiss), which scales to very large datasets. That is the only way I can see to apply DS methods to this type of dataset.

You can check an example comparing the use of Faiss instead of sklearn with the DS methods in the following benchmark: https://github.com/scikit-learn-contrib/DESlib/blob/master/benchmarks/bench_speed_faiss.py It compares the speed of Faiss vs. sklearn similarity search on the Higgs dataset, which consists of 11 million data points. Note that Faiss also allows similarity search on GPU, which could make the definition of the regions of competence even faster for very large datasets.
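As a hedged sketch of what the switch looks like in code (reusing `pool`, `X_dsel`, and `y_dsel` from the sketch in the question above): DESlib exposes a `knn_classifier` parameter for this, and the `'faiss'` option requires the faiss package to be installed.

```python
from deslib.dcs.ola import OLA

# knn_classifier='faiss' replaces the sklearn-based neighbour search
# used for the region of competence with a Faiss index, which scales
# to millions of instances (and can run on GPU).
ola_faiss = OLA(pool_classifiers=pool, knn_classifier='faiss')
ola_faiss.fit(X_dsel, y_dsel)
print(ola_faiss.score(X_dsel, y_dsel))
```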

We have plans to make this search even faster using the different approximate nearest-neighbor algorithms available in Faiss (see issue #140).

sara-eb commented 5 years ago

Dear @Menelau, thanks a lot for your prompt response.

Sure, I will try Faiss; hopefully it works for my research problem.