giotto-ai / giotto-tda

A high-performance topological machine learning toolbox in Python
https://giotto-ai.github.io/gtda-docs

Add metaestimators subpackage with CollectionTransformer meta-estimator #495

Closed ulupo closed 3 years ago

ulupo commented 4 years ago

Reference issues/PRs Implements the first proposal in #490.

Description A CollectionTransformer meta-estimator is added which converts any transformer instance into a fit-transformer acting on collections. This is exactly the first proposal in #490. It is placed in a new subpackage called metaestimators.
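As a rough illustration of the idea only (a hypothetical minimal sketch, not the actual giotto-tda implementation), a wrapper of this kind maps an independent clone of the wrapped transformer over each input in a collection:

```python
from copy import deepcopy


class CollectionTransformerSketch:
    """Hypothetical sketch of the meta-estimator idea: wrap a transformer
    meant for a single input so that fit_transform applies an independent
    clone of it to each input in a collection."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Fitting is a no-op here; cloning happens per input at transform time
        return self

    def fit_transform(self, X, y=None):
        # One fresh clone of the wrapped transformer per input in the collection
        return [deepcopy(self.transformer).fit_transform(x) for x in X]


class Doubler:
    """Toy per-input transformer, used only for the demo."""

    def fit_transform(self, x):
        return [2 * v for v in x]


print(CollectionTransformerSketch(Doubler()).fit_transform([[1, 2], [3, 4]]))
# [[2, 4], [6, 8]]
```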

gtauzin commented 4 years ago

Thanks Umbe! What happened to putting it in a metaestimators submodule? Also, I would like to make alternative proposals for the name:

ulupo commented 4 years ago

> Thanks Umbe! What happened to putting it in a metaestimators submodule?

I'm not against this, but I saw that scikit-learn has the same structure with a metaestimators file in utils. I sort of see this as more of a util-type feature than an original metaestimator of ours (contrasted with MapperPipeline which is a metaestimator made with original components), so it made sense to me to put it in such a place. But we don't have to follow scikit-learn in everything.

The origin of the name is from https://www.neuraxle.org/stable/api/neuraxle.steps.loop.html#neuraxle.steps.loop.ForEachDataInput which seems to be a similar thing. But CollectionTransformer works perfectly well.

ulupo commented 4 years ago

@gtauzin I've implemented both the renaming and the creation of the metaestimators subpackage.

ulupo commented 4 years ago

@gtauzin thanks for the review! Let me address both your comments concerning joblib backends in one go. I believe this is a good feature and should be kept, from our experience based on ParallelClustering in gtda.mapper.cluster. ParallelClustering offers this choice and, when using certain algorithms (e.g. DBSCAN) passing 'threading' can improve performance a huge amount relative to the default backend or to 'processes'. Context managers are annoying to use and they are also global, which is not ideal in a pipeline (different steps might benefit from different kinds of parallelism). So I think this is a good feature for both performance and usability reasons.
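For context, joblib allows the backend preference to be set either locally, per `Parallel` call, via `prefer=`, or globally via the `parallel_backend` context manager. A small sketch of the two styles (toy workload, not our actual code) illustrates why the per-call form composes better in a pipeline:

```python
from joblib import Parallel, delayed, parallel_backend


def square(i):
    # Toy workload standing in for a per-sample fit_transform call
    return i * i


# Per-call hint: local to this one Parallel invocation, so different
# pipeline steps can use different kinds of parallelism
out_local = Parallel(n_jobs=2, prefer="threads")(
    delayed(square)(i) for i in range(4)
)

# Context manager: global within the block, affecting every Parallel
# call made inside it (including ones buried in other steps)
with parallel_backend("threading"):
    out_global = Parallel(n_jobs=2)(delayed(square)(i) for i in range(4))

print(out_local, out_global)  # [0, 1, 4, 9] [0, 1, 4, 9]
```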

gtauzin commented 4 years ago

> @gtauzin thanks for the review! Let me address both your comments concerning joblib backends in one go. I believe this is a good feature and should be kept, from our experience based on ParallelClustering in gtda.mapper.cluster. ParallelClustering offers this choice and, when using certain algorithms (e.g. DBSCAN) passing 'threading' can improve performance a huge amount relative to the default backend or to 'processes'. Context managers are annoying to use and they are also global, which is not ideal in a pipeline (different steps might benefit from different kinds of parallelism). So I think this is a good feature for both performance and usability reasons.

I am very surprised to read that. In the link that you cite, it is written:

> “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).

Pure-Python code cannot run in parallel threads because of the GIL; bound C/C++ code sometimes can, but only if it explicitly releases the GIL. NumPy releases it only on really big calls, and if we fell into that category we would not have much control over it. I strongly doubt we do. This is also why sklearn does not give the user control over the `prefer` parameter.

Note that the user can always select their own backend. That's how dask_ml works: `with joblib.parallel_backend('dask'):`

ulupo commented 4 years ago

A fair amount of scikit-learn code is written in Cython, and some of it releases the GIL (`with nogil`, etc.). This was already true in 2014; see Grisel's answer to this Stack Overflow question: https://stackoverflow.com/questions/7391427/does-scikit-learn-release-the-python-gil. I am quite confident about the performance boosts with DBSCAN in mapper, as I profiled it extensively at the time. I can provide reproducing code. Furthermore, I repeat my point that context managers (i.e. the `with` call you mentioned, which I had already referred to in my previous message) are annoying and cannot easily be localised to a single step in a pipeline. What do you think?

ulupo commented 4 years ago

My interpretation is that scikit-learn does not expose `prefer` to the user because, for a given algorithm, there is typically one backend that is best in most cases. But we don't know in advance what transformer will be passed, and hence whether a different backend will be beneficial. In scikit-learn, 'threads' is sometimes chosen for the user without their knowledge; see https://github.com/scikit-learn/scikit-learn/blob/06d6f8a9f6e3d847b9db79a5fdebe64f78deef7f/sklearn/neighbors/_base.py, https://github.com/scikit-learn/scikit-learn/blob/06d6f8a9f6e3d847b9db79a5fdebe64f78deef7f/sklearn/ensemble/_forest.py, etc. The fact that nearest-neighbours code can use threads is perhaps linked to my observations with DBSCAN.

gtauzin commented 4 years ago

Yes, I agree that sklearn may do it for multithreaded pieces of code they wrote themselves and for which they released the GIL. But AFAIK, they don't expose `prefer` to the user.

But let me rephrase my point: in the current usage, `parallel_backend_prefer` is passed to our `Parallel` call, which is applied to functions that never release the GIL. DBSCAN may be faster with `prefer="threads"`, but we are parallelizing a loop over calls of DBSCAN on a collection of inputs. That loop never releases the GIL and is never multithreaded. Maybe I misunderstood, but as we are offering parallelization at the collection scale, that won't work.

ulupo commented 4 years ago

> But AFAIK, they don't expose prefer to the user.

I agree, and proposed an explanation in my previous comment.

You seem not to believe/engage with my claim about DBSCAN and mapper. My understanding is that we are doing precisely that there: looping calls of DBSCAN over different inputs (see https://github.com/giotto-ai/giotto-tda/blob/2061e29e08457f8746dd1808ab8d503060212a24/gtda/mapper/cluster.py#L140, where the idea is that X_tot is always the full dataset, but mask is just a boolean mask selecting the sub-portions on which DBSCAN is actually run). I'll work on reproducing code!
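A possible shape for that reproducing code (a hypothetical sketch with synthetic data and arbitrary parameters, mirroring the mask-based pattern in ParallelClustering, not a benchmark):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
X_tot = rng.normal(size=(500, 2))  # full dataset, shared by every call
# Boolean masks selecting the sub-portions DBSCAN is actually run on
masks = [rng.random(500) < 0.5 for _ in range(8)]


def cluster_portion(mask):
    # Each call clusters only the rows of X_tot selected by the mask
    return DBSCAN(eps=0.3).fit_predict(X_tot[mask])


# With 'threads', DBSCAN's compiled neighbor searches can release the GIL,
# which is where the claimed speed-up over process-based backends comes from
labels = Parallel(n_jobs=2, prefer="threads")(
    delayed(cluster_portion)(m) for m in masks
)
print(len(labels), labels[0].shape)
```

Timing this against `prefer="processes"` on a large enough dataset is what would settle the question.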

gtauzin commented 4 years ago

I think we should discuss that more over lunch xD

ulupo commented 4 years ago

Wise idea!! :D