Closed ulupo closed 3 years ago
Thanks Umbe! What happened to putting it in a `metaestimators` submodule?

Also, I would like to make alternative proposals for the name:
> Thanks Umbe! What happened to putting it in a `metaestimators` submodule?
I'm not against this, but I saw that scikit-learn has the same structure, with a `metaestimators` file in `utils`. I sort of see this as more of a util-type feature than an original metaestimator of ours (contrasted with `MapperPipeline`, which is a metaestimator made with original components), so it made sense to me to put it in such a place. But we don't have to follow scikit-learn in everything.
The origin of the name is https://www.neuraxle.org/stable/api/neuraxle.steps.loop.html#neuraxle.steps.loop.ForEachDataInput, which seems to be a similar thing. But `CollectionTransformer` works perfectly well.
@gtauzin I've implemented both the renaming and the creation of the `metaestimators` subpackage.
@gtauzin thanks for the review! Let me address both your comments concerning joblib backends in one go. I believe this is a good feature and should be kept, from our experience with `ParallelClustering` in `gtda.mapper.cluster`. `ParallelClustering` offers this choice and, when using certain algorithms (e.g. `DBSCAN`), passing `'threading'` can improve performance a huge amount relative to the default backend or to `'processes'`. Context managers are annoying to use and they are also global, which is not ideal in a pipeline (different steps might benefit from different kinds of parallelism). So I think this is a good feature for both performance and usability reasons.
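To make the usability point concrete, here is a minimal sketch of the idea of forwarding a per-step backend preference to joblib instead of relying on a global context manager. The names (`transform_collection`, `_fit_transform_one`) are hypothetical and not the PR's actual API; only joblib's `Parallel(prefer=...)` keyword is real.

```python
from joblib import Parallel, delayed


def _fit_transform_one(transformer, x):
    # Each member of the collection gets its own fit_transform call
    return transformer.fit_transform(x)


def transform_collection(transformer_factory, X, n_jobs=None, prefer=None):
    """Hypothetical sketch: apply a fresh transformer to each member of a
    collection, forwarding a per-call joblib backend preference ('threads'
    or 'processes') rather than using a global parallel_backend context."""
    return Parallel(n_jobs=n_jobs, prefer=prefer)(
        delayed(_fit_transform_one)(transformer_factory(), x) for x in X
    )
```

Because `prefer` is a plain parameter of the step, two steps in the same pipeline can use different kinds of parallelism without any surrounding `with` block.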
I am very surprised to read that. In the link that you cite, it is written:

> “threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy).
Python does not support multithreading, but the bound C/C++ code sometimes can, and then you have to release the GIL. NumPy can, but only on really big calls, and if we fell into that category we would not have much control over it. I strongly doubt we do. This is also why scikit-learn does not give you control over the `prefer` parameter.
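The crux of the disagreement is whether the GIL is released during the parallelized calls. A stdlib-only illustration of that mechanism (using `time.sleep`, which releases the GIL, as a stand-in for a compiled extension such as a Cython `with nogil` loop; this is not giotto-tda code):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def gil_releasing_task():
    # time.sleep releases the GIL, standing in for a compiled extension
    # during which other Python threads can make progress
    time.sleep(0.2)


def elapsed_with_threads(n_tasks=4):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        for _ in range(n_tasks):
            pool.submit(gil_releasing_task)
    # exiting the with-block waits for all tasks to finish
    return time.perf_counter() - start
```

Four 0.2 s tasks complete in roughly 0.2 s because the GIL is released; a pure-Python busy loop in place of `time.sleep` would serialize and take roughly four times as long.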
Note that the user can always select their own backend. That's how dask_ml works:

```python
with joblib.parallel_backend('dask'):
```
A fair amount of scikit-learn code is written in Cython, and some of it releases the GIL (`with nogil`, etc.). This was already true in 2014; see Grisel's answer to this StackExchange question: https://stackoverflow.com/questions/7391427/does-scikit-learn-release-the-python-gil. I am quite confident about the performance boosts with `DBSCAN` in Mapper, as I profiled it extensively at the time. I can provide reproducing code. Furthermore, I repeat my point that context managers (i.e. the `with` call you mentioned, which I had already referred to in my previous message) are annoying and cannot be localised easily to a single step in a pipeline. What do you think?
My interpretation is that scikit-learn does not expose `prefer` to the user because there would typically be one backend that is best in most cases for a given algorithm. But we don't know in advance what transformer will be passed, and hence whether a different backend will be beneficial. In scikit-learn, `'threads'` is sometimes chosen for the user without their knowledge; see https://github.com/scikit-learn/scikit-learn/blob/06d6f8a9f6e3d847b9db79a5fdebe64f78deef7f/sklearn/neighbors/_base.py, https://github.com/scikit-learn/scikit-learn/blob/06d6f8a9f6e3d847b9db79a5fdebe64f78deef7f/sklearn/ensemble/_forest.py, etc. The fact that nearest-neighbours can use threads is perhaps linked to my observations with `DBSCAN`.
Yes, I agree that sklearn may do it on multithreaded pieces of code they wrote and for which they released the GIL. But AFAIK, they don't expose `prefer` to the user.
But let me rephrase my point: in the current use, `parallel_backend_prefer` is applied to our `Parallel` call, which wraps functions on which the GIL will never be released. `DBSCAN` may be faster with `prefer="threads"` internally, but we are parallelizing a loop over calls of `DBSCAN` for a collection of inputs. That loop never releases the GIL and is never multithreaded. Maybe I misunderstood, but since we are offering parallelization at the collection scale, that won't work.
> But AFAIK, they don't expose `prefer` to the user.
I agree, and proposed an explanation in my previous comment.
You seem not to believe/engage with my claim about `DBSCAN` and Mapper. My understanding is that we are precisely looping calls of `DBSCAN` over different inputs there (see https://github.com/giotto-ai/giotto-tda/blob/2061e29e08457f8746dd1808ab8d503060212a24/gtda/mapper/cluster.py#L140, where the idea is that `X_tot` is always the full dataset, but `mask` is just a boolean mask selecting subportions which `DBSCAN` is actually run on). I'll work on reproducing code!
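The masked-looping pattern described above can be sketched as follows. A toy clusterer stands in for `DBSCAN`, and `cluster_masked_portions` is a hypothetical name, not the actual `ParallelClustering` code:

```python
import numpy as np


class _ToyClusterer:
    """Stand-in for DBSCAN, for illustration only: labels every point 0."""

    def fit_predict(self, X):
        return np.zeros(len(X), dtype=int)


def cluster_masked_portions(X_tot, masks, clusterer_factory=_ToyClusterer):
    # X_tot is always the full dataset; each boolean mask selects the
    # subportion the clusterer is actually run on, mirroring the pattern
    # described for ParallelClustering in gtda.mapper.cluster
    return [clusterer_factory().fit_predict(X_tot[mask]) for mask in masks]
```

The question under debate is exactly whether this outer loop can benefit from `'threading'` when the clusterer's own inner loops release the GIL.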
I think we should discuss that more over lunch xD
Wise idea!! :D
Reference issues/PRs

Implements the first proposal in #490.

Types of changes

Description

A `CollectionTransformer` meta-estimator is added which converts any transformer instance into a fit-transformer acting on collections. This is exactly the first proposal in #490. It is placed in a new subpackage called `metaestimators`.

Checklist

- I have run `flake8` to check my Python changes.
- I have run `pytest` to check this on the Python tests.