scikit-learn-contrib / metric-learn

Metric learning algorithms in Python
http://contrib.scikit-learn.org/metric-learn/
MIT License

Out of core computations #137

Open wdevazelhes opened 5 years ago

wdevazelhes commented 5 years ago

Description

One next step for the project is to support out-of-core learning and prediction, i.e. being able to fit models and predict when the data is too big to fit in memory and can only be loaded in small batches.

Some possibilities for out of core learning:

  1. Add a partial_fit method: users build the batches manually and loop over them, calling estimator.partial_fit(batch) on each batch.
    • Cross-validation would have to be done manually, because we don't know all the pairs in advance (hence cannot build several train/test splits automatically)
    • For tuples learners, a generated "batch" would be a batch of pairs or of indices
  2. Similar solution:
  3. For tuples learners, we could also take the whole X and the tuple indices as input, and form the tuples batch by batch inside the algorithm
    • We could use scikit-learn cross-validation here because we would know in advance all the pairs
  4. Use of Dask arrays
    • There we would already know the size of the dataset (the length of the Dask array), so we could for instance do cross-validation automatically, while the arrays would still be processed batch by batch automatically.
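
To make options 1 and 3 more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: ToyPairsLearner and iter_pair_batches are illustrative names, not part of metric-learn, and the "learner" only keeps a running per-feature mean of squared differences over similar pairs. It is meant to show the shape of the API, not an actual metric learning algorithm.

```python
from typing import Iterator, List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[Point, Point]


class ToyPairsLearner:
    """Hypothetical stand-in for a pairs learner (NOT a metric-learn class).

    It tracks the per-feature mean of squared differences over similar
    pairs, which is enough to illustrate incremental updates.
    """

    def __init__(self) -> None:
        self.n_seen_ = 0
        self.mean_sq_diff_: List[float] = []

    def partial_fit(self, pairs: Sequence[Pair], y: Sequence[int]) -> "ToyPairsLearner":
        for (a, b), label in zip(pairs, y):
            if label != 1:  # only similar pairs update the statistic
                continue
            sq = [(ai - bi) ** 2 for ai, bi in zip(a, b)]
            if not self.mean_sq_diff_:
                self.mean_sq_diff_ = [0.0] * len(sq)
            self.n_seen_ += 1
            # Incremental mean update: no earlier batch needs to stay in memory.
            self.mean_sq_diff_ = [
                m + (s - m) / self.n_seen_ for m, s in zip(self.mean_sq_diff_, sq)
            ]
        return self


def iter_pair_batches(
    X: Sequence[Point],
    pair_indices: Sequence[Tuple[int, int]],
    y: Sequence[int],
    batch_size: int,
) -> Iterator[Tuple[List[Pair], Sequence[int]]]:
    """Option 3: build the pairs lazily from X and index tuples, batch by batch."""
    for start in range(0, len(pair_indices), batch_size):
        idx = pair_indices[start:start + batch_size]
        pairs = [(X[i], X[j]) for i, j in idx]
        yield pairs, y[start:start + batch_size]


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
pair_indices = [(0, 1), (0, 2), (1, 3), (2, 3)]
y = [1, 1, -1, 1]  # 1 = similar, -1 = dissimilar

learner = ToyPairsLearner()
# Option 1: the user writes the loop over batches explicitly.
for pairs, labels in iter_pair_batches(X, pair_indices, y, batch_size=2):
    learner.partial_fit(pairs, labels)
```

Since pair_indices is known up front in option 3, the index tuples could be split into train/test folds before any pair is materialized, which is what would make automatic cross-validation possible there.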

(More details to be added)

wdevazelhes commented 5 years ago

@wdevazelhes Basically input_fn is a function that returns a dictionary of features and targets. It is exposed to users so that they can transform their batches of data to avoid feeding the entire dataset at once and to speed up (asynchronous) training with (distributed) data pipeline. It's a function that gets called that provides batches of data to each fit step so that more flexible data types can be fed in, e.g. iterators or streaming input. For input that is not a function, there are utility methods such as numpy_input_fn and pandas_input_fn that can be used as input_fn, e.g.

@terrytangyuan I'm answering your comment here, because I guess it is related to this issue. It is interesting that TensorFlow offers this possibility to provide streaming batches. I've added it to the list above (solution 2). If I understand correctly, the advantage over solution 1 (scikit-learn-like partial_fit) is that users wouldn't have to write the loop over batches and call partial_fit themselves, right? They could just call fit(input_fn) with this function as input, and inside, the algorithm would know to call input_fn to generate batches, right? And indeed, this would allow the algorithm to run asynchronous/parallel iterations (which I guess would be difficult with partial_fit, because users would have to handle the multiprocessing etc. themselves).
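
A rough sketch of what fit(input_fn) could look like, to compare with the explicit partial_fit loop. This is an assumption about the design, not TensorFlow's API and not anything that exists in metric-learn: RunningMean and make_input_fn are made-up names, and input_fn here is simply a callable that returns an iterator of (features, targets) batches.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Batch = Tuple[List[List[float]], List[int]]  # (features, targets)


class RunningMean:
    """Hypothetical toy estimator: tracks each feature's mean incrementally."""

    def __init__(self) -> None:
        self.n_seen_ = 0
        self.mean_: List[float] = []

    def partial_fit(self, X: List[List[float]], y: Optional[List[int]] = None) -> "RunningMean":
        for row in X:
            if not self.mean_:
                self.mean_ = [0.0] * len(row)
            self.n_seen_ += 1
            self.mean_ = [m + (v - m) / self.n_seen_ for m, v in zip(self.mean_, row)]
        return self

    def fit(self, input_fn: Callable[[], Iterable[Batch]]) -> "RunningMean":
        """Pull batches from input_fn; the loop lives inside the estimator."""
        self.n_seen_, self.mean_ = 0, []
        for X_batch, y_batch in input_fn():
            self.partial_fit(X_batch, y_batch)
        return self


def make_input_fn(n_rows: int, batch_size: int) -> Callable[[], Iterable[Batch]]:
    """Build an input_fn that generates synthetic batches on demand."""

    def input_fn() -> Iterable[Batch]:
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            X = [[float(i), float(2 * i)] for i in range(start, stop)]
            y = [i % 2 for i in range(start, stop)]
            yield X, y

    return input_fn


# The user never writes the batch loop: fit() drives input_fn itself.
model = RunningMean().fit(make_input_fn(n_rows=10, batch_size=3))
```

Because the estimator owns the loop in this design, it would also be free to prefetch or parallelize batch generation internally, which is the part that seems hard to offer with a user-driven partial_fit loop.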