Out of core computations

scikit-learn-contrib / metric-learn

Metric learning algorithms in Python

MIT License

Description

One next step for the project is to support out-of-core learning and prediction, i.e. being able to fit models and predict when the data is too big to fit in memory and can only be loaded in small batches.

Some possibilities for out of core learning:

1. Make a partial_fit function, build the batches manually, and loop over all batches: estimator.partial_fit(batch).
   - We would have to do cross-validation manually because we don't know all the pairs in advance (hence cannot make several train/test splits automatically).
   - For tuples learners, a generated "batch" would be a "batch" of pairs or indices.
2. Similar solution:
   - We could allow the input to be a function that yields batches; when fit is called, the estimator would automatically load one batch at a time (by calling this function) for each iteration. See the comment below: https://github.com/metric-learn/metric-learn/issues/137#issuecomment-446906001
3. For tuples learners, we could also take all of X and the tuple indices as input, and form the tuples batch by batch inside the algorithm.
   - We could use scikit-learn cross-validation here because we would know all the pairs in advance.
4. Use Dask arrays.
   - Here we would already know the size of the dataset (the length of the Dask array), hence could do cross-validation automatically, for instance, and the arrays would still be processed in batches automatically.
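
As a rough illustration of the tuples-learner option above (all of X plus tuple indices as input, with tuples formed batch by batch), here is a minimal sketch. The helper name `iter_pair_batches` and the toy data are hypothetical and not part of metric-learn; X is a plain NumPy array here, but the assumption is that it could equally be something like a `numpy.memmap` backed by a file on disk.

```python
import numpy as np


def iter_pair_batches(X, pair_indices, batch_size):
    """Yield arrays of shape (b, 2, n_features), one batch of pairs at a time.

    Hypothetical helper: only the rows of X referenced by the current batch
    of index pairs are materialised, never the full set of pairs.
    """
    pair_indices = np.asarray(pair_indices)
    for start in range(0, len(pair_indices), batch_size):
        idx = pair_indices[start:start + batch_size]  # shape (b, 2)
        # Fancy indexing with a (b, 2) index array returns (b, 2, n_features).
        yield X[idx]


# Toy data: 6 points with 2 features, and 5 pairs given as row indices into X.
X = np.arange(12, dtype=float).reshape(6, 2)
pair_indices = np.array([[0, 1], [2, 3], [4, 5], [0, 5], [1, 4]])

batches = list(iter_pair_batches(X, pair_indices, batch_size=2))
```

Because every pair index is known up front, the index array itself could be split into train/test folds before any data is loaded, which is what would make automatic cross-validation possible with this option.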

(More details to be added)

@wdevazelhes Basically, input_fn is a function that returns a dictionary of features and targets. It is exposed to users so that they can transform their batches of data, avoid feeding the entire dataset at once, and speed up (asynchronous) training with a (distributed) data pipeline. It gets called to provide batches of data to each fit step, so that more flexible data types can be fed in, e.g. iterators or streaming input. For input that is not a function, there are utility methods such as numpy_input_fn and pandas_input_fn that can be used as input_fn, e.g.

@terrytangyuan I am answering your comment here because I think it is related to this issue. It is interesting that TensorFlow offers this possibility of providing streaming batches. I've added it to the list above (solution 2.). If I understand correctly, the advantage w.r.t. solution 1. (scikit-learn-like partial_fit) is that users wouldn't have to loop over the batches and call partial_fit themselves, right? They could just call fit(input_fn) with this function as input, and inside, the algorithm would know to call input_fn to generate batches, right? And indeed, this would allow the algorithm to run asynchronous/parallel iterations (which I guess would be difficult with partial_fit, because users would need to handle the multiprocessing etc. themselves).
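
To make the comparison between the two solutions concrete, here is a minimal sketch assuming a toy learner that only accumulates a running covariance. The class `StreamingCovariance` and the helper `make_input_fn` are hypothetical names for illustration; they are neither metric-learn's nor TensorFlow's API. The manual loop corresponds to solution 1., and `fit(input_fn)` to solution 2.

```python
import numpy as np


class StreamingCovariance:
    """Hypothetical toy learner that accumulates a covariance batch by batch."""

    def __init__(self):
        self.n_ = 0
        self.sum_ = None
        self.outer_ = None

    def partial_fit(self, batch):
        # Solution 1: the caller loops over batches and calls this each time.
        batch = np.asarray(batch, dtype=float)
        if self.sum_ is None:
            d = batch.shape[1]
            self.sum_ = np.zeros(d)
            self.outer_ = np.zeros((d, d))
        self.n_ += batch.shape[0]
        self.sum_ += batch.sum(axis=0)
        self.outer_ += batch.T @ batch
        return self

    def fit(self, input_fn):
        # Solution 2: the estimator calls input_fn itself and consumes batches.
        self.n_, self.sum_, self.outer_ = 0, None, None
        for batch in input_fn():
            self.partial_fit(batch)
        return self

    @property
    def covariance_(self):
        # Biased (divide by n) covariance from the accumulated statistics.
        mean = self.sum_ / self.n_
        return self.outer_ / self.n_ - np.outer(mean, mean)


def make_input_fn(X, batch_size):
    """Return a zero-argument function that yields consecutive batches of X."""
    def input_fn():
        for start in range(0, len(X), batch_size):
            yield X[start:start + batch_size]
    return input_fn


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

# Solution 1: the user writes the loop over batches.
manual = StreamingCovariance()
for batch in make_input_fn(X, batch_size=100)():
    manual.partial_fit(batch)

# Solution 2: the user passes the function and fit does the looping.
auto = StreamingCovariance().fit(make_input_fn(X, batch_size=100))
```

In this sketch both paths end up with the same statistics; the difference is only who owns the loop, which is what would let an estimator schedule batches asynchronously or in parallel in the second case.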
