wdevazelhes opened this issue 5 years ago (Open)
@wdevazelhes Basically, `input_fn` is a function that returns a dictionary of features and targets. It is exposed to users so that they can transform their batches of data, avoid feeding the entire dataset at once, and speed up (asynchronous) training with a (distributed) data pipeline. It is called at each fit step to provide a batch of data, which allows more flexible data types to be fed in, e.g. iterators or streaming input. For input that is not a function, there are utility methods such as `numpy_input_fn` and `pandas_input_fn` that build an `input_fn` from in-memory data, e.g.:
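(A minimal sketch of how `numpy_input_fn` is typically used with the TF 1.x Estimator API; the arrays below are placeholders and not from this thread.)

```python
import numpy as np
import tensorflow as tf

# Placeholder data, for illustration only.
X = np.random.rand(1000, 10).astype(np.float32)
y = np.random.randint(0, 2, size=1000)

# numpy_input_fn wraps in-memory arrays into a function that the Estimator
# calls to get batches, instead of receiving the full dataset at once.
# (TF 1.x path shown; in TF 2.x it lives under tf.compat.v1.estimator.inputs.)
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X},
    y=y,
    batch_size=128,
    num_epochs=None,  # cycle indefinitely; the training loop decides when to stop
    shuffle=True,
)

# estimator.train(input_fn=train_input_fn, steps=1000)
```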
@terrytangyuan I'm answering your comment here, because I guess it is related to this issue. It is interesting that TensorFlow offers this possibility of providing streaming batches; I've added it to the list above (solution 2.). If I understand correctly, the advantage w.r.t. solution 1. (a scikit-learn-like `partial_fit`) is that users wouldn't have to write the loop over batches and call `partial_fit` themselves, right? They could just call `fit(input_fn)` with this function as input, and internally the algorithm would know to call `input_fn` to generate batches, right? And indeed, this would allow the algorithm to make asynchronous/parallel iterations (which I guess would be difficult with `partial_fit`, because users would need to deal with the multiprocessing themselves, etc.).
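For reference, here is a rough sketch of the two calling patterns being compared. Neither `partial_fit` nor `fit(input_fn)` exists in metric-learn today, so the estimator, helper, and file names below are hypothetical.

```python
import numpy as np

def make_input_fn(path, batch_size=256):
    """Return a callable that yields one (X_batch, y_batch) per call,
    without loading the whole file into memory (memory-mapped)."""
    data = np.load(path, mmap_mode="r")

    def input_fn():
        idx = np.random.randint(0, len(data), size=batch_size)
        batch = np.asarray(data[idx])
        return batch[:, :-1], batch[:, -1].astype(int)

    return input_fn

# Solution 1: the user writes the loop (scikit-learn-style partial_fit).
# estimator = SomeMetricLearner()                 # hypothetical estimator
# input_fn = make_input_fn("big_dataset.npy")     # hypothetical file
# for _ in range(n_iter):
#     X_batch, y_batch = input_fn()
#     estimator.partial_fit(X_batch, y_batch)

# Solution 2: the estimator drives the loop and calls input_fn itself,
# which leaves room for asynchronous/parallel consumption of batches.
# estimator.fit(input_fn=make_input_fn("big_dataset.npy"), max_iter=n_iter)
```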
Description
One next step for the project is to be able to do out-of-core learning/predicting: i.e. being able to fit/predict models when we cannot load the data all at once but only in small batches, because the data is too big to fit in memory.
Some possibilities for out of core learning:
1. Expose a `partial_fit` function, make batches manually, and loop over them: `for batch in all_batches: estimator.partial_fit(batch)` (see the sketch after this list).
2. Pass a batch-generating function (like TensorFlow's `input_fn`) to `.fit`, which loads one batch at a time (by calling this function) at each iteration. See the comment below: https://github.com/metric-learn/metric-learn/issues/137#issuecomment-446906001 (More details to be added.)
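A rough, self-contained sketch of how the two options could fit together in one estimator: `partial_fit` performs a single incremental update on a batch (option 1), and `fit(input_fn)` simply drives that loop internally (option 2). The toy update rule (a plain gradient step on a linear transformation `L`) is illustrative only and is not the implementation of any existing metric-learn estimator.

```python
import numpy as np


class ToyStreamingMetricLearner:
    """Illustrative only: learns a linear transformation L from batches of pairs."""

    def __init__(self, n_features, lr=0.01, max_iter=1000):
        self.L_ = np.eye(n_features)
        self.lr = lr
        self.max_iter = max_iter

    def partial_fit(self, pairs, y):
        """Option 1: one incremental update.

        pairs: array of shape (n, 2, d); y: +1 for similar pairs, -1 for dissimilar.
        """
        diffs = pairs[:, 0] - pairs[:, 1]        # (n, d) pair differences
        proj = diffs @ self.L_.T                 # differences in the learned space
        # Gradient of y * ||L (x - x')||^2 w.r.t. L, averaged over the batch:
        # pulls similar pairs together, pushes dissimilar pairs apart.
        grad = (y[:, None, None] * proj[:, :, None] * diffs[:, None, :]).mean(axis=0)
        self.L_ -= self.lr * grad
        return self

    def fit(self, input_fn, max_iter=None):
        """Option 2: the estimator pulls one batch per iteration from input_fn."""
        for _ in range(max_iter or self.max_iter):
            pairs, y = input_fn()
            self.partial_fit(pairs, y)
        return self

    def transform(self, X):
        return X @ self.L_.T
```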