tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

Learning data too big to fit in memory at once, how to learn? #7801

Open Sir-hennihau opened 1 year ago

Sir-hennihau commented 1 year ago

I have the problem that my dataset has become too large to fit in memory at once in TensorFlow.js. What are good ways to train on all data entries? My data comes from a MongoDB instance and needs to be loaded asynchronously.

I tried to play with generator functions, but couldn't get async generators to work yet. I was also thinking that maybe fitting the model to the data in batches would be possible?

It would be great if someone could provide me with a minimal example on how to fit on data that is loaded asynchronously through either batches or a database cursor.

For example when trying to return promises from the generator, I get a typescript error.

    const generate = function* () {
        yield new Promise(() => {});
    };

    tf.data.generator(generate);

Argument of type '() => Generator<Promise<unknown>, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

Also, you can't use async generators. This is the error that happens if you try to:

tf.data.generator(async function* () {})

throws

Argument of type '() => AsyncGenerator<any, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

gaikwadrahul8 commented 1 year ago

Hi, @Sir-hennihau

Thank you for bringing this issue to our attention. As far as I know, you can use tf.data.generator or tf.data.Dataset with either .batch or .prefetch; please also refer to this answer from Stack Overflow. Could you give it a try and let us know whether it resolves your issue?

If the issue still persists, please let us know. Thank you!
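
For reference, a minimal sketch of that pattern with a synchronous generator (which is what the published typings accept); the shapes, buffer sizes, and `model` here are placeholders rather than anything from this issue:

```
import * as tf from '@tensorflow/tfjs-node';

// Synchronous generator yielding one {xs, ys} example at a time.
const dataGenerator = function* () {
  for (let i = 0; i < 1000; i++) {
    yield {
      xs: tf.tensor1d([i, i + 1, i + 2]), // placeholder features
      ys: tf.tensor1d([i % 2]),           // placeholder label
    };
  }
};

const dataset = tf.data
  .generator(dataGenerator)
  .shuffle(256) // shuffle within a buffer of 256 examples
  .batch(32)    // group single examples into batches of 32
  .prefetch(1); // prepare the next batch while the current one trains

// model is assumed to be an already-compiled tf.LayersModel.
await model.fitDataset(dataset, { epochs: 5 });
```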

Sir-hennihau commented 1 year ago

Hey @gaikwadrahul8, I tried to play around with the functions that you suggested but couldn't find success yet, unfortunately.

The code snippet from Stack Overflow results in a TypeScript error, because it says that async generators are not assignable. I tried to play around a bit with //@ts-ignore, but couldn't get it to work yet. I also can't find an example, in the documentation or online, where the dataset is populated by loading data from the network using async/await.

Just for completeness, the snippet from Stack Overflow

```
const dataset = tf.data.generator(async function* () {
    const dataToDownload = await fetch(/* ... */);
    while (/* ... */) {
        const moreData = await fetch(/* ... */);
        yield moreData;
    }
});
```

throws Argument of type '() => AsyncGenerator<any, void, unknown>' is not assignable to parameter of type '() => Iterator<TensorContainer, any, undefined> | Promise<Iterator<TensorContainer, any, undefined>>'.

At that point, I don't even know if the implementation or the typings are wrong.

Can you maybe take a look and try to produce a minimal working example where the dataset uses data loaded via async/await from some remote source? It would be highly appreciated and would help move this problem forward.

Sir-hennihau commented 11 months ago

Bump, I still haven't found a satisfying solution to this problem :s @mattsoulanille :D

Antony-Lester commented 9 months ago

@Sir-hennihau I have just started facing the same issue

(linking MongoDB to TensorFlow with batched data using TypeScript and tfjs-node-gpu), as I have started hitting the V8 heap limit.

If I find a solution/workaround I will share it within a week or so.

Sir-hennihau commented 9 months ago

@Antony-Lester any news?

Antony-Lester commented 9 months ago

@Sir-hennihau The only working solution I have found so far is to move to incremental batch training; I can't speak to its accuracy as I don't have a baseline for comparison.

So I hold all of the validation data and one chunk's worth of training data in heap memory at a time.

```
const { trainCount } = await countDataPoints(db)
const batchSize = Math.ceil(trainCount / 100)
const validationDataResult = await validationData(db)
const totalBatches = Math.ceil(trainCount / batchSize)
const trainDataPipelineArray = await trainDataPipeline(db)

// Train model incrementally, one chunk of the collection at a time.
for (let i = 0; i < trainCount; i += batchSize) {
    // Page through the collection with $skip/$limit so only one chunk is in memory.
    const batchPipeline = [...trainDataPipelineArray, { $skip: i }, { $limit: batchSize }]
    const data = await db.collection('myCollection').aggregate(batchPipeline).toArray()
    const metricsData = data.map(item => item.metrics)
    const xs = tf.tensor2d(metricsData, [metricsData.length, metricsData[0].length])
    const resultData = data.map(item => item.result)
    const ys = tf.tensor2d(resultData, [resultData.length, 1])
    await model.fit(xs, ys, {
        epochs: epochs,
        validationData: validationDataResult,
        shuffle: true,
        batchSize: 64,
        callbacks: [],
        verbose: 1,
    })
    // Free the chunk's tensors before loading the next one.
    xs.dispose()
    ys.dispose()
}
```

From copilot:
Advantages of Incremental Batch Training:

Memory Efficiency: It's more memory-efficient as it only needs to load a small batch into memory, which is beneficial when dealing with large datasets that can't fit into memory.

Speed: It can lead to faster convergence because the model parameters are updated more frequently.

Noise: The noise in the gradient estimation can sometimes help escape shallow local minima, leading to better solutions.

Real-time Learning: It allows the model to learn from new data on-the-go without retraining from scratch.

Disadvantages of Incremental Batch Training:

Less Accurate Gradient Estimation: The gradient estimation can be less accurate because it's based on fewer examples.

Hyperparameter Sensitivity: It's more sensitive to the choice of learning rate and batch size.

Less Stable: The cost function is not guaranteed to decrease every step, and the final parameters can depend on the initial parameters (i.e., the solution can be non-deterministic).

Sir-hennihau commented 7 months ago

Thanks @Antony-Lester. In the meantime I went with first converting my data into a CSV file and then using the CSV learning methods from tfjs. It's a shame that this seems to be needed. CSV learning seems to be nicely implemented, though. On very large datasets this is very storage-inefficient, however.
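
For reference, a rough sketch of that CSV route with tfjs-node; the file path, the `result` label column, and `model` are placeholders, not the exact setup used here:

```
import * as tf from '@tensorflow/tfjs-node';

// tf.data.csv streams rows lazily, so the full file never has to sit in memory.
const csvDataset = tf.data.csv('file://./data.csv', {
  columnConfigs: {
    result: { isLabel: true }, // mark the label column
  },
});

const dataset = csvDataset
  // With a label column configured, each row arrives as { xs: {...}, ys: {...} }.
  .map(({ xs, ys }) => ({
    xs: Object.values(xs),
    ys: Object.values(ys),
  }))
  .batch(64);

// model is assumed to be an already-compiled tf.LayersModel.
await model.fitDataset(dataset, { epochs: 10 });
```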

Sir-hennihau commented 5 months ago

It would anyway be nice to get an answer from the maintainers on how to solve the issue without workarounds like converting the data to a CSV file first.

Antony-Lester commented 3 months ago

In the end, I spawned off Python scripts that trained the model while watching the scripts' console output. Not ideal, but I can use the whole memory now.
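
For anyone wanting to do the same from Node, a minimal sketch of spawning a training script and mirroring its console output; `train.py` and its arguments are placeholders, not the actual scripts used here:

```
import { spawn } from 'child_process';

// Spawn the Python training process; script name and flags are placeholders.
const training = spawn('python', ['train.py', '--epochs', '10']);

// Mirror the script's stdout/stderr so training progress stays visible.
training.stdout.on('data', (chunk) => process.stdout.write(chunk));
training.stderr.on('data', (chunk) => process.stderr.write(chunk));

training.on('close', (code) => {
  console.log(`training script exited with code ${code}`);
});
```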

tharvik commented 1 month ago

You can in fact simply ts-ignore the async generator; it is supported internally.

    // @ts-expect-error
    tf.data.generator(async function* () { … });

Opened #8408 to expose it.
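
A fuller sketch of that workaround, pulling chunks from some asynchronous source; `fetchBatch`, the tensor shapes, and `model` are placeholders, and the suppression comment is only needed until the typings are updated:

```
import * as tf from '@tensorflow/tfjs-node';

// Placeholder for whatever async call loads one chunk (e.g. one page of a
// MongoDB query); assumed to resolve to null once the data is exhausted.
declare function fetchBatch(
  page: number
): Promise<{ features: number[][]; labels: number[][] } | null>;

// @ts-expect-error async generators work at runtime but are not yet in the typings
const dataset = tf.data.generator(async function* () {
  let page = 0;
  while (true) {
    const batch = await fetchBatch(page++);
    if (batch === null) break;
    // Each yielded element is already one full training batch.
    yield {
      xs: tf.tensor2d(batch.features),
      ys: tf.tensor2d(batch.labels),
    };
  }
});

// model is assumed to be an already-compiled tf.LayersModel.
await model.fitDataset(dataset.prefetch(1), { epochs: 5 });
```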