tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0

Memory leak and crash, now and 2 years ago, tfjs-node #8326

Open borodadada opened 2 months ago

borodadada commented 2 months ago

The problem has not gone away: two years ago and now, the same thing happens. Back then the leak was visible in Task Manager, with the process constantly growing in memory; now that is no longer the case and memory looks fine, but the result is the same: after a while everything crashes. I managed to take screenshots at the moment it all started. The screenshots show my working environment, NOT the test I posted; the test itself is as simplified and as close to my setup as possible, and I will post results from it below. I am not doing anything in these screenshots, just capturing the state.

There are 4 identical programs (4 copies) running on the computer, and one of them begins to fail. This happens when the number of epochs reaches into the millions. For the test it is enough to run a single copy. On a modern processor the run usually takes 4-6 hours to crash; on an old one, more than a day.

The memory leak starts and progresses quickly, as seen in the screenshot: Snipaste_2024-07-06_07-28-22

The affected process; note the 3 other processes, which normally stay between 100 and 200 MB: Snipaste_2024-07-06_07-28-54

Memory is full: Snipaste_2024-07-06_07-30-42

Afterwards: Snipaste_2024-07-06_07-30-58

All Node.js processes have closed: Snipaste_2024-07-06_07-31-53

The logs are empty, there is nothing in them (the editor is open): Snipaste_2024-07-06_07-37-51

For the test, here is simple code; just copy and paste it.

TEST CODE

const tf = require('@tensorflow/tfjs-node');

const size = 50    // number of training samples
const units = 100  // input width and units per dense layer

const letsgo = async function(){

    const model = tf.sequential();
    model.add( tf.layers.dense({ inputShape: [units], units, activation: 'linear', useBias: true }));
    model.add( tf.layers.dense({ units, activation: 'linear', useBias: true }));
    model.add( tf.layers.dense({ units, activation: 'linear', useBias: true }));
    model.compile({ optimizer: tf.train.adam(0.005, 0.9, 0.999), loss: tf.losses.absoluteDifference });

    let a = []
    let b = []
    for (let i = 0; i < size; i++) {
        let aa = []
        let bb = []
        for (let ii = 0; ii < units; ii++) {
            aa.push( Math.random() )
            bb.push( Math.random() )
        }
        a.push(aa)
        b.push(bb)
    }

    let xs = tf.tensor2d( a );
    let ys = tf.tensor2d( b );

    await model.fit(xs, ys, {
        epochs: 50000000,
        shuffle: false,
        verbose: 0,
        callbacks:{
            onTrainBegin: ()=>{
                console.log('start')
            },
            onTrainEnd: ()=>{
                console.log('done')
            },
            onEpochEnd: async (epoch, logs)=>{
                if( epoch % 100000 === 0 )
                    console.log(epoch, logs.loss)
            }
        }
    })
}

const loop = async function(){
    for (let i = 0; i < 1; i++) {
        await letsgo()
    }
}

loop()
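
As a possible diagnostic (a sketch only, not part of the test above), the onEpochEnd callback could additionally log tf.memory() and Node's process.memoryUsage() to watch tensor counts and process memory while fit is running:

// Sketch: a variant of the onEpochEnd callback above that also reports
// backend tensor counts and Node.js process memory every 100000 epochs.
onEpochEnd: async (epoch, logs) => {
    if (epoch % 100000 === 0) {
        const mem = tf.memory();                 // tensors and bytes held by the tfjs backend
        const rss = process.memoryUsage().rss;   // resident set size of the Node.js process
        console.log(epoch, logs.loss, mem.numTensors, mem.numBytes, rss);
    }
}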

System information

Okay, these are the results from the test code:

Modern PC, Intel 13700: crash after 4.4 million epochs

Snipaste_2024-07-06_12-38-43

Old PC, Intel 3770: crash after 4.4 million epochs (Windows 10 x64, Node.js 20.10.0)

Snipaste_2024-07-06_16-49-30

I can't complete my calculations because the program always crashes, and I need many more epochs than this! I really hope you fix it; it's a disaster that this bug has gone unfixed for years.

gaikwadrahul8 commented 2 months ago

Hi, @borodadada

I apologize for the delay in my response, and thank you for bringing this issue to our attention. As far as I know, to avoid a memory leak you should use tf.tidy, which executes the provided function fn and, once it has executed, cleans up all intermediate tensors allocated by fn except those it returns. fn must not return a Promise (async functions are not allowed). The returned result can be a complex object.

Using this method helps avoid memory leaks. In general, wrap calls to operations in tf.tidy() for automatic memory cleanup.

NOTE: Variables do not get cleaned up inside a tidy(). If you want to dispose of variables, use tf.disposeVariables() or call dispose() directly on them; please refer to tf.dispose.

You can also use tf.memory, which returns memory info at the current point in the program.
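
For example, a minimal standalone sketch of that pattern (independent of your reproduction code):

const tf = require('@tensorflow/tfjs-node');

// Intermediate tensors created inside tf.tidy() are disposed automatically;
// only the returned tensor survives the tidy scope.
const result = tf.tidy(() => {
    const x = tf.tensor1d([1, 2, 3]);
    const y = x.square();   // intermediate tensor, cleaned up by tidy
    return y.sum();         // returned tensor is kept alive
});

console.log('sum of squares:', result.dataSync()[0]);   // 14
result.dispose();   // dispose the kept tensor once it is no longer needed

// tf.memory() reports what the backend currently holds; if numTensors keeps
// growing between iterations, something is not being disposed.
console.log(tf.memory().numTensors, tf.memory().numBytes);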

Could you please give it a try after adding tf.tidy and tf.dispose to your code and see whether the memory leak still happens?

If I have missed something here, please let me know.

Thank you for your cooperation and patience.

borodadada commented 2 months ago

I don't understand where a memory leak could come from: all the algorithm needs to do is update the coefficients, then feed back and compare the result, so it should run endlessly without leaking. What you wrote has nothing to do with the crash, because the code is as simple as possible and contains nothing except the fit call, which is what crashes. Or maybe I am missing something.

mightyplow commented 1 month ago

You have to dispose of xs and ys after the fit step. Otherwise tfjs creates new tensors on every loop iteration, and they fill up memory step by step if you don't dispose of them.

borodadada commented 1 month ago

Using my example, can you show how it should look? The crash occurs while the fit function is executing; there is no loop, the data was declared once, and after that the training process started.

mightyplow commented 1 month ago

It should look like this:

let xs = tf.tensor2d( a );
let ys = tf.tensor2d( b );

await model.fit(xs, ys, {
    epochs: 50000000,
    shuffle: false,
    verbose: 0,
    callbacks:{
        onTrainBegin: ()=>{
            console.log('start')
        },
        onTrainEnd: ()=>{
            console.log('done')
        },
        onEpochEnd: async (epoch, logs)=>{
            if( epoch % 100000 === 0 )
                console.log(epoch, logs.loss)
        }
    }
})

xs.dispose();
ys.dispose();

This way the tensors become unusable and their memory is freed by tfjs.

By the way, that doesn't mean there isn't another memory leak somewhere. I stumbled across your comment because I'm also hunting a memory issue. But disposing of unused tensors will at least rule out one possible cause.
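
A quick way to check for that kind of leak (just a sketch; it assumes the model and the a/b arrays from the test code above and must run inside an async function) is to compare tf.memory().numTensors across iterations:

// Sketch: if the tensor count keeps climbing from one iteration to the next,
// something inside the iteration is not being disposed.
const before = tf.memory().numTensors;

const xs = tf.tensor2d(a);
const ys = tf.tensor2d(b);
await model.fit(xs, ys, { epochs: 1, verbose: 0 });
xs.dispose();
ys.dispose();

const after = tf.memory().numTensors;
console.log('tensor delta for this iteration:', after - before);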

borodadada commented 1 month ago

That is exactly what I am writing about: execution never gets that far. The program crashes inside model.fit:

await model.fit(xs, ys, {
    epochs: 50000000,
    shuffle: false,
    verbose: 0,
    callbacks:{
        onTrainBegin: ()=>{
            console.log('start')
        },
        onTrainEnd: ()=>{
            console.log('done')
        },
        onEpochEnd: async (epoch, logs)=>{
            if( epoch % 100000 === 0 )
                console.log(epoch, logs.loss)
        }
    }
})

This part of the code never runs, because it would only execute after the fit call returns:

xs.dispose();
ys.dispose();

If you have a chance to run the code, you will see it for yourself.

I'll now run the test with your amendments and report back a little later.

mightyplow commented 1 month ago

Oh sorry, my fault. I didn't notice the number of epochs. Then you're right, and it does look like an internal problem. I'll try it out and see what happens.

borodadada commented 1 month ago

Thank you, I'll wait for your results; this problem is bothering me a lot.