tensorflow / tfjs

A WebGL accelerated JavaScript library for training and deploying ML models.
https://js.tensorflow.org
Apache License 2.0
18.29k stars 1.92k forks source link

knn-classifier memory leak (peak bytes) #8159

Open grant-wilson opened 5 months ago

grant-wilson commented 5 months ago

System information

Describe the current behavior Running predictClass for first time with "many" example classes consumes all memory and kills node process.

Describe the expected behavior Running predictClass for first time with "many" example classes will consume fewer peak bytes during processing, allowing scripts to load more classes (depends on specific usage).

Standalone code to reproduce the issue https://stackblitz.com/edit/stackblitz-starters-gnuzh5?file=index.js

Other info / logs Include any logs or source code that would be helpful to Run locally I am seeing peak bytes grow quickly. With 16GiB RAM it cannot complete with 1000 classes. image

Proposed Fix I think the issue is in https://github.com/tensorflow/tfjs-models/blob/3a8b906f5a95f8a2d697c3896f20b7529cdeaf6f/knn-classifier/src/index.ts#L106-L109

        for (const label in this.classDatasetMatrices) {
          newTrainLogitsMatrix = concatWithNulls(
              newTrainLogitsMatrix, this.classDatasetMatrices[label]);
        }

Each concatWithNulls allocates a new tensor with the previous concatenation remaining in memory. Although this is called within a tidy so the memory will be released after the similarities have been computed, during computation a large footprint is required.

Instead, the tensors could be disposed of in the loop.

        for (const label in this.classDatasetMatrices) {
          const newTrainLogitsMatrixToBeDisposed = newTrainLogitsMatrix;
          newTrainLogitsMatrix = concatWithNulls(
              newTrainLogitsMatrix, this.classDatasetMatrices[label]);
          newTrainLogitsMatrixToBeDisposed?.dispose();
        }

With that change the peak bytes look much better. image

Although, better performance could probably be found by replacing the concat operation entirely....

grant-wilson commented 5 months ago

I am happy to create a pull request with tests if you're ok with my proposed approach.

gaikwadrahul8 commented 5 months ago

Hi, @grant-wilson

We deeply appreciate your patience and bringing the knn-classifier memory leak issue to our attention. We apologize for the delayed response. If you have discovered a potential solution or workaround, we highly encourage you to submit a pull request to our repository. Our team will thoroughly review your contribution and if it successfully addresses the issue without introducing new concerns will be happy to merge your pull request.

We value your collaboration in resolving this matter and look forward to seeing your potential solution. Thank you.

grant-wilson commented 5 months ago

I've created a PR. https://github.com/tensorflow/tfjs-models/pull/1345

grant-wilson commented 2 months ago

Any thoughts on merging my PR?