microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
14.02k stars 1.81k forks

Node server crash #5624

Open XiaoXiao-Woo opened 1 year ago

XiaoXiao-Woo commented 1 year ago

[image]

Besides, when I use NNI to connect to another machine (it can connect to itself with the "remote" platform), the same problem occurs: "FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory"

Environment:

Configuration:

Log message:

How to reproduce it?:

studywolf commented 1 year ago

I'm also getting this error consistently when the number of trials in an experiment gets to about 40,000. It has happened on more than 5 different experiments.

NNI version: master
Training service (local|remote|pai|aml|etc): local, and reusemode=False
Client OS: ubuntu 22.04.2 LTS
Server OS (for remote mode only): n/a
Python version: 3.7.4
PyTorch/TensorFlow version: n/a
Is conda/virtualenv/venv used?: conda
Is running in Docker?: no

liuzhe-lz commented 1 year ago

You can set export NODE_OPTIONS="--max_old_space_size=8192" for a quick fix.
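
A minimal sketch of how the suggestion above might be applied, assuming the experiment is launched with nnictl from the same shell so that the Node.js process inherits the variable. The config.yml path and the experiment ID are placeholders, not values from this thread, and the node check is optional and only runs if a node binary is on the PATH.

```shell
# Raise the V8 old-generation heap limit (in MiB) for Node.js processes
# started from this shell. 8192 is the value suggested above.
export NODE_OPTIONS="--max_old_space_size=8192"

# Optional check: print the heap limit Node.js actually picked up, in MiB.
if command -v node >/dev/null 2>&1; then
  node -e 'console.log(Math.round(require("v8").getHeapStatistics().heap_size_limit / 1048576))'
fi

# Then start or resume the experiment from the same shell, for example:
#   nnictl create --config config.yml    # config.yml is a placeholder path
#   nnictl resume <experiment_id>        # <experiment_id> is a placeholder
```

This only raises the ceiling; if memory use keeps growing with the number of trials, the limit can still be reached later.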

studywolf commented 1 year ago

Ah, thanks for the suggestion! It spools up the run when I resume, but then immediately fails silently.

studywolf commented 1 year ago

any progress on this?

studywolf commented 1 year ago

checking in again

studywolf commented 1 year ago

> You can set export NODE_OPTIONS="--max_old_space_size=8192" for a quick fix.

Unfortunately, I'm still getting the heap out of memory error after 29k trials. This is in NNI 3.0.