Log mechanism discussion

cruiseliu commented 2 years ago

Current Log Behavior

Due to a bug, the experiment log is written to dispatcher.log.

Component	Log Behavior	Log Level	Detected by	Note
Experiment management API	Log → stdout & dispatcher.log	Aware `debug`	Explicit setup	`nni.*` are colorful
nnictl (create, resume, view)	Log → stdout & dispatcher.log	Aware `debug`	Explicit setup	Same as Experiment
nnictl (Other commands)	Log → stdout	INFO	(default)	Never used
Trials (local)	Log & stdout → trial.log	INFO	NNI_PLATFORM
Trials (reuse)	Log → stdout	INFO	REUSE_MODE	Same as default
Standalone (debug trial)	Log → stdout	INFO	(default)
Tuner	Log → dispatcher.log	Aware `logLevel`	SDK_PROCESS
NAS main process	Log → stdout	INFO	(default)
NAS multi-trial trials	Same as HPO trials	INFO	NNI_PLATFORM
Compression	Log → stdout	INFO	(default)

Expected Log Behavior

Component	Log Behavior	Log Level	Detected by
Experiment management API	NNI log → stdout & experiment.log	Aware `logLevel`	Explicit setup
NAS main process	Same as experiment	-	-
nnictl (create, resume, view)	Same as experiment	-	-
nnictl (Other commands)	Does not use logging; don't touch	-	-
Trials (local)	All log → trial.log & stdout → trial.stdout	Aware `logLevel`?	NNI_PLATFORM
NAS multi-trial trials	Same as HPO trials	-	-
Standalone (debug trial)	NNI log → stdout	INFO	Trial APIs
Tuner	All log → dispatcher.log	Aware `logLevel`	SDK_PROCESS
Compression	?	?	?

NNI manager log: ?
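The "Detected by" column can be read as a small decision procedure. The sketch below is hypothetical: it assumes the environment variables named in the table are only checked for presence (their values are ignored), and the check order is an assumption. `detect_component` and its return strings are illustrative names, not NNI's API.

```python
import os


def detect_component(explicit_setup: bool = False) -> str:
    """Guess which component the current Python process is (sketch of the 'Detected by' column)."""
    if explicit_setup:
        # Experiment management API, NAS main process, nnictl create/resume/view
        return 'experiment'
    if 'SDK_PROCESS' in os.environ:
        # Tuner (dispatcher) process
        return 'tuner'
    if 'NNI_PLATFORM' in os.environ:
        # Trials launched by a training service, including NAS multi-trial trials
        return 'trial'
    # No marker at all: a standalone (debug) trial that called the trial APIs
    return 'standalone'
```

Checking the tuner marker before the trial marker is a choice made here so that a process carrying both variables is treated as the tuner.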

QuanluZhang commented 2 years ago

- [ ] Double-check the stdout/stderr (log) of trials in remote mode and all other training services.
  - [ ] Are all trials' logs written to one log file?
- [ ] Double-check how to avoid redirecting stdout/stderr to the log.
- [ ] Survey how to output training progress: print or logging?
- [ ] Start from the "NAS main process": the experiment sets up a log handler for `nni`.

liuzhe-lz commented 2 years ago

Discussion iteration 1 conclusions:

Log files

Each HPO experiment writes 3 files:

~/nni-experiments/EXPERIMENT-ID/logs/experiment.log
~/nni-experiments/EXPERIMENT-ID/logs/nnimanager.log
~/nni-experiments/EXPERIMENT-ID/logs/dispatcher.log

Each NAS multi-trial experiment writes 2 files:

~/nni-experiments/EXPERIMENT-ID/logs/experiment.log
~/nni-experiments/EXPERIMENT-ID/logs/nnimanager.log

Each NAS oneshot experiment writes 1 file:

~/nni-experiments/EXPERIMENT-ID/logs/experiment.log

Auto-compress works like HPO; other compression experiments do not write log files.
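The file layout above can be captured as a lookup table. A minimal sketch; the kind keys and `log_paths` are hypothetical names for illustration, and the default root is taken from the paths listed.

```python
from pathlib import Path
from typing import Dict, List

# Log files per experiment kind, following the conclusions above.
# The kind keys are hypothetical labels, not NNI identifiers.
LOG_FILES: Dict[str, List[str]] = {
    'hpo': ['experiment.log', 'nnimanager.log', 'dispatcher.log'],
    'auto_compress': ['experiment.log', 'nnimanager.log', 'dispatcher.log'],  # works like HPO
    'nas_multi_trial': ['experiment.log', 'nnimanager.log'],
    'nas_oneshot': ['experiment.log'],
    'compression': [],  # other compression experiments write no log files
}


def log_paths(kind: str, experiment_id: str, root: Path = Path.home() / 'nni-experiments') -> List[Path]:
    """Return the log files an experiment of the given kind is expected to write."""
    log_dir = root / experiment_id / 'logs'
    return [log_dir / name for name in LOG_FILES[kind]]
```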

Log content

A log message should be written to stdout, if and only if:

Its logger name starts with `nni.`
Its log level is INFO or above
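One way to express the stdout rule with the standard `logging` module: a handler-level filter checks the name prefix and the handler level enforces INFO. `PrefixFilter` and `setup_stdout_handler` are hypothetical names, and attaching the handler to the `nni` logger is one possible choice. The built-in `logging.Filter('nni')` would also pass records from the `nni` logger itself, which the literal rule excludes, hence the small custom filter.

```python
import logging
import sys
from typing import TextIO


class PrefixFilter(logging.Filter):
    """Pass only records whose logger name starts with `prefix`."""

    def __init__(self, prefix: str) -> None:
        super().__init__()
        self.prefix = prefix

    def filter(self, record: logging.LogRecord) -> bool:
        return record.name.startswith(self.prefix)


def setup_stdout_handler(stream: TextIO = sys.stdout) -> logging.Handler:
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.INFO)           # rule 2: INFO or above
    handler.addFilter(PrefixFilter('nni.'))  # rule 1: logger name starts with "nni."
    nni_logger = logging.getLogger('nni')
    nni_logger.setLevel(logging.DEBUG)       # DEBUG records may still reach other handlers
    nni_logger.addHandler(handler)
    return handler
```

Setting the `nni` logger to DEBUG keeps the logger from discarding records early, so each handler (stdout here, a log file elsewhere) can apply its own threshold.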

A log message should be written to "experiment.log", if and only if:

Its logger name starts with `nni.exp_ID.` (the format of `exp_ID` is not yet decided)
Its log level is `experiment.config.log_level` or above
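A sketch of the experiment.log rule. It assumes logger names of the form `nni.<experiment_id>.<module>` (the real `exp_ID` format is the undecided part) and that `log_level` is a standard `logging` level name such as `info` or `debug`; any non-standard level would first need to be registered with `logging.addLevelName`. The function name and format string are hypothetical.

```python
import logging


def setup_experiment_log(experiment_id: str, log_path: str, log_level: str = 'info') -> logging.Handler:
    # Hypothetical naming scheme "nni.<experiment_id>.<module>"; the real exp_ID format is undecided.
    prefix = f'nni.{experiment_id}.'
    handler = logging.FileHandler(log_path, encoding='utf-8')
    # Assumes log_level is a standard level name ("debug", "info", ...).
    handler.setLevel(log_level.upper())
    # Since Python 3.2 a plain callable can serve as a filter.
    handler.addFilter(lambda record: record.name.startswith(prefix))
    handler.setFormatter(logging.Formatter('[%(asctime)s] %(levelname)s (%(name)s) %(message)s'))
    nni_logger = logging.getLogger('nni')
    nni_logger.setLevel(logging.DEBUG)  # level filtering happens per handler
    nni_logger.addHandler(handler)
    return handler
```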

A log message should be written to "dispatcher.log", if and only if:

Its logger name starts with `nni.`
Its log level is `experiment.config.log_level` or above

The above rules apply to all Python modules.
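Because the rules are keyed on logger names, a module needs nothing NNI-specific to be routed: `logging.getLogger(__name__)` inside the `nni` package already yields a name starting with `nni.`. A sketch of the dispatcher-side handler under that assumption; the function name and format string are hypothetical, and `log_level` is assumed to be a standard level name.

```python
import logging


def setup_dispatcher_log(log_path: str, log_level: str = 'info') -> logging.Handler:
    handler = logging.FileHandler(log_path, encoding='utf-8')
    handler.setLevel(log_level.upper())  # assumes a standard level name
    # Same prefix rule as stdout; only the threshold differs.
    handler.addFilter(lambda record: record.name.startswith('nni.'))
    handler.setFormatter(logging.Formatter('[%(asctime)s] %(levelname)s (%(name)s) %(message)s'))
    nni_logger = logging.getLogger('nni')
    nni_logger.setLevel(logging.DEBUG)  # let each handler decide what to drop
    nni_logger.addHandler(handler)
    return handler
```

Third-party loggers (for example `numpy` or `torch`) never match the prefix, so they stay out of dispatcher.log regardless of level.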

How to access experiment ID

Ideally, a module "inside" an experiment (like a NAS strategy) should access the experiment ID via `experiment.logger`; if it does not have a reference to the experiment, there will be a way to infer the ID automatically.

We assume each thread runs only one experiment at a time. Each thread will keep a stack (or priority queue) of the experiments most recently activated in that thread; the top of the stack is considered the "current experiment".

If a strategy (or something similar) is guaranteed to run in a separate thread, the inference should be reliable; otherwise it should be considered a last resort. In short, it depends on the `RetiariiExperiment` implementation.
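A minimal sketch of the per-thread "current experiment" inference, using a stack stored in `threading.local()` and a context manager so that activations always nest. All names here, including the `nni.<experiment_id>.<name>` logger naming, are hypothetical.

```python
import logging
import threading
from contextlib import contextmanager
from typing import Iterator, List, Optional

_local = threading.local()


def _stack() -> List[str]:
    # Each thread lazily gets its own stack of activated experiment IDs.
    if not hasattr(_local, 'stack'):
        _local.stack = []
    return _local.stack


@contextmanager
def activate_experiment(experiment_id: str) -> Iterator[None]:
    """Mark `experiment_id` as this thread's current experiment while the block runs."""
    stack = _stack()
    stack.append(experiment_id)
    try:
        yield
    finally:
        stack.pop()


def current_experiment_id() -> Optional[str]:
    stack = _stack()
    return stack[-1] if stack else None


def experiment_logger(name: str) -> logging.Logger:
    """Last-resort lookup for code that holds no reference to its experiment."""
    exp_id = current_experiment_id()
    if exp_id is None:
        return logging.getLogger(f'nni.{name}')
    return logging.getLogger(f'nni.{exp_id}.{name}')
```

A strategy started in a new thread sees an empty stack unless that thread activates the experiment itself, which is why the inference is only reliable when the strategy's thread is set up for it.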

Trial log

Each trial has 3 output files:

stdout
stderr
trial.log

trial.log should only contain messages from loggers whose names start with `nni.`. The log level has not been discussed yet.
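A sketch of how the three outputs could be produced. It assumes the training service launches the trial as a subprocess and redirects its raw streams to separate files, while the trial process itself attaches a file handler for `nni.` loggers. Function names are hypothetical, and INFO is only a placeholder since the trial log level is undecided.

```python
import logging
import subprocess
from pathlib import Path
from typing import List


def launch_trial(command: List[str], trial_dir: Path) -> int:
    """Training-service side: keep the trial's raw stdout and stderr in separate files."""
    trial_dir.mkdir(parents=True, exist_ok=True)
    with open(trial_dir / 'stdout', 'w') as out, open(trial_dir / 'stderr', 'w') as err:
        return subprocess.run(command, stdout=out, stderr=err).returncode


def setup_trial_log(trial_dir: Path, level: int = logging.INFO) -> logging.Handler:
    """Trial side: only records from loggers named "nni.*" go to trial.log."""
    handler = logging.FileHandler(trial_dir / 'trial.log', encoding='utf-8')
    handler.setLevel(level)  # placeholder: the trial log level is not decided yet
    handler.addFilter(lambda record: record.name.startswith('nni.'))
    nni_logger = logging.getLogger('nni')
    nni_logger.setLevel(logging.DEBUG)
    nni_logger.addHandler(handler)
    return handler
```

Keeping stdout separate from trial.log means `print` output from user code never mixes with NNI's own log records.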
