Currently, the DeText logger writes both network structure and evaluation metrics as follows:
For single-node training, a single logging.txt contains all logs.
For parameter server (worker + evaluator) training, logging.txt contains the network-related logs, while eval_log.txt contains the evaluation logs from the evaluator.
This PR consolidates the logging method and organizes the logging output more clearly:
Network-related information (trainable variables, their shapes, and total deep parameters) is logged in network_structure.txt.
Evaluation logs are written to eval_log.txt for both single-node training and parameter server multi-worker training.
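The split described above can be sketched with Python's standard logging module. This is a minimal illustration, not the DeText implementation: the helper names make_file_logger and log_network_structure, the logger names, and the exact line format are hypothetical. Only the two file names come from this PR.

```python
import logging
import os


def make_file_logger(name, out_dir, filename):
    """Return a logger that appends bare messages to out_dir/filename.

    Hypothetical helper for illustration; not part of DeText.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these lines out of the root logger
    path = os.path.abspath(os.path.join(out_dir, filename))
    # Avoid attaching a second handler for the same file on repeated calls.
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == path
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def log_network_structure(logger, variables):
    """Log each (name, shape) pair and the total parameter count.

    `variables` is an iterable of (name, shape) tuples; the line format
    used here is an assumption, not the format DeText writes.
    """
    total = 0
    for name, shape in variables:
        count = 1
        for dim in shape:
            count *= dim
        total += count
        logger.info("%s, shape=%s", name, tuple(shape))
    logger.info("total parameters: %d", total)
    return total
```

With this layout, the training code would create one logger for network_structure.txt and one for eval_log.txt in the output directory, and the same two files would be produced regardless of whether training runs on a single node or with a parameter server.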
Fixes # (issue)
AIAF-365
Type of change
[x] New feature (non-breaking change which adds functionality)
List all changes
Removed unnecessary prints
Used network_structure.txt for logging variable-related output
Removed logging.txt, since its useful logging has moved to network_structure.txt
Tidied eval logging in best_checkpoint_copier.py
Testing
Tested the training flow with run_detext.sh. The resulting log files are listed below:
$ ls *.txt
eval_log.txt network_structure.txt
$ cat eval_log.txt
***** Evaluation on dev set during training *****
## Step 2
loss : 1.2556911706924438
Checking checkpoint model.ckpt-2
keeping checkpoint model.ckpt-2 with metric/ndcg@10 = 0.7103099226951599
## Step 10
loss : 0.9746564030647278
Checking checkpoint model.ckpt-10
keeping checkpoint model.ckpt-10 with metric/ndcg@10 = 1.0
removing old checkpoint model.ckpt-2 with metric/ndcg@10 = 0.7103099226951599
***** Training finished. *****
***** Evaluation on test set with best exported model: *****
global_step = 10
loss = 0.9746564
metric/ndcg@10 = 1.0
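The "keeping" and "removing old" lines in the log above reflect a keep-the-best checkpoint policy. The sketch below shows one way such a policy could produce those lines. It is an assumption-laden illustration, not the code in best_checkpoint_copier.py: the function update_best_checkpoints, its parameters, and the default of keeping a single checkpoint are hypothetical, and it only builds log lines without touching any files.

```python
def update_best_checkpoints(kept, name, metric_value, max_to_keep=1,
                            metric_name="metric/ndcg@10"):
    """Decide which checkpoints to keep after a new evaluation.

    `kept` is a list of (metric_value, checkpoint_name) tuples already
    retained. Returns (new_kept, log_lines). Hypothetical helper; a
    higher metric is assumed to be better.
    """
    lines = ["Checking checkpoint {}".format(name)]
    candidate = (metric_value, name)
    # Stable sort: on ties, previously kept checkpoints stay ahead.
    ranked = sorted(kept + [candidate], key=lambda item: item[0], reverse=True)
    new_kept = ranked[:max_to_keep]
    dropped = ranked[max_to_keep:]
    if candidate in new_kept:
        lines.append("keeping checkpoint {} with {} = {}".format(
            name, metric_name, metric_value))
    for value, old_name in dropped:
        if old_name != name:  # the new checkpoint itself was never kept
            lines.append("removing old checkpoint {} with {} = {}".format(
                old_name, metric_name, value))
    return new_kept, lines
```

Calling it first with model.ckpt-2 and then with model.ckpt-10, using the metric values from the log, yields the same sequence of "Checking", "keeping", and "removing old" messages shown in eval_log.txt.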