harrytrinh2 closed this issue 5 years ago.
Hi @TrinhDinhPhuc,
Would you mind sharing with us some more details like:
Thanks.
Hi, I am using Ubuntu 18.04 and Python 3.6. This is my configuration file:
[ { "name": "census", "num_random_search": 10, "train_csv": "data/census-train.csv", "continuous_cols": [0, 2, 3, 4, 5], "epoch": 5, "steps_per_epoch": 10000, "output_epoch": 3, "sample_rows": 10000 } ]
I simply ran $ python3.6 src/launcher.py demo_config.json as instructed in the README. After 4 hours, it was training epoch 4,
but suddenly, it showed these lines:
[0327 14:23:24 @sessinit.py:90] WRN The following variables are in the checkpoint, but not found in the graph: global_step:0, optimize/beta1_power:0, optimize/beta2_power:0
25%|###############2 |2498/10000[27:36<1:08:27, 1.83it/s]2019-03-27 14:23:24.578021: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[0327 14:23:24 @sessinit.py:117] Restoring checkpoint from train_log/TGAN_synthesizer:KDD-2/model-50000 ...
25%|###############3 |2523/10000[27:50<1:07:24, 1.85it/s][0327 14:23:39 @logger.py:74] Argv: src/TGAN_synthesizer.py --batch_size 50 --z_dim 100 --num_gen_rnn 400 --num_gen_feature 300 --num_dis_layers 4 --num_dis_hidden 400 --learning_rate 0.001 --noise 0.1 --exp_name KDD-3 --max_epoch 5 --steps_per_epoch 10000 --data expdir/KDD/train.npz --gpu 0
25%|###############3 |2524/10000[27:51<1:11:50, 1.73it/s][0327 14:23:39 @develop.py:96] WRN [Deprecated] ModelDescBase._get_inputs() interface will be deprecated after 30 Mar. Use inputs() instead!
[0327 14:23:39 @input_source.py:221] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
[0327 14:23:39 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
[0327 14:23:39 @registry.py:121] gen/LSTM/00/FC input: [50, 400]
[0327 14:23:39 @registry.py:129] gen/LSTM/00/FC output: [50, 300]
[0327 14:23:39 @registry.py:121] gen/LSTM/00/FC2 input: [50, 300]
[0327 14:23:39 @registry.py:129] gen/LSTM/00/FC2 output: [50, 1]
WARNING:tensorflow:From src/TGAN_synthesizer.py:71: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead
Then, it automatically went back to epoch 1:
discrim/dis_fc_top/W:0 [410, 1] 410
discrim/dis_fc_top/b:0 [1] 1
Total #vars=92, #params=3934650, size=15.01MB
[0327 14:23:46 @base.py:187] Setup callbacks graph ...
[0327 14:23:46 @summary.py:38] Maintain moving average summary of 6 tensors in collection MOVING_SUMMARY_OPS.
[0327 14:23:46 @summary.py:75] Summarizing collection 'summaries' of size 9.
[0327 14:23:46 @graph.py:91] Applying collection UPDATE_OPS of 16 ops.
25%|###############4 |2540/10000[27:59<1:08:34, 1.81it/s][0327 14:23:48 @base.py:205] Creating the session ...
Here is the error:
2019-03-27 23:03:06.579435: W tensorflow/core/kernels/queue_base.cc:277] _0_QueueInput/input_queue: Skipping cancelled enqueue attempt with queue not closed
60%|########################################################################################################4 |3018/5000[46:27<25:53, 1.28it/s]Traceback (most recent call last):
File "src/TGAN_synthesizer.py", line 313, in <module>
sample(args.sample, Model(), args.load, output_filename=args.output)
File "src/TGAN_synthesizer.py", line 234, in sample
session_init=get_model_loader(model_path),
File "/home/harry/Documents/GANs-demo/TGAN-master/py36_env/lib/python3.6/site-packages/tensorpack/tfutils/sessinit.py", line 262, in get_model_loader
return SaverRestore(filename)
File "/home/harry/Documents/GANs-demo/TGAN-master/py36_env/lib/python3.6/site-packages/tensorpack/tfutils/sessinit.py", line 107, in __init__
model_path = get_checkpoint_path(model_path)
File "/home/harry/Documents/GANs-demo/TGAN-master/py36_env/lib/python3.6/site-packages/tensorpack/tfutils/varmanip.py", line 182, in get_checkpoint_path
assert tf.gfile.Exists(model_path) or tf.gfile.Exists(model_path + '.index'), model_path
AssertionError: train_log/TGAN_synthesizer:KDD2-2/model-0
my config file: [ { "name": "KDD2", "num_random_search": 10, "train_csv": "data/KDD2.csv", "continuous_cols": [0, 2, 3, 4, 5], "epoch": 2, "steps_per_epoch": 5000, "output_epoch": 3, "sample_rows": 5000 } ]
Hi @TrinhDinhPhuc,
Regarding your first question:
Why did this problem occur? Please explain it to me.
There is no problem, nor was the model retrained. Let's see what happened:
[
{
"name": "census",
"num_random_search": 10, # num_random_search: iterations of random hyper parameter search.
"train_csv": "data/census-train.csv",
"continuous_cols": [0, 2, 3, 4, 5],
"epoch": 5,
"steps_per_epoch": 10000,
"output_epoch": 3,
"sample_rows": 10000
}
]
You are running 10 parallel random searches of model hyperparameters, according to the parameter num_random_search. That is, training and evaluating different model instances with different sets of hyperparameters, to find the best ones for your given dataset. And this message:
[0327 14:22:41 @base.py:264] Training has finished!
simply means that one of the training cycles of the random search has finished. The output that follows, produced when the next iteration of the hyperparameter search loop starts, is what I think may have led you to believe that the model was being retrained.
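To make the loop behavior concrete, here is a minimal sketch of how a random hyperparameter search behaves: each of the num_random_search iterations trains and evaluates a fresh model instance with freshly sampled hyperparameters, so seeing the epoch counter reset is expected. The function names and hyperparameter ranges below are illustrative, not TGAN's actual internals.

```python
import random

def sample_hyperparams(rng):
    """Draw one random hyperparameter combination (illustrative ranges)."""
    return {
        "learning_rate": rng.choice([1e-3, 2e-3]),
        "num_dis_layers": rng.randint(1, 5),
        "noise": rng.choice([0.05, 0.1, 0.2]),
    }

def random_search(num_random_search, train_fn, seed=0):
    """Train one fresh model per iteration; keep the best-scoring combination."""
    rng = random.Random(seed)
    results = []
    for i in range(num_random_search):
        params = sample_hyperparams(rng)
        score = train_fn(params)          # a full train + evaluate cycle;
        results.append((score, params))   # each one logs "Training has finished!"
    return max(results, key=lambda r: r[0])

# Example with a dummy scoring function standing in for real training:
best = random_search(10, lambda p: -p["num_dis_layers"])
```

Each call to train_fn here corresponds to one complete training run starting again from epoch 1, which matches the log you observed.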
Regarding your second question:
The issue here is that an experiment can't be run twice with the same name.
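As an illustration of why the name collides (a sketch of the general pattern, not TGAN's actual code): each run writes its logs and checkpoints under a directory derived from the "name" field, like the train_log/TGAN_synthesizer:KDD2-0 paths in your logs, so a second run with the same name lands on the first run's directory.

```python
import os

def experiment_dir(root, name, search_index):
    # Mirrors the naming visible in the logs: TGAN_synthesizer:<name>-<index>
    return os.path.join(root, "TGAN_synthesizer:%s-%d" % (name, search_index))

def start_experiment(root, name, search_index):
    """Refuse to reuse an existing experiment directory (illustrative guard)."""
    path = experiment_dir(root, name, search_index)
    if os.path.exists(path):
        raise RuntimeError("experiment %r already exists; pick a new name" % path)
    os.makedirs(path)
    return path
```

With such a layout, reusing a name makes the new run find (or trip over) the previous run's checkpoints, which is the kind of confusion you hit.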
Hi, based on your explanation, I modified my config file like this to check the code: [ { "name": "KDD2", "num_random_search": 1, "train_csv": "data/KDD2.csv", "continuous_cols": [0, 2, 3, 4, 5], "epoch": 1, "steps_per_epoch": 10, "output_epoch": 1, "sample_rows": 5 } ] However, the script got stuck and did not progress for a long time. I don't understand why it was restoring a checkpoint and why it took so much time:
[0330 12:43:42 @registry.py:129] discrim/dis_fc_top output: [50, 1]
[0330 12:43:42 @collection.py:145] New collections created in tower : tf.GraphKeys.REGULARIZATION_LOSSES
[0330 12:43:42 @collection.py:164] These collections were modified but restored in : (tf.GraphKeys.SUMMARIES: 0->2)
[0330 12:43:42 @sessinit.py:90] WRN The following variables are in the checkpoint, but not found in the graph: global_step:0, optimize/beta1_power:0, optimize/beta2_power:0
2019-03-30 12:43:42.856889: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[0330 12:43:43 @sessinit.py:117] Restoring checkpoint from train_log/TGAN_synthesizer:KDD2-0/model-10 ...
I checked the code; it was frozen at this line in the launcher.py file. Do you know why?
pool.map(worker, commands)
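For context on why the script looks frozen at that line, here is a minimal sketch (names illustrative, not TGAN's actual worker) of the launcher pattern around pool.map: the parent builds one command per hyperparameter combination and hands them to a worker pool, and pool.map blocks until every worker has returned. If a single training subprocess hangs, for example waiting on a checkpoint that was never written, the parent appears stuck on this one call.

```python
from multiprocessing import Pool

def worker(command):
    # The real launcher would spawn a training subprocess here;
    # this stand-in just echoes the command it received.
    return "ran: %s" % command

commands = ["train --exp KDD2-0", "train --exp KDD2-1"]

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # Blocks until *all* commands finish; one hung worker stalls everything.
        results = pool.map(worker, commands)
```

So the freeze is usually a symptom of a child process hanging, not a bug in pool.map itself.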
As you can see here, it seems like it was freezing at this line: sample(args.sample, Model(), args.load + ".index", output_filename=args.output). In the picture, it was "Restoring checkpoint from train_log/TGAN_synthesizer:KDD2-0/model-10 ...", but in the folder there was no file named model-10. I think the correct path is model-10.index, not model-10, right?
Hi @TrinhDinhPhuc,
Regarding your first question:
I checked the code; it was frozen at this line in the launcher.py file. Do you know why?
Looking at your configuration file and the code, there is something that may cause trouble:
[
{
...
"sample_rows": 5
}
]
We just found out that there is a bug which prevents TGAN from working properly when sampling a number of rows that is not an exact multiple of the batch_size.
To work around this problem, and since the only possible batch sizes that TGAN can use right now are 50, 100 and 200, please make sure to always request a number of rows to sample that is an exact multiple of 200.
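A small helper like the following (our suggestion, not part of TGAN) can round a requested row count up to the nearest safe multiple. Since the only batch sizes TGAN can currently use are 50, 100, and 200, any multiple of 200 is also a multiple of the other two, so rounding to 200 works for every configuration.

```python
def round_up_to_multiple(n, multiple=200):
    """Round n up to the nearest multiple; raises on non-positive input."""
    if n <= 0:
        raise ValueError("need a positive number of rows")
    return ((n + multiple - 1) // multiple) * multiple

# round_up_to_multiple(5)    -> 200
# round_up_to_multiple(5000) -> 5000 (already a multiple of 200)
```

So in your test config, "sample_rows": 5 would become 200, which avoids the bug.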
For the second question:
As you can see here, it seems like it was freezing at this line: sample(args.sample, Model(), args.load + ".index", output_filename=args.output).
Well, I can't see it from your screenshots, but this is one of the potential consequences of the bug explained above.
I think the correct path is model-10.index, not model-10, right?
Yes, indeed.
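As background on why both forms appear (a sketch of the check from the traceback above, not tensorpack's exact code): a TensorFlow 1.x checkpoint path such as model-10 is a prefix, backed on disk by files like model-10.index and model-10.data-00000-of-00001. Loaders take the prefix, which is why the assertion accepts either the prefix itself or prefix + ".index".

```python
import os

def checkpoint_exists(prefix):
    """True if a checkpoint prefix resolves to real files on disk."""
    # Mirrors: assert tf.gfile.Exists(model_path) or
    #          tf.gfile.Exists(model_path + '.index'), model_path
    return os.path.exists(prefix) or os.path.exists(prefix + ".index")
```

In your case the AssertionError fired because neither model-0 nor model-0.index existed at all, since no checkpoint had been written for that run.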
Also, I'm not sure if you're aware, but any synthesized data will be stored in TGAN-master/exp_dir/KDD2/.