melodyguan / enas

TensorFlow Code for paper "Efficient Neural Architecture Search via Parameter Sharing"
https://arxiv.org/abs/1802.03268
Apache License 2.0
1.58k stars 390 forks

How much memory to run the cifar10 example #112

Open qtz93 opened 5 years ago

qtz93 commented 5 years ago

The following is the output printed when I ran "./scripts/cifar10_macro_search.sh" (the process was killed):

jovyan@219394b7f111$ ./scripts/cifar10_macro_search.sh 
--------------------------------------------------------------------------------
Path outputs exists. Remove and remake.
--------------------------------------------------------------------------------
Logging to outputs/stdout
--------------------------------------------------------------------------------
batch_size...................................................................128
child_block_size...............................................................3
child_cutout_size...........................................................None
child_drop_path_keep_prob....................................................0.6
child_filter_size..............................................................5
child_fixed_arc.............................................................None
child_grad_bound.............................................................5.0
child_keep_prob..............................................................0.9
child_l2_reg.............................................................0.00025
child_lr.....................................................................0.1
child_lr_T_0..................................................................10
child_lr_T_mul.................................................................2
child_lr_cosine.............................................................True
child_lr_dec_every...........................................................100
child_lr_dec_rate............................................................0.1
child_lr_max................................................................0.05
child_lr_min..............................................................0.0005
child_num_aggregate.........................................................None
child_num_branches.............................................................6
child_num_cells................................................................5
child_num_layers..............................................................12
child_num_replicas.............................................................1
child_out_filters.............................................................36
child_out_filters_scale........................................................1
child_skip_pattern..........................................................None
child_sync_replicas........................................................False
child_use_aux_heads.........................................................True
controller_bl_dec...........................................................0.99
controller_entropy_weight.................................................0.0001
controller_forwards_limit......................................................2
controller_keep_prob.........................................................0.5
controller_l2_reg............................................................0.0
controller_lr..............................................................0.001
controller_lr_dec_rate.......................................................1.0
controller_num_aggregate......................................................20
controller_num_replicas........................................................1
controller_op_tanh_reduce....................................................2.5
controller_search_whole_channels............................................True
controller_skip_target.......................................................0.4
controller_skip_weight.......................................................0.8
controller_sync_replicas....................................................True
controller_tanh_constant.....................................................1.5
controller_temperature......................................................None
controller_train_every.........................................................1
controller_train_steps........................................................50
controller_training.........................................................True
controller_use_critic......................................................False
data_format.................................................................NCHW
data_path...........................................................data/cifar10
eval_every_epochs..............................................................1
log_every.....................................................................50
num_epochs...................................................................310
output_dir...............................................................outputs
reset_output_dir............................................................True
search_for.................................................................macro
--------------------------------------------------------------------------------
Reading data
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
test_batch
Prepropcess: [subtract mean], [divide std]
mean: [125.34512 122.94169 113.83898]
std: [63.02383 62.13708 66.74233]
--------------------------------------------------------------------------------
Build model child
Build data ops
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/models.py:83: shuffle_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size)`.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/input.py:753: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/input.py:753: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/input.py:861: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/tensor_array_ops.py:162: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/models.py:125: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
--------------------------------------------------------------------------------
Building ConvController
--------------------------------------------------------------------------------
Build controller sampler
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_controller.py:157: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_controller.py:158: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_controller.py:238: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
--------------------------------------------------------------------------------
Build train graph

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_child.py:578: average_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.average_pooling2d instead.
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_child.py:581: max_pooling2d (from tensorflow.python.layers.pooling) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.max_pooling2d instead.
Tensor("child/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_1/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_2/skip/bn/Identity:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child/layer_3/pool_at_3/from_4/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_4/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_5/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_6/skip/bn/Identity:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child/layer_7/pool_at_7/from_8/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_8/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_9/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_10/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child/layer_11/skip/bn/Identity:0", shape=(?, 36, 8, 8), dtype=float32)
WARNING:tensorflow:From /home/jovyan/sources/enas/src/cifar10/general_child.py:233: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Model has 697860 params
--------------------------------------------------------------------------------
Build valid graph
Tensor("child_1/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_1/layer_1/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_1/layer_2/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_1/layer_3/pool_at_3/from_4/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_4/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_5/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_6/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_1/layer_7/pool_at_7/from_8/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_8/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_9/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_10/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_1/layer_11/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
--------------------------------------------------------------------------------
Build test graph
Tensor("child_2/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_2/layer_1/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_2/layer_2/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_2/layer_3/pool_at_3/from_4/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_4/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_5/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_6/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_2/layer_7/pool_at_7/from_8/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_8/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_9/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_10/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_2/layer_11/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
--------------------------------------------------------------------------------
Build valid graph on shuffled data
Tensor("child_3/layer_0/case/cond/Merge:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_3/layer_1/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_3/layer_2/skip/bn/FusedBatchNorm:0", shape=(?, 36, 32, 32), dtype=float32)
Tensor("child_3/layer_3/pool_at_3/from_4/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_4/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_5/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_6/skip/bn/FusedBatchNorm:0", shape=(?, 36, 16, 16), dtype=float32)
Tensor("child_3/layer_7/pool_at_7/from_8/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_8/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_9/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_10/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
Tensor("child_3/layer_11/skip/bn/FusedBatchNorm:0", shape=(?, 36, 8, 8), dtype=float32)
--------------------------------------------------------------------------------
<tf.Variable 'controller/lstm/layer_0/w:0' shape=(128, 256) dtype=float32_ref>
<tf.Variable 'controller/g_emb:0' shape=(1, 64) dtype=float32_ref>
<tf.Variable 'controller/emb/w:0' shape=(6, 64) dtype=float32_ref>
<tf.Variable 'controller/softmax/w:0' shape=(64, 6) dtype=float32_ref>
<tf.Variable 'controller/attention/w_1:0' shape=(64, 64) dtype=float32_ref>
<tf.Variable 'controller/attention/w_2:0' shape=(64, 64) dtype=float32_ref>
<tf.Variable 'controller/attention/v:0' shape=(64, 1) dtype=float32_ref>
WARNING:tensorflow:From /home/jovyan/sources/enas/src/utils.py:231: SyncReplicasOptimizer.__init__ (from tensorflow.python.training.sync_replicas_optimizer) is deprecated and will be removed in a future version.
Instructions for updating:
The `SyncReplicaOptimizer` class is deprecated. For synchrononous training, please use [Distribution Strategies](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py:1294: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
--------------------------------------------------------------------------------
Starting session
2019-11-12 08:28:15.808271: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-12 08:28:16.940255: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557a92d62660 executing computations on platform CUDA. Devices:
2019-11-12 08:28:16.940310: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-11-12 08:28:16.940324: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-11-12 08:28:17.072808: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3597780000 Hz
2019-11-12 08:28:17.073571: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x557a92d56010 executing computations on platform Host. Devices:
2019-11-12 08:28:17.073612: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-11-12 08:28:17.073827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 5.08GiB
2019-11-12 08:28:17.073934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 5.21GiB
2019-11-12 08:28:17.074477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1
2019-11-12 08:28:17.092715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-12 08:28:17.092752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 1 
2019-11-12 08:28:17.092768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N Y 
2019-11-12 08:28:17.092779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1:   Y N 
2019-11-12 08:28:17.093006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4900 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2019-11-12 08:28:17.093536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 5038 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
WARNING:tensorflow:From /opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:809: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
./scripts/cifar10_macro_search.sh: line 40: 10644 Killed                  python src/cifar10/main.py --data_format="NCHW" --search_for="macro" --reset_output_dir --data_path="data/cifar10" --output_dir="outputs" --batch_size=128 --num_epochs=310 --log_every=50 --eval_every_epochs=1 --child_use_aux_heads --child_num_layers=12 --child_out_filters=36 --child_l2_reg=0.00025 --child_num_branches=6 --child_num_cell_layers=5 --child_keep_prob=0.90 --child_drop_path_keep_prob=0.60 --child_lr_cosine --child_lr_max=0.05 --child_lr_min=0.0005 --child_lr_T_0=10 --child_lr_T_mul=2 --controller_training --controller_search_whole_channels --controller_entropy_weight=0.0001 --controller_train_every=1 --controller_sync_replicas --controller_num_aggregate=20 --controller_train_steps=50 --controller_lr=0.001 --controller_tanh_constant=1.5 --controller_op_tanh_reduce=2.5 --controller_skip_target=0.4 --controller_skip_weight=0.8 "$@"

My environment is configured as follows:

- OS: CentOS 7
- GPU: NVIDIA GeForce GTX 1080 Ti × 2
- Memory: 16 GB
- TensorFlow: 1.13.1 (GPU)
- Python: 3.6
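A bare "Killed" with no Python traceback usually means the kernel's OOM killer terminated the process rather than TensorFlow raising an error. A quick way to check on a standard Linux host (these are generic commands, not part of the ENAS repo, and `dmesg` may need root):

```shell
# Look for recent OOM kills in the kernel log; harmless if nothing matches.
dmesg 2>/dev/null | grep -i -E 'killed process|out of memory' | tail -n 5 || true

# Watching host memory while the script runs also confirms the problem.
free -h
```

If the kernel log shows the Python process being killed around the time the script dies, the machine simply ran out of RAM.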

qtz93 commented 5 years ago

???

mkamein commented 4 years ago

Hey! I think I am facing a similar problem. My script gets killed after the "Starting session" section, but no actual errors are displayed. Did you manage to solve this problem?

racheljose21 commented 4 years ago

Does anyone have a solution? I'm facing the same issue

nott0 commented 4 years ago

I believe you don't have enough memory. Try increasing it to at least 32 GB, or reduce the amount of training data in data_utils.py.
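One way to reduce the training data is to keep only a random slice of the arrays after they are loaded. This is a hedged sketch, not code from the repo: the helper name and the idea of slicing inside the data-loading code are assumptions about how `data_utils.py` is organized.

```python
import numpy as np

def subsample(images, labels, keep=10000, seed=0):
    """Keep only `keep` randomly chosen training examples.

    `images` and `labels` are parallel numpy arrays (first axis = example
    index), as produced by the CIFAR-10 loading code.
    """
    rng = np.random.RandomState(seed)
    n = min(keep, len(images))
    idx = rng.choice(len(images), size=n, replace=False)
    return images[idx], labels[idx]

# Hypothetical usage inside the data-loading code, after the full train
# arrays are built:
# images["train"], labels["train"] = subsample(images["train"], labels["train"])
```

Cutting the training set shrinks the in-RAM copies of the data (and the shuffle queue built on top of them) at the cost of search quality, so it is a workaround rather than a fix; adding RAM is the cleaner option.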